Directions in Low-Power CAD

Dennis Sylvester
University of Michigan
[email protected]
http://vlsida.eecs.umich.edu

With acknowledgements to: Prof. David Blaauw, Dr. Sarvesh Kulkarni, Saumil Shah, Kavi Chopra


Topics
- A new dual-Vth assignment formulation
- Dual-Vdd power distribution
- Approaches to parametric yield optimization: statistical leakage + delay


Motivation
- We require high-performance yet low-power circuits
- Leakage power contributes significantly to total power
- An all-high-Vth implementation is too slow
- An all-low-Vth implementation is too leaky
- Dual-Vth processes are popular
[Figure source: S. Narendra et al. [ICCAD '03]]


Gate Sizing + Vth Assignment Problem: Problem Definition
- Minimize: total circuit power (switching + subthreshold leakage)
- Subject to: circuit delay constraint, sizing constraints
- Optimization variables: gate sizes, gate threshold voltages


Prior Work
- Traditionally a discrete problem
- Previous approaches
  - Separate sizing and Vth assignment
  - Mixed Integer Non-Linear Programming
  - Sensitivity-based methods (DUET, etc.)
  - Continuous formulation [Chen, ASP-DAC '05]: very reliant on the discretization heuristic


Proposed Approach: Self-snapping formulation
- Continuous formulation: a large variety of algorithms and powerful non-linear optimizers can be used
- The solution has almost all gates assigned to one of the two available threshold voltages
- The small fraction of gates with intermediate Vth's can be handled heuristically
- The discretization algorithm has negligible power impact and can be very simple


Proposed Approach: Mixed-Vth Gates
- Consider each gate to be a parallel combination of high-Vth and low-Vth gates
- RC delay model
    $D = R_{eff} C_l$
    $R_{eff} = \frac{(R_l / W_l)(R_h / W_h)}{R_l / W_l + R_h / W_h} = \frac{R_l R_h}{R_l W_h + R_h W_l}$
    $C_l = C_{Load} + K_{SL} (W_l + W_h)$
- Linear power model
    $P = P_{LVt} + P_{HVt} = P_l W_l + P_h W_h$
[Figure: HVt gate, LVt gate and the resulting mixed gate]


Complete Dual-Vth Problem Formulation
- Similar to the single-Vth gate sizing problem, with simple gate delays replaced by high-Vth/low-Vth parallel combinations
- Minimize
    $\sum_{i \in G} \left( P_{l,i} W_{l,i} + P_{h,i} W_{h,i} \right)$
- Subject to:
    $a_j \le A_0$
    $a_j + D_i \le a_i, \quad \forall i \in (\{1, \ldots, n\} \setminus \{inputs\}), \ \forall j \in \{input(i)\}$
    $D_i \le a_i, \quad \forall i \in \{inputs\}$
    $0 \le W_{l,i}, \quad i = 1, \ldots, n$
    $0 \le W_{h,i}, \quad i = 1, \ldots, n$
    $L_i \le W_{l,i} + W_{h,i} \le U_i, \quad i = 1, \ldots, n$


Proof of Discretized Solution
- Conceptually separate the optimization process into two distinct phases:
  - D-Phase: fix the delays of all gates
  - W-Phase: find the minimum-power sizing solution that satisfies the chosen D vector
- Hypothetical separation for the proof only; not implemented in the actual optimization procedure


W-Phase
- Proving a discrete optimal solution under an arbitrary D vector is sufficient
- W-Phase formulation
  - Minimize
      $\sum_{i \in G} \left( P_{l,i} W_{l,i} + P_{h,i} W_{h,i} \right)$
  - Subject to:
      $R_{l,i} W_{h,i} + R_{h,i} W_{l,i} \ge \frac{R_{l,i} R_{h,i}}{D_i} \left( \sum_{j \in fanout(i)} C_{inp,j} (W_{l,j} + W_{h,j}) + K_{SL} (W_{l,i} + W_{h,i}) \right), \quad i = 1, \ldots, n$
      $W_{l,i} \ge 0, \quad i = 1, \ldots, n$
      $W_{h,i} \ge 0, \quad i = 1, \ldots, n$


W-Phase
- Linear programming problem
- n basic variables, n non-basic variables
- Therefore, only n non-zero variables
- Every gate is snapped to either high-Vth or low-Vth
- Adding upper and lower bounds on total size leads to some non-snapped gates
- Their number is extremely small; a simple heuristic achieves good results


Practical Constraint: Fixed-Width Input Drivers
- Sequential elements drive the combinational circuit
- The delay of these elements is affected by primary input widths
- Modeled as fixed-width drivers


Extension of Discretization Analysis
- m+n constraints in the optimization problem (m = number of inputs)
- n+m basic variables, n-m non-basic variables
- Therefore, n+m positive variables
- Total number of non-snapped gates is bounded by the number of inputs
  - Once again, small in number; can be handled heuristically
  - In practice, the number of non-snapped gates is found to be much less than the number of inputs


Discretization Heuristics
- Iterative snapping
  - Round gates to the closer Vth and re-optimize until no non-snapped gates remain
- Single-pass Vth assignment
  - Fix all gates to the closer Vth and re-optimize only for gate sizes
- The second heuristic is faster, with negligible power impact


Results
- Snapping properties of some circuits
[Figure: % of total non-snapped gates and % of non-snapped gates due to fixed-width drivers vs. timing backoff (%) for c5315 and c7552; % of non-snapped gates and % due to input drivers per circuit for c2670, c3540, c5315, c6288, c7552, i8, i9, i10]
- The number of non-snapped gates is very small
- Dominated by gates at the upper and lower size bounds
- The approach is easily extendable to multi-Vth and multi-Lgate


Results
- Power and runtime comparisons between the proposed approach and a sensitivity-based algorithm (SBA) at 2% timing backoff (results shown for larger circuits only)
- Average: 31% leakage reduction vs. previous approaches

         SBA             Continuous formulation    % Improvement    Runtime (s)
Ckt      Static  Dyn.    Static  Dyn.   Total      Static  Total    SBA   Cont
C3540    0.26    0.74    0.16    0.78   0.94       38.14   6.46      28     51
C5315    0.22    0.78    0.15    0.80   0.95       30.53   5.11      52    133
C6288    0.35    0.65    0.26    0.65   0.91       24.69   9.04     136    443
C7552    0.31    0.69    0.24    0.68   0.91       23.93   8.87      94    171
i8       0.24    0.76    0.19    0.75   0.94       21.57   5.87      24     35
i9       0.20    0.80    0.16    0.77   0.94       17.65   6.47       9     21
i10      0.31    0.69    0.23    0.69   0.92       24.8    7.69     287    373


Topics
- A new dual-Vth assignment formulation
- Dual-Vdd power distribution
- Approaches to parametric yield optimization: statistical leakage + delay


Multiple supply design
[Figure: VDDH and VDDL gates between flip-flops (FF); a VDDL-swing input (IN) causes DC current, hence the need for level conversion]
- Relies on applying a lower supply (VDDL) to gates along non-critical paths, thus reducing power while meeting timing
- A flexible fine-grained VDD assignment scheme promises the best power reduction
  - Gate-level Extended Clustered Voltage Scaling
- However, physical design and power delivery are complicated


Implications of using multiple supplies
[Figure: coupled issues in multi-supply design. Labels: Circuits (level shifting), Algorithms (VDD assignment: CVS, ECVS), Physical design (VDD granularity: fine-grained, islanding), Power delivery (distribution, generation); critical and non-critical paths from IN to OUT]


Power delivery for dual-VDD circuits
- Power grid integrity is vital for circuit performance
- Dual-VDD circuits require two supply voltages for operation
- Fine-grained dual-VDD can place VDDL/VDDH gates arbitrarily on the die
- Implications at the board, package and die level
  - Fixed resources need to be split between VDDL and VDDH
  - However, the load on each supply is lower than on the original single supply
- The power supply current demanded by a dual-VDD circuit is significantly lower than that of the corresponding single-VDD circuit, allowing robust power delivery within available resources (decap, C4, wiring)


Reduced current load on VDDL/VDDH
- Gate level comparison: avg. 54% (33%) for VDDL = 0.8V (0.6V)

            Single-VDD            Dual-VDD: VDDL=0.8V    Dual-VDD: VDDL=0.6V
Gate        Low-VTH   High-VTH    Low-VTH   High-VTH     Low-VTH   High-VTH
INVX10      1.00      0.90        0.57      0.49         0.36      0.27
NAND2X2     1.00      0.85        0.54      0.45         0.34      0.23
NAND3X6     1.00      0.88        0.55      0.47         0.35      0.24
NOR2X1      1.00      0.86        0.52      0.39         0.30      0.19
NOR3X4      1.00      0.85        0.50      0.37         0.29      0.18
AVERAGE     1.00      0.88        0.54      0.44         0.33      0.23

- Circuit level comparison: avg. 49% (51%) and 28% (14%) for VDDH and VDDL for 0.8V (0.6V)

            Single VDD    Dual VDD: VDDL=0.8V    Dual VDD: VDDL=0.6V
Circuit     VDD           VDDH      VDDL         VDDH      VDDL
c880        9.7           5.6       2.2          5.9       1.3
c2670       23.6          11.9      6.5          10.1      3.0
c5315       36.7          20.9      7.2          20.9      3.6
c7552       47.9          13.9      19.4         20.4      8.5
AVERAGE %   100.0         48.5      27.7         50.7      13.5


ECVS Package level results
- Two VRMs on board to supply VDDL and VDDH
- Ground path can be shared by VDDL and VDDH
- Decoupling capacitance is divided in the ratio of current loads
[Figure: board/package/die power delivery model with separate VDDH and VDDL paths and loads; elements Lmb1, Rmb1, Lmb2, Rmb2, Lskt, Rskt, Cblk, Chf, Cpkg_cap, Lpkg, Rpkg, Rdie, Cdie. Source: Intel, "Intel Pentium 4 processor in the 432 pin/Intel 850 Chipset Platform," 2002.]

VDD or VDDH & VDDL
       Single-VDD     Dual-VDD, VDDL = 0.6V        Dual-VDD, VDDL = 0.8V
       VDD            VDDH          VDDL           VDDH          VDDL
       mV     %       mV     %      mV     %       mV     %      mV     %
PK     92.7   7.7     63.0   5.3    18.0   3.0     63.0   5.3    37.0   4.6
QS     65.0   5.4     34.0   2.8     9.0   1.5     32.0   2.7    18.0   2.3

GND
       Single-VDD     Dual-VDD, VDDL = 0.6V        Dual-VDD, VDDL = 0.8V
       VDD            VDDH          VDDL           VDDH          VDDL
       mV     %       mV     %      mV     %       mV     %      mV     %
PK     92.7   7.7     68.9   5.7    68.9  11.5     77.8   6.5    77.8   9.7
QS     65.0   5.4     40.7   3.4    40.7   6.8     46.0   3.8    46.0   5.7

- Similar power supply noise with the same resources as the single-VDD case (decoupling capacitance, C4s)


Dual-VDD physical design alternatives
[Figure: row layouts for single-VDD, dual-VDD segregated (separate VDDH and VDDL rows) and dual-VDD fine-grained (VDDH + VDDL rows); rails VDDH, VDDL, GND]
- Segregated placement constrains the placer, leading to higher core area and wirelength
C. Yeh, et al., "Layout techniques supporting the use of dual supply voltages for cell-based designs," Proc. DAC, 1999.
M. Igarashi, et al., "A low-power design method using multiple supply voltages," Proc. ISLPED, 1997.


Dual-VDD power grid alternatives
- Routing the power supply rails
[Figure: power grids for Single-VDD (VDD, GND), Dual-VDD Shared-GND (VDDH, VDDL, shared GND) and Dual-VDD Dual-GND (VDDH, GNDH, VDDL, GNDL); dual-VDD standard cell topologies: 3-rail cell (VDDH, VDDL, shared GND) and 4-rail cell (VDDH, GNDH, VDDL, GNDL)]
- Dual-VDD Dual-GND requires two separate grounds off-chip and complicates timing analysis and the design of the board itself
- Multi-rail standard cells can be used to realize the dual-VDD grids
  - This allows the placer to operate with no constraints


Dual-VDD on-chip power grid design
- Guidelines while designing the dual-VDD grid:
  - Scale wires with respect to the single-VDD grid, considering how the current demand has scaled
  - VDDL gates are more sensitive to grid noise; this is important since ground is shared
    - 120mV of noise is 10% for a 1.2V gate, but 20% for a 0.6V gate
  - Placement of VDDL and VDDH gates: assign more wiring resources to the VDDL grid in areas where there is more demand for VDDL current
  - Consider effects that arise from the board and package level, such as shared C4s
    - Fewer C4s lead to higher effective package R, L


Proposed technique D-Place
- Let $\alpha = I(VDDH)/I(VDD)$ and $\beta = I(VDDL)/I(VDD)$
- Scale wires as follows
[Equations: $W_{VDDH}$, $W_{VDDL}$ and $W_{GND}$ scaled from the single-VDD wire width $W$ using $\alpha$ and $\beta$; the exact expressions are not recoverable from the extracted text]
- Break down the die into "local" and "regional" areas
- Obtain effective $\alpha$ and $\beta$ as follows (likewise for $\beta$):
    $\alpha_{eff} = \frac{\alpha_{local} + \alpha_{regional} \frac{Area_{local}}{Area_{regional}} + \alpha_{global} \frac{Area_{local}}{Area_{global}}}{1 + \frac{Area_{local}}{Area_{regional}} + \frac{Area_{local}}{Area_{global}}}$
[Flow diagram, boxes as labeled: original single-VDD design; obtain dual-VDD design; obtain current consumption of single/dual-VDD designs (SPICE; single-VDD and dual-VDD Lib files); placement database (Cadence); partition the chip floorplan (local, regional, global); calculate local, regional, global and effective $\alpha$ and $\beta$ for each wire segment; size each wire segment in each local area using effective $\alpha$, $\beta$ and simulate the grid; measure voltage droop/bounce; measure wire congestion]


Peak voltage drop comparisons

VDDL = 0.6V
Circuit         Single VDD   DVDG     D-Vanilla   D-Place
c880    MAX     16.9%        30.9%    16.4%       18.6%
        AVG     9.5%         14.7%    9.6%        9.5%
c2670   MAX     25.6%        35.5%    32.2%       25.5%
        AVG     15.9%        19.8%    15.2%       14.5%
c5315   MAX     29.6%        38.2%    37.4%       32.0%
        AVG     21.6%        23.4%    20.2%       19.8%
c7552   MAX     26.8%        34.2%    34.5%       29.4%
        AVG     22.2%        21.0%    21.1%       18.7%

VDDL = 0.8V
Circuit         Single VDD   DVDG     D-Vanilla   D-Place
c880    MAX     16.9%        30.3%    16.3%       19.5%
        AVG     9.5%         15.9%    9.7%        9.8%
c2670   MAX     25.6%        36.1%    27.6%       27.0%
        AVG     15.9%        22.1%    15.8%       15.3%
c5315   MAX     29.6%        38.1%    33.0%       31.8%
        AVG     21.6%        25.4%    20.1%       20.3%
c7552   MAX     26.8%        31.4%    31.6%       28.7%
        AVG     22.2%        24.9%    22.3%       20.1%

- D-Place grids are better than single-VDD grids in AVG cases
  - Inferior by < 2.6% (≈15mV) in some MAX cases
- 0.6V VDDL is as robust as 0.8V
  - 0.6V also provides higher power savings
- The proposed approach is better by 2-7% (AVG) and 7-12% (MAX) compared to prior approaches


Voltage variation across die
- Voltage drop contours
[Figure: voltage drop contour maps over X Axis (mm) and Y Axis (mm), 0.0 to 0.7 mm, for the single-VDD grid and the D-Place dual-VDD grid]
- Wiring congestion is similar for dual-Vdd vs. single-Vdd grids
- Lower current demands can lead to smaller amounts of decoupling cap and lower leakage (or use the same decap for better performance)
- The dual-VDD grid is no less robust than the single-VDD grid


Topics
- A new dual-Vth assignment formulation
- Dual-Vdd power distribution
- Approaches to parametric yield optimization: statistical leakage + delay


Introduction
[Figure: mapping from process parameter space (optical proximity effects variation, chemical mechanical polishing variations) to chip performance space; low leakage with poor timing is timing yield loss, good timing with high leakage is power yield loss]
- This work: optimize the timing and power yield using gate sizing


Problem Description
- Nonlinear continuous optimization
  - Objective: maximize timing and power yield
  - Yield: a utility function defined w.r.t. the JPDF of leakage and timing
  - Decision variables: gate size
[Figure labels: $P_{const}$, $T_{const}$]
- Efficient implementation requires
  - Computing yield as a function of the decision variables (gate size)
  - Fast and accurate gradient computation


Power and Timing Yield Analysis (see DAC05 for more detail)
- Timing analysis [Sapatnekar03, Chandu05], giving $(\mu_d, \sigma_d)$
    $Delay = d_0 + \sum_{i=1}^{n} d_i X_i + d_{n+1} R$
- Power analysis, giving $(\mu_l, \sigma_l)$
    $\log Leakage = l_0 + \sum_{i=1}^{n} l_i X_i + l_{n+1} R$
- Delay correlation (1 parameter)
    $\rho = \frac{\sum_{i=1}^{n} d_i l_i}{\sigma_d \sigma_l}$
- Delay and power bivariate JPDF $(\mu_d, \sigma_d, \mu_l, \sigma_l, \rho)$
[Figure: bivariate distribution over delay and Log(Leakage)]


Cut Set SSTA: Intuition
- Consider the timing graph
- Cut Edge Time (CT), obtained from Arrival Time (AT) and Required Arrival Time (RT)
- Max Cut Edge Time (CT)
[Figure: timing graph with nodes 1 to 10; gate 7 is sized up; traditional incremental timing vs. a cut set with unperturbed left and right sub-graphs]
- If Forward SSTA [relation symbol lost in extraction] Reverse SSTA, then Cut Set SSTA gives exactly the same sensitivities as the naïve approach, which recomputes yield over all nodes even though most are unchanged


Statistical Yield Optimization Results
- Constraints: $D < D_{\mu,initial}$, $P < P_{\mu,initial}$

Circuit   Yield without L (%)   Yield with L (%)
c432      45.4                  80.2
c499      39.2                  59.0
c880      49.3                  83.2
c1908     47.9                  82.8
c2670     51.1                  85.3
c3540     51.2                  87.1
c5315     50.0                  87.3
c6288     50.3                  86.5
c7552     51.2                  80.8

- Initial yield ~0-2% due to inverse correlation
- Gate sizing alone provides good improvements
- Combined with Lgate biasing, provides outstanding results
Chopra, et al., ICCAD05


Another approach to statistical optimization
- General statistical optimization
  - The method relies on efficient deterministic formulations and variation space sampling to drive statistical optimization
  - Applicable to many mainstream VLSI design problems: gate sizing, Vth assignment, Leff biasing, as well as potential new levers


Statistically Optimized Body Bias Clustering for Post-Silicon Tuning
- Concept: speed up critical gates using FBB and slow down non-critical gates using RBB to meet timing and power constraints
    $V_{th} = V_{th0} + \gamma \left( \sqrt{2\phi_F + V_{sb}} - \sqrt{2\phi_F} \right)$
- Traditional view: centralized body bias generator controlling different die regions
  - Ineffective for compensating intra-die variations
  - Highly suboptimal power
[Figure: BB controller driving critical and non-critical regions]


Coarse Body Bias Assignment
[Figure: "ONE BIAS FOR ALL GATES"; frequency histograms of delay and power, correlated]
- Simplified assignment minimizing routing overheads
- Biasing is dictated by placement instead of gate criticality
- Disregards the complex dependence of gate criticality on:
  - Circuit topology
  - Correlations in process variations
- Effective in tightening delay but leads to high power
- Important to cluster gates to leverage ABB effectively


Proposed New Optimization Framework
- Generate sample scenarios
- Solve BB assignment for each scenario
  - DETERMINISTICALLY optimize each scenario (i.e., tune each gate for each die scenario)
- Generate PDFs of optimal actions (BB-PDF, $\rho_{i,j}$)
- Clustering
- Post-silicon tuning
[Figure: scenarios '1', '2', ..., 'x' of a 7-gate circuit with per-gate samples Leff_1.k to Leff_7.k; plot of Power [uW] vs. Delay [ns] comparing the dual-Vth design and the proposed work against the timing target]


Results vs. Traditional Dual-Vth
- Leakage power and delay: dual-Vth vs. 2-4 ABB clusters
  - Avg. 28-38% (51-59%) lower μ
  - 3-9X tighter σ (95th)
- Area
  - Capo generates contiguous regions of similarly clustered cells while minimally displacing cells
  - 5-8% increase in wirelength and area


A few conclusions
- Parametric yield is a critical design objective going forward
  - Requires accurate estimation of, and fast optimization approaches to, this key metric
  - Envision all tools in 4-6 years being yield-driven, rather than driven by timing or power alone
- Lots of room for improvement in many 'well-studied' CAD problems today
  - Recent examples: dual-Vth + sizing, placement (Cong, et al.)
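
Worked example: mixed-Vth gate model and single-pass snapping

The mixed-Vth gate model and the single-pass Vth assignment heuristic from the dual-Vth slides can be illustrated with a small Python sketch. It is not the authors' implementation. The class name, function names and all numeric values are hypothetical. Two assumptions are made for illustration: "closer Vth" is read as the Vth that holds the larger share of the gate's width, and the re-sizing step is done for one gate in isolation with a fixed external load by holding its RC delay constant, whereas the slides re-optimize all gate sizes together.

```python
from dataclasses import dataclass


@dataclass
class MixedVthGate:
    """Per-unit-width parameters of one gate (all values are hypothetical)."""
    r_low: float   # R_l: resistance of a unit-width low-Vth gate
    r_high: float  # R_h: resistance of a unit-width high-Vth gate
    p_low: float   # P_l: power per unit width, low-Vth
    p_high: float  # P_h: power per unit width, high-Vth
    k_sl: float    # K_SL: self-loading capacitance per unit width


def delay(g, w_low, w_high, c_load):
    """RC delay D = R_eff * C_l of the parallel low/high-Vth combination."""
    r_eff = g.r_low * g.r_high / (g.r_low * w_high + g.r_high * w_low)
    c_l = c_load + g.k_sl * (w_low + w_high)
    return r_eff * c_l


def power(g, w_low, w_high):
    """Linear power model P = P_l * W_l + P_h * W_h."""
    return g.p_low * w_low + g.p_high * w_high


def snap_and_resize(g, w_low, w_high, c_load):
    """Snap the gate to its dominant Vth, then choose the width that keeps
    the original delay (assumed stand-in for the global re-optimization)."""
    target = delay(g, w_low, w_high, c_load)
    use_low = w_low >= w_high
    r = g.r_low if use_low else g.r_high
    # Single-Vth delay is D = r * c_load / W + r * k_sl, so W follows directly.
    if target <= r * g.k_sl:
        raise ValueError("target delay unreachable with a single Vth")
    width = r * c_load / (target - r * g.k_sl)
    return (width, 0.0) if use_low else (0.0, width)


gate = MixedVthGate(r_low=1.0, r_high=2.0, p_low=10.0, p_high=1.0, k_sl=0.1)
w_l, w_h = snap_and_resize(gate, w_low=0.2, w_high=1.8, c_load=4.0)
print("snapped widths:", w_l, w_h, "power:", power(gate, w_l, w_h))
```

In this example the gate is mostly high-Vth, so it snaps to high-Vth and its width grows slightly to recover the delay that the small low-Vth portion was providing.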

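Worked example: delay/leakage correlation and a sampled yield estimate

The canonical delay and log-leakage forms from the yield analysis slide lend themselves to a short numerical check. The Python sketch below computes the standard deviations and the correlation coefficient from the coefficients, then estimates the joint timing and power yield by plain Monte Carlo sampling. The slides describe an analytical bivariate JPDF; sampling is swapped in here only because it is easy to verify. The sketch assumes the X_i are independent standard normal variables shared by delay and leakage, and that each R term is an independent standard normal. All function names and coefficient values are hypothetical.

```python
import math
import random


def sigma(coeffs, indep):
    """Standard deviation of c_0 + sum(c_i * X_i) + indep * R."""
    return math.sqrt(sum(c * c for c in coeffs) + indep * indep)


def correlation(d, d_r, l, l_r):
    """rho = sum(d_i * l_i) / (sigma_d * sigma_l); the R terms are independent."""
    cov = sum(di * li for di, li in zip(d, l))
    return cov / (sigma(d, d_r) * sigma(l, l_r))


def sampled_yield(d0, d, d_r, l0, l, l_r, t_const, p_const,
                  n_samples=20000, seed=0):
    """Fraction of sampled dies with delay <= t_const and leakage <= p_const."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in d]  # shared process parameters
        delay = d0 + sum(di * xi for di, xi in zip(d, x)) + d_r * rng.gauss(0.0, 1.0)
        log_leak = l0 + sum(li * xi for li, xi in zip(l, x)) + l_r * rng.gauss(0.0, 1.0)
        if delay <= t_const and math.exp(log_leak) <= p_const:
            passed += 1
    return passed / n_samples


# Hypothetical coefficients: faster dies (lower delay) leak more.
d_coeffs, l_coeffs = [3.0, 4.0], [-0.3, -0.4]
print("rho =", correlation(d_coeffs, 0.0, l_coeffs, 1.2))
print("yield =", sampled_yield(100.0, d_coeffs, 0.0, 0.0, l_coeffs, 1.2,
                               t_const=100.0, p_const=1.0))
```

With constraints set at the nominal delay and leakage, the negative correlation makes it unlikely for a die to meet both at once, which is consistent with the low initial yield noted on the results slide.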