MicroFix: Exploiting Path-grained Timing Adaptability for Improving Power-Performance Efficiency
Guihai Yan, Yinhe Han, Hui Liu, Xiaoyao Liang, Xiaowei Li
Key Laboratory of Computer System and Architecture, ICT (Institute of Computing Technology), CAS, Beijing, P.R. China
NVIDIA Corporation, USA

Outline
• What's Path-grained Timing Adaptability (PTA)
• Potential of PTA for Efficiency Improvement
• How to Exploit PTA
• Case Study Results
• Conclusions

Impact of DVFS on Path Delay
[Figure: flip-flops (FF) between the (K-1)th stage and the Kth stage, cycle period T, paths P1 and P2, with the critical path and a non-critical path marked]
• Traditionally, suppose that scaling down the voltage makes P1 and P2 timing critical; then what?
  – Scale down the frequency for all stages of the pipeline
Question:
• Can these emerging critical paths be salvaged to trade for more voltage scaling down?
• Maybe yes! By fine-grained time stealing

Timing Imbalance
• Generous Flip-flop (GFF): Slack_up > TH, Slack_dn > TH
• Forward Adaptable Flip-flop (FAFF): Slack_up > TH, Slack_dn ≤ TH
• Backward Adaptable Flip-flop (BAFF): Slack_up ≤ TH, Slack_dn > TH
• Unadaptable Flip-flop (UAFF): Slack_up ≤ TH, Slack_dn ≤ TH
[Figure: one timing diagram per flip-flop class over cycle period T, with labels PCP and NCP]

Intrinsic Timing Imbalance
• Case study
  – FPU, adopted by OpenSPARC T1
  – Supports all IEEE 754 floating-point data types
  – Synthesized by Synopsys Design Compiler with UMC 0.18um technology
  – Cycle period: (1+10%) × T_critical
• The GFFs, FAFFs, and BAFFs make up a considerable, even dominant, proportion of the flip-flops!
[Figure: number of flip-flops (log scale) in each class]
[Figure: bars for GFF, FAFF, BAFF, and UAFF at TH = 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, and 0.4 cycle. Callout: attractive potential]

DVFS Exacerbating Imbalance
• Generally, the time margin of longer paths diminishes much faster than that of shorter ones
Example
[Figure: a flip-flop (FF) between a path of n gates and a path of m gates, with slacks Slack_up and Slack_dn]
Define: ΔS = |Slack_dn − Slack_up|
• Assume that the path delay is the sum of the delays of the gates on the path
  – TG: the gate delay
  – Delta: the change in gate delay caused by voltage scaling down
• Before voltage scaling down: ΔS1 = (n − m) × TG
• After voltage scaling down: ΔS2 = (n − m) × (TG + Delta)
• Hence ΔS1 < ΔS2

If the Imbalance Is Utilized…
• Check the lower bound of the cycle period T
  – Traditionally: T1 = n × (TG + Delta)
  – From MicroFix's perspective: T2 = (m + n)/2 × (TG + Delta) ≤ T1 − TH
Note: this excludes the UAFFs
[Figure: T versus δ and F versus 1/V, with and without MicroFix, where F = 1/T and δ = δ(V). Curve labels: n, (m+n)/2, 1/n, 2/(m+n)]

How to deal with UAFFs?
• Two-supply voltage scheme [Usami, JSSC'98] [Ghosh, TCAD'07]
• Critical Isolation: the critical paths resulting in UAFFs
• The supply voltage of the Critical Isolation is more conservative than that of the portion outside the Critical Isolation.
[Figure: the exploitable scope of MicroFix, powered by the aggressive voltage, and the Critical Isolation, powered by the conservative voltage]

How to "Fix"?
• Two supply voltage scheme
• Timing sensors [Yan, DATE'09][Agarwal, VTS'07]
• Multiple-phase clocks (generated by a DLL)
[Figure: target pipeline with the (K-1)th and Kth stage logic and their FFs (UAFF, FAFF, BAFF, GFF), timing sensors, delay error prediction signals, voltage/frequency control, a normal voltage supply and a conservative voltage supply, and clocks CLK, FCLK, and BCLK. Clock waveforms for CLK, FCLK, and BCLK with offsets of T×TH marked]

Operational Principles
(a) Traditional DVFS
• Reducing power: reduce the voltage and reduce the frequency from one (V, F) point to another
• Increasing performance: increase the voltage and increase the frequency from one (V, F) point to another
(b) MicroFix enhanced DVFS
• Reducing power and increasing performance as in (a), followed by monitoring
[Figure (b): monitoring loops with transitions labeled "No error predicted" and "Error predicted"; steps V → V−v and F → F−f on the power-reduction side, F → F+f and V → V+v on the performance side; "Restore a tight margin"]
• Ensure that the restored margins 'v' and 'f' can guard safe voltage and frequency tuning.

Experimental Setup
• Gate-level
  – Study the adaptability and overhead with a synthesized FPU
  – Timing info. -> PrimeTime
• Transistor-level
  – Investigate the power-performance tradeoffs with Hspice simulations
  – 32nm PTM models dedicated for HP and LP applications, respectively

Exploring Design Tradeoffs
• 'TH' plays a critical role in determining the ultimate efficiency
Which 'TH' is optimal?
[Figure: number of flip-flops (log scale, 1 to 10000) in each class (GFF, FAFF, BAFF, UAFF) at TH = 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, and 0.4 cycle]
• Smaller 'TH': smaller Critical Isolation (CI), but less aggressive voltage reduction!
• Larger 'TH': larger CI, but more aggressive voltage reduction!
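
The four flip-flop classes (GFF, FAFF, BAFF, UAFF) come from comparing Slack_up and Slack_dn against TH, so a TH sweep can be sketched in a few lines. This is only an illustration: the names `classify_ff` and `count_classes` are hypothetical, the slack pairs are made up rather than taken from the FPU case study, and slacks and TH are both assumed to be fractions of the cycle period.

```python
from collections import Counter

CLASSES = ("GFF", "FAFF", "BAFF", "UAFF")

def classify_ff(slack_up, slack_dn, th):
    """Classify one flip-flop from its two slacks and the threshold TH.

    Assumption for this sketch: all three values are fractions of the
    cycle period.
    """
    up_ok = slack_up > th
    dn_ok = slack_dn > th
    if up_ok and dn_ok:
        return "GFF"   # Slack_up > TH,  Slack_dn > TH
    if up_ok:
        return "FAFF"  # Slack_up > TH,  Slack_dn <= TH
    if dn_ok:
        return "BAFF"  # Slack_up <= TH, Slack_dn > TH
    return "UAFF"      # Slack_up <= TH, Slack_dn <= TH

def count_classes(slacks, th):
    """Count flip-flops per class for one TH value."""
    counts = Counter(classify_ff(up, dn, th) for up, dn in slacks)
    return {name: counts.get(name, 0) for name in CLASSES}

# Hypothetical (Slack_up, Slack_dn) pairs, NOT data from the slides.
example_slacks = [
    (0.50, 0.45),
    (0.30, 0.12),
    (0.08, 0.33),
    (0.05, 0.09),
    (0.22, 0.18),
]

for th in (0.1, 0.2, 0.3, 0.4):
    print(th, count_classes(example_slacks, th))
```

Because a flip-flop is a UAFF whenever both slacks are at or below TH, the UAFF count can only stay the same or grow as TH increases, which is the "larger TH, larger CI" side of the tradeoff.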

Exploring Design Tradeoffs /2
• Percentage of Cells in Critical Isolation

  TH      Percentage of cells
  0.1     0.00%
  0.15    0.00%
  0.2     0.16%
  0.25    2.04%
  0.3     10.82%
  0.35    22.70%
  0.4     33.52%

Exploring Design Tradeoffs /3
• Sensor Area Overhead
• A sensor is about 8x the size of a pipeline flip-flop (based on the number of transistors) [Yan, DATE09]
• The paths in the critical isolation and those with 'over-large' slack (i.e. slack > T × TH + tmargin) do not need to be monitored by sensors
[Figure: sensor area overhead (0% to 14%) versus TH (0.1 to 0.4). Data labels: 0.00%, 2.10%, 3.75%, 9.20%, 9.97%, 10.95%, 12.34%]

Exploring Design Tradeoffs /4
• Sensor Power Overhead
• In the most pessimistic case (TH=0.3, all sensors simultaneously flag timing errors): 14%
• However, such a worst-case power overhead can hardly happen, for three reasons:
  1) Sensors do not need to be always on
  2) It is almost impossible for all sensors to flag impending timing errors simultaneously
  3) TH=0.3 is actually not an optimal configuration
• Therefore, the pessimistic power overhead won't offset much of the efficiency of MicroFix!

Hspice Simulations
• Objective: investigate the detailed delay-power relation of the target pipeline
• Ideally, the transistor-level model of the target pipeline would be simulated directly with Hspice; however, that is very labor-intensive and time-consuming
• So we took an indirect way to conduct the Hspice simulations
  Ptotal(V,F) = Pcomb(V,F) + Pff(V,F)
  1/F = T = tc + tsetup + tc−to−q

Combinational Component
• ISCAS85 (c432, c499, c880, c1355, c1908, c2670)
• 32nm PTM models (HP and LP versions)
[Figure: normalized delay and normalized power versus voltage (V), from 1 V down to 0.6 V, with Low Power and High Perf. curves]
• The normalized V-D and V-P relations comply well with all of the simulated benchmarks!
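
The indirect approach rests on two formulas: total power is the sum of a combinational part and a flip-flop part, and the cycle period is the sum of the combinational delay, the setup time, and the clock-to-Q delay. A minimal sketch of how they combine follows. The function names, units, and numbers are hypothetical and are not Hspice results; `pdp` and `edp` use the usual textbook definitions (power × delay and energy × delay), which the slides do not spell out.

```python
def cycle_period(t_c, t_setup, t_c_to_q):
    """T = tc + tsetup + tc-to-q (all in the same time unit)."""
    return t_c + t_setup + t_c_to_q

def frequency(t_c, t_setup, t_c_to_q):
    """F = 1 / T."""
    return 1.0 / cycle_period(t_c, t_setup, t_c_to_q)

def total_power(p_comb, p_ff):
    """Ptotal = Pcomb + Pff at one (V, F) operating point."""
    return p_comb + p_ff

def pdp(power, period):
    """Power-delay product, i.e. energy per cycle (assumed definition)."""
    return power * period

def edp(power, period):
    """Energy-delay product (assumed definition): (P * T) * T."""
    return pdp(power, period) * period

# Made-up operating point: delays in ns, power in mW.
T = cycle_period(t_c=2.0, t_setup=0.25, t_c_to_q=0.25)
P = total_power(p_comb=3.0, p_ff=1.0)
print("T =", T, "ns, F =", frequency(2.0, 0.25, 0.25), "GHz")
print("P =", P, "mW, PDP =", pdp(P, T), "EDP =", edp(P, T))
```

Keeping the combinational and sequential terms separate mirrors the decomposition on the slide: each term can be characterized on its own and then summed at a given voltage.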

Sequential Component
[Figure: (a) normalized delay (V-D) of t_setup + t_c-to-q and (b) normalized power (V-P) versus voltage (V), from 1 V down to 0.6 V, with Low Power and High Perf. curves for α = 1, α = 0.5, and α = 0.25]

Efficiency Comparison
• TH = 0.2 is an optimal choice!
• Efficiency improvement: 35% EDP, 28% PDP

Conclusion
• MicroFix can improve DVFS efficiency by exploiting the path-grained adaptability
• The timing imbalance threshold, TH, implies a critical design tradeoff
• Efficiency improves by up to 35% in EDP for HP applications and by up to 28% in PDP for LP applications, at the expense of only 7% area overhead

Thanks! Q&A