Hardware Reliability Margining for the Dark Silicon Era
Liangzhen Lai and Puneet Gupta
Department of Electrical Engineering, University of California, Los Angeles
[email protected]
This work is supported in part by NSF Variability Expedition grant CCF-1029030.

Outline
- Overview
- Accumulation Model and Management Policies
- Problem Formulation
- Experimental Results
- Conclusion

Hardware Reliability Margin
- Parametric margin
  - Voltage/frequency or sign-off corners
  - E.g., BTI, HCI
- Physical margin
  - Metal width, layout spacing
  - E.g., current-dependent minimum metal width for EM
- Typically worst-case driven
  - Mostly derived at hardware design time
  - Uncertainty in workload, circuit operating points, etc.

Reliability vs. Operating Points
- Most reliability-related phenomena depend heavily on the circuit operating points
  - Voltage, frequency, temperature, etc.

Dynamic Range of Operations
- Efficiency needed for the Dark Silicon Era
  - Multi/many-core design with less powerful cores
  - Low voltage/current/power -> less margin
- "Turbo X": Turbo Boost (Intel), Turbo Core (AMD)
  - Under certain conditions
  - High voltage/current/power -> more margin
[Figure labels: workload ranges from moderate parallel (low stress states) to intensive single-thread (high stress states); reliability margin ranges from known optimistic to known pessimistic]

Dark Silicon Contexts
- Pessimism depends on the difference between peak power/temperature and sustainable power/temperature
- Power constraint
  - Quantify silicon "darkness"
  - Dark ratio: [equation missing in source]
  - Limit on maximum instantaneous power
- Thermal constraint
  - Limit on maximum on-chip temperature

Margining Methodology
- Formulate as workload optimization
  - Maximize the reliability degradation
  - Still meet the power/thermal constraints

Dynamic Reliability Model
- Most reliability models are static
  - Derived for constant voltage/current/temperature
- Need a highly dynamic model for optimization
  - Comparing different degradation scenarios
[Figure: two v-versus-t waveforms over power states P1, P2 and P3, compared side by side]

Accumulation Model
- Input: time spent in each power state
- Output: worst-case degradation at the end of lifetime
- Some models can be derived from the model itself
  - E.g., EM can be modeled by the effective current density J_eff
- Others can be derived by simulator
  - E.g., worst-case BTI degradation can be derived by simulating different power state orderings and picking the worst case
  - Fitting and interpolation can also be used

Spatial Problem vs. Temporal Problem
- With the accumulation model, reliability degradation can be modeled as a temporal distribution problem
[Figure: v versus t over power states P1, P2, P3]
- The workload and power/thermal constraints are spatial problems
[Figure: cores assigned to power states P1, P3, P1, P2, P1, P2]

System Management Policy
- We assume a fair round-robin policy
  - Iterate scheduling priorities among all processor cores
  - The iteration period can range from hours to days
- We assume this policy because it is:
  - Simple: open-loop, reasonable to assume at hardware design time
  - Effective: sufficient iterations to balance workload during a typical hardware lifetime of multiple years
  - Pessimistic: more sophisticated policies are likely to perform better, i.e., the margin is pessimistic

Bridging Spatial and Temporal Problems
- The management policy will iterate workload among all cores
- Spatial distribution is equivalent to temporal distribution
[Figure: spatial constraints (cores in states P1, P3, P1, P2, P1, P2) mapped to a temporal distribution (v versus t over P1, P2, P3)]

Optimization Under Power Constraints
[Equation missing in source]
- x is the number of cores in each power state
  - Also the input to the accumulation model f(x)
- P is the power corresponding to the power states
- P_max is the power constraint
- Formulated as an Integer Linear Programming (ILP) problem

Thermal Problem
- The thermal limit can be reached in two scenarios
  - Heat up then cool down (left)
  - Constant temperature (right)
- The constant stress will result in worse degradation
  - Higher average temperature
  - More time in high power state

Optimization Under Thermal Constraints
[Equation missing in source]
- S is the time spent in each power state for each core
- A is the temperature sensitivity matrix
  - Temperature increase per unit power
- T_max is the maximum temperature constraint
- T_bak is the background power for each core
- Formulated as a Linear Programming (LP) problem

Experimental Setup
- Power model
  - Based on a commercial processor benchmark
  - Using libraries characterized at different supply voltages from 0.6 V to 0.9 V
- Thermal model
  - Using the HotSpot simulator
  - Consider the cases of 2x2, 4x4, 8x8 and 16x16 cores
- BTI: both NBTI and PBTI
- EM: metal sized to have the same current density (MTTF)

Local Power Network EM Results
[Figure: results under power constraint and under thermal constraint; annotation: 40% reduction]

Signal Wire EM Results
[Figure: results under power constraint and under thermal constraint; annotation: 60% reduction]

BTI Results
[Figure: results under power constraint and under thermal constraint; annotation: 20% reduction]

Conclusion
- We propose a hardware reliability margining methodology for chips in the dark silicon era
- We formulate the margining problem under power and thermal constraints
- Experimental results show that at 60% dark ratio, our method can achieve a 40%-60% reduction in metal width margin and a 20% reduction in BTI delay margin

Backup slides

EM Accumulation Model
- Effective current density: [equation missing in source]
- For the local power mesh, J_eff can be calculated from the average power consumed
- For signal wires, J_eff is proportional to V * f

BTI Accumulation Model
- Two steps:
  - Identify the worst-case ordering by simulator
    - Worst BTI degradation happens when power states are applied in increasing order of stress voltages
  - Fit the accumulation model
    - First pick a set of power state distribution samples x
    - Simulate the degradation g(x)
    - Assuming the fitting function is: [equation missing in source]
    - Formulated as: [equation missing in source]
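
Illustration (not from the original slides): the power-constrained formulation under "Optimization Under Power Constraints" can be sketched as a small search over core-count vectors. This is a hedged sketch, not the authors' implementation. It uses brute-force enumeration as a stand-in for an ILP solver, it assumes every core sits in exactly one power state, and it uses a time-weighted stress as the accumulation proxy f(x), in the spirit of the EM model where J_eff follows average power. The function name, the three power states, and all numeric values are hypothetical.

```python
from itertools import product


def worst_case_distribution(n_cores, state_power, state_stress, p_max):
    """Return the feasible core-count vector x with the largest stress proxy.

    x[i] is the number of cores in power state i. Under a fair round-robin
    policy, x[i] / n_cores is also the fraction of lifetime each core spends
    in state i, so the proxy f(x) is a time-weighted average of state_stress.
    Returns (None, -inf) when no assignment meets the power constraint.
    """
    best_x, best_f = None, float("-inf")
    for x in product(range(n_cores + 1), repeat=len(state_power)):
        if sum(x) != n_cores:
            continue  # assumption: each core is in exactly one power state
        total_power = sum(xi * p for xi, p in zip(x, state_power))
        if total_power > p_max:
            continue  # violates the chip-level power constraint P_max
        f = sum(xi * s for xi, s in zip(x, state_stress)) / n_cores
        if f > best_f:
            best_x, best_f = x, f
    return best_x, best_f


# Hypothetical setup: states are (off, low, turbo), power in watts.
state_power = [0.0, 1.0, 4.0]
state_stress = [0.0, 1.0, 4.0]  # assumed proxy: stress tracks average power
x, f = worst_case_distribution(4, state_power, state_stress, p_max=8.0)
print("worst-case core counts:", x, "stress proxy:", f)
```

Enumeration is only practical for a handful of cores and states. For the 16x16 case mentioned in the experimental setup, a dedicated ILP solver would be the natural replacement, with the same objective and constraints.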