Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Energy Efficient Designs with Wide Dynamic Range Vivek De Intel Fellow Director of Circuit Technology Research Circuits Research Lab Intel Labs MID Mobile Desktop Server Platform segment characteristics Power High Med Low Very low Workload V-F Cores Thermal Load Active High Wide Burst Serial-Parallel Med/High Stream Wide Graphics Fan (-less) Med-High Wide Burst Serial-Parallel Med/High Stream Wide Graphics Fan (-less) Throughput Parallel HPC-RAS Med/High Wide Burst Serial-Parallel Low/Med Stream Wide Graphics Passive Fan-less Med-High Wide Low-Med Wide SOC Few designs must serve all segments 2 2 Ambient Controlled Cool Controlled Varying Uncontrolled Extreme Uncontrolled Extreme Dynamic platform control Cache Cache Special Purpose Engines Graphics Cache Video Scalable Scalable On-die On-die Interconnect Interconnect Fabric Fabric Last Level Cache Maximize performance & efficiency Last Level Cache Independent V/F control regions Integrated Memory Controllers Last Level Cache Scenario-based power allocation Off Die interconnect Dynamic V/F control Workload-based core activation & shutdown Deliver best user experience under constraints 3 3 Dynamic Control Unit Dynamic adaptation & reconfiguration Voltage Control Frequency Control Configuration Control Processor Change V Change F Reconfigure Sensors Thermal Sensor Voltage Sensor Current Sensor Aging Sensor Adapt & reconfigure for best power-performance 4 4 Fault confinement Error correction System recovery System adaptation System reconfiguration Applications Programming System OS VM Firmware Microcode Microarchitecture Circuit & Design Less silicon overhead Fault diagnosis Resiliency framework Lower error rate Error detection Less recovery overhead Resilient platform features Resilient platforms Resiliency for performance, efficiency & reliability 5 5 Fine-grain power management TFLOPS at 62 watts Temporal domain Slow voltage & frequency change Energy Coarse-grain management user-demand delivered TODAY 0.0 FUTURE 1.5 Time (msec) • V/F change latency • Active-sleep transition latency • Wake-up latency • Wake-up prediction • State transition energy • Supply noise Fast voltage & frequency change user-demand delivered Energy Fine-grain management 1.0 0.5 0.0 1.0 0.5 Time (msec) 6 6 1.5 Fine-grain power management Spatial domain Fine-grain management Coarse-grain management • Each core/cluster at optimum voltage • Each core/cluster at optimum frequency • Same voltage to all cores • Same frequency for all cores • V/F domain interfaces • Synchronization overhead • Clock generation/distribution • Power grid routing • Optimum V/F for non-cores • Sub-core clock/leakage gating FUTURE TODAY 7 7 Advanced voltage regulators Conversion Distribution •Area efficient •Scalable •Persistent rail •Lower loss •Higher fidelity •Simpler Efficient Control •Fast & efficient •Load adaptive •Independent rails Low loss Fine-grain Efficiency Load chip RF Launch 5mm TOP BOTTOM Input Capacitors Load Adaptive Power Source 80 Conventional Power Source 60MHz 2.4V to 1.2V 80MHz Output Capacitors 2.4V to 1.5V 85 75 Inductors ηPS L = 1.9nH L = 0.8nH Efficiency [%] Converter chip 90 100MHz 70 0 5 10 15 Load Current [A] 20 ILOAD VR innovations for fine-grain power management 8 8 Research testchip 12.64mm I/O Area 1.5mm 2.0mm MSINT MSINT 39 39 Crossbar Router MSINT single tile Mesochronous Interface 20 GB/s MSINT 2KB Data memory (DMEM) 3KB Inst. memory (IMEM) 21.72mm PLL I/O Area TAP RIB 64 96 64 64 32 32 6-read, 4-write 32 entry RF 32 32 x x 96 + 32 Normalize FPMAC0 Tile 9 9 + 32 Normalize FPMAC1 Processing Engine (PE) Many-core power management 21 sleep regions per tile (not all shown) Data Memory Sleeping: 57% less power Instruction Memory Sleeping: Dynamic sleep FP Engine 1 Sleeping: 90% less power 56% less power STANDBY: • Memory retains data • 50% less power/tile FULL SLEEP: • Memories fully off • 80% less power/tile Router Sleeping: 10% less power (stays on to pass traffic) FP Engine 2 Sleeping: 90% less power Energy efficiency of 19.4 Gflops/Watt 10 10 Memory integration 3D stacking with thru-silicon vias 256 KB SRAM per core 4X C4 bump density 8490 thru-silicon vias Processor with Cu bump Polaris Processor denser than C4 pitch DRAM Memory C4 pitch Thru-Silicon Via Package Package 11 11 Voltage-frequency range limiters Vmax/Fmax limiters • Reliability • Thermals • Power delivery Fmax Voltage Vmax Vmin Frequency Vmin limiters • • • • Circuit functional failures Soft errors Steep frequency roll-off Aging Reliability & functional failures limit range 12 12 Voltage-frequency margins V variation T variation IR drop Inductive droops Load line variations Worst T Frequency MIS Signal coupling Path activity Worst Critical path F margin Voltage V margin Aging Nominal T EOL V margin Frequency Voltage V margin F margin Voltage Voltage V margin F margin F margin Nominal BOL Frequency Frequency 13 13 Multi-voltage cache 6T SRAM Vmin Cumulative fail rate Array Vmin SER, erratic bits Nominal array Vmin Worst die Vmin Multi-V 6T SRAM 8T+ cell LLC density Density Array voltage Push active Vmin limit to Vmax Embedded level shifters for wordline & write drivers minimize area & power overhead 14 14 Dynamic multi-voltage cache Wordline underdrive for read BL# 6T SRAM cell BL VCC Array to WL differential supply noise tracking VCC NX Weak Write Strong VOPAMP TD R1 NPD Read WL PPU NX WL Dynamic voltage collapse for write VCC _ Retention ) VSS + R2 TM /RC M = VCC(1-e- VWL C Duration Strong Weak Balanced PPU Strong Weak Balanced Pulse width control min-cells 128Kb min-cells 128Kb 0.7 PBIST block p-cells 128Kb p-cells 128Kb 128Kb p-cells 107Kb p-cells min-cells Testing interface 1.1V-0.7V membrane probe card 128Kb Vcc WLUD & dummy WLs 0.91mm2 IO pad drivers Chip Area R MIN cell Source: Intel 6.2M Magnitude TM 45nm dynamic multi-V testchip Transistor count 0 TD VWL(V) 0.8 0.9 1 1.1 DVC off 1.E+10 Weakest DVC 1.E+08 Strongest DVC 1.E+06 1.E+04 Nominal DVC 1.E+02 26X less fails 15 15 1.E+00 Relative Single Bit Fails NPD Cache reconfiguration Reduce cache size @ low V/F by eliminating failing words/bits Bit fix Word disable 00101000 Failing words Bitmap of failing words Supply Voltage 0.4 0.45 0.5 0.55 0.6 0.65 1.0E+00 Failure Prob. 1.0E-03 1.0E-04 Word disable 1.0E-05 1.0E-06 1.0E-07 1.0E-10 Density Capacity Latency (cycles) IPC EPI Conventional 660 1* 1* (8-way) L1: 3 L2: 20 1* 1* 32KB L1 Word disable 500 0.92 0.5 (4-way) 4 0.95 0.5 2MB L2 Bit fix 500 1 0.75 (6-way) 23 0.95 0.5 1-bit ECC 1.0E-02 1.0E-09 Vmin (mV) Source: Intel 1.0E-01 1.0E-08 0.7 Bit fix 10-bit ECC Source: Intel * Normalized reference value 1.0E-11 16 16 Low-voltage logic design Narrow muxes No stack height > 2 Robust flip-flops Robust level converters vcch “1” vcch vcch “1” “1” vcch vccl “0” “0” “0” input 4:1 Mux Improve range MIPS/Watt Impact max performance Efficiency: Efficiency: MIPS/Watt Design & technology optimizations to balance range, performance & efficiency Improve range & efficiency Performance Performance 17 17 output Low-voltage motion estimation engine Motion Estimation Engine I/O Control 65nm CMOS 70K transistors Die area ~1mm2 Output FIFO Input FIFO 10000 100 450 Wide V-F range Fmax (MHz) 10 100 1 10 GOPS/Watt 400 1000 0.1 0.4 0.6 0.8 1 1.2 412 Gops/Watt 300 @ 320 mV! 250 9.6X 200 150 Source: Intel P1264, 50°C 50 0 0.01 0.2 350 100 Source: Intel 1 Clock Generator 0.2 1.4 0.4 0.6 0.8 1 Supply Voltage (V) Supply Voltage (V) 18 18 1.2 1.4 F0 PLL0 F1 Clocking PLL1 F2 Div gate I/O clk PLL2 core clk CLOCKING Dynamic V & F adaptation Noise gen Output port Input buffer TCP/IP Processor Core VRM Noise gen ctrl TCP/IP processor Source: Intel Noise injector PMOS body bias PMOS CBG NMOS CBG Thermal sensor Time DAB Control Droop sensor Prototype chip in 90nm Vcc PLL command JTAG CONTROL NMOS body bias DAB Temp Sensors Input & Analog Buffer 2nd droop Source: Intel 3rd droop 1 st droop Time Environment-aware • Adapt F/V to V/T change reduce V/T margin dynamic adaptation • Adapt F/V to aging reduce aging margin 19 19 Resilient circuits Error Detection Sequential (EDS) MSFF MSFF E rro clk • • • • Error det. Voltage droop r! clk Detect errors in critical path FFs Propagate error signals Correct errors by re-execution Feedback to adaptive V/F 65nm resilient circuits testchip 20 20 Resiliency experiments Response to voltage droops 1.0E-01 3.0 1.0E-04 Max TP Resilient Design Max TP 1.0E-07 Source: Intel 1.0 Power (mW) 4.0 2.0 Conventional Design 200 1.0E+02 VCC & Temperature FCLK Guardband Error Rate (%) Throughput (BIPS) 5.0 Nominal: VCC=1.2V & Temp=60C Worst-Case: 10% VCC Droop & Temp=110C 2400 2700 3000 3300 150 100 Conventional Design Resilient Design 50 1.0E-10 0.0 2100 21% Throughput Gain OR 37% Power Reduction Source: Intel 0 1.0E-13 3600 1.0 Clock Frequency (MHz) 1.5 2.0 Throughput (BIPS) 21 21 2.5 3.0 Summary • Energy efficiency and wide dynamic operating range are critical for all platforms • Integration, fine-grain power management, advanced voltage regulators & 3D memory stacking are key for energy efficiency • Reliability, functionality, margins & efficiency limit dynamic operating range • Multi-voltage design, dynamic adaptation, reconfiguration & resiliency are key enablers 22 22 Acknowledgement CRL prototype Tomm Aldridge Jerry Bautista Keith Bowman Lev Finkelstein Marci Glenn Matthew Haycock Andrew Henroid Steven Hsu Alaa Alameldeen Sean Koehl Sanu Mathew Alon Naveh Clark Roberts Mark Rowland Joe Schutz Sriram Vangal Bibiche Geuskens Clair Webb Bangalore Design Mark Anders Nitin Borkar Saurabh Dighe Shih-Lien Lu George Goodman Chris Wilkerson Yatin Hoskote Tanay Karnik M. Khellah R. Krishnamurthy Randy Mooney Trang Nguyen Ronny Ronen Greg Ruhl D. Somasekhar Manny Vara Steven Hsu Mark Bohr LTD Design Paolo Aseron Shekhar Borkar Zeshan Chishti Varghese George Steve Gunther Ming Zhang Jason Howard Himanshu Kaul V. Erraguntla Partha Kundu A. Raychowdhury Fabrice Paillet Erfaim Rotem Gerhard Schrom James Tschanz Howard Wilson Kevin Zhang Gunjan Pandya 23 23 Murli Tirumala Ravi Mahajan Ravi Prasher Venkat Natarajan Anand Deshpande Ali Farhang Amit Agarwal Robert Chau Rajesh Kumar Ketan Paranjape Greg Taylor P. Vishakantaiah R. Kuppuswamy Rohit Vidwans Gautam Doshi Sunit Tyagi B. Chatterjee Sati Banerjee Q&A 24 24