Slide 1 - Technologies for Reducing Power
Trevor Mudge, Bredt Family Professor of Engineering
Computer Science and Engineering, The University of Michigan, Ann Arbor
SAMOS X, July 18th, 2010

Slide 2 - Technologies for Reducing Power
• Near-threshold operation
• 3D die stacking
• Replacing DRAM with Flash memory

Slide 3 - Background
• Moore's Law: the density of components doubles, without increase in cost, every 2 years (ignoring NRE costs, etc.)
• Process nodes: 65 ➞ 45 ➞ 32 ➞ 22 ➞ 16 ➞ 11 ➞ 8 nm (= F nanometers); Intel has 32 nm in production
• Energy per clock cycle: E = C·Vdd² + (Ileak·Vdd)/f, where Vdd is the supply voltage, f the frequency, C the switched capacitance, and Ileak the leakage current
• Vth is the "threshold" voltage at which the gate switches, e.g. Vth ≈ 300 mV and Vdd ≈ 1 V

Slide 4 - The Good Old Days: Dennard Scaling
If s is the linear dimension scaling factor (s ≈ √2):
• Device dimension (tox, L, W): 1/s
• Voltage (V): 1/s
• Current (I): 1/s
• Capacitance (εA/t): 1/s
• Delay (VC/I): 1/s
• Power (VI): 1/s²
• Power density (VI/A): 1

Slide 5 - Recent Trends
• Circuit supply voltages are no longer scaling, so power does not decrease at the same rate that transistor count increases: energy density is skyrocketing.
• Power density: U = (C·Vdd²·f + Ileak·Vdd) / A, where Vdd is stagnant, the gate area A is shrinking (scaling as 1/s²), capacitance C scales by less than 1/s, and the dynamic term C·Vdd²·f dominates.
• The emerging dilemma: more and more gates can fit on a die, but cooling constraints are restricting their use.

Slide 6 - Impact on Dennard Scaling
If s is the linear dimension scaling factor (≈ √2), classic ➞ current:
• Device dimension (tox, L, W): 1/s
• Voltage (V): 1/s ➞ 1
• Current (I): 1/s
• Capacitance (εA/t): 1/s
• Delay (VC/I): 1/s ➞ 1
• Power (VI): 1/s² ➞ 1/s
• Power density (VI/A): 1 ➞ s

Slide 7 - Techniques for Reducing Power
• Near-threshold operation: Vdd near Vth
• 3D die stacking
• Replacing DRAM with Flash memory

Slide 8 - Today: Super-Vth, High Performance, Power Constrained
• Super-Vth operation (e.g. a Core i7): 3+ GHz at roughly 40 mW/MHz.
• Energy per operation is the key metric for efficiency.
• Goal: same performance, lower energy per operation.
[Figure: normalized power, energy, and performance; energy per operation and log(delay) versus supply voltage from 0 through Vth to Vnom.]

Slide 9 - Subthreshold Design
• Operating in the sub-threshold region gives huge power gains at the expense of performance: relative to super-Vth operation, energy per operation drops by roughly 16X while delay grows by 500-1000X (a toy model of this trade-off follows below).
• OK for sensors!
[Figure: energy per operation and log(delay) versus supply voltage, annotating the sub-Vth and super-Vth regions.]
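To make the energy argument on the last few slides concrete, here is a small numerical sketch (not from the talk) of the energy-per-cycle relation on the Background slide, E = C·Vdd² + Ileak·Vdd/f, recast per operation with an assumed alpha-power delay model. Every constant in it is a hypothetical placeholder chosen only to exhibit the minimum-energy point that motivates near-threshold operation.

```python
# Back-of-envelope sketch of E = C*Vdd^2 + I_leak*Vdd / f, per operation,
# with an assumed alpha-power delay model.  All constants are hypothetical
# placeholders (not from the talk): the dynamic term falls as Vdd^2 while the
# leakage energy per operation grows as the circuit slows, giving a
# minimum-energy point in the near-threshold region.

VTH   = 0.30    # threshold voltage [V] (slide: Vth ~ 300 mV)
C     = 1e-12   # switched capacitance per operation [F]   (assumed)
ILEAK = 5e-5    # leakage current [A]                       (assumed)
ALPHA = 1.5     # alpha-power-law exponent                  (assumed)
T0    = 1e-9    # delay scale factor                        (assumed)

def delay(vdd):
    """Alpha-power-law delay; blows up as Vdd approaches Vth."""
    return T0 * vdd / (vdd - VTH) ** ALPHA

def energy_per_op(vdd):
    """Dynamic plus leakage energy for one operation at supply vdd."""
    e_dyn  = C * vdd ** 2               # the C*Vdd^2 term
    e_leak = ILEAK * vdd * delay(vdd)   # I_leak*Vdd integrated over one op
    return e_dyn + e_leak

points = [(v / 100.0, energy_per_op(v / 100.0)) for v in range(35, 121, 5)]
for vdd, e in points:
    print(f"Vdd = {vdd:4.2f} V   E/op = {e * 1e12:6.3f} pJ")
vmin, _ = min(points, key=lambda p: p[1])
print(f"Minimum-energy supply for these assumed constants: ~{vmin:.2f} V")
```

With these toy constants the minimum lands a little above Vth; the qualitative shape, not the exact numbers, is the point.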
Slide 10 - Near-Threshold Computing (NTC)
• Near-threshold computing: 60-80X power reduction and 6-8X energy reduction, at the cost of roughly 10X more delay.
• Pushing further into sub-Vth buys only another ~2-3X in energy but costs another ~50-100X in delay.
• Invest a portion of the extra transistors from scaling to overcome the barriers.
[Figure: energy per operation and log(delay) versus supply voltage, marking the sub-Vth, NTC, and super-Vth regions.]

Slide 12 - Restoring Performance
• At NTC, delay increases by 10X.
• The computation requires N operations: break it into N/10 parallel subtasks and execution time is restored.
• Total energy is still 8X less, since the operation count is unchanged; power is 80X less (see the sketch below).
• This is predicated on being able to parallelize the workload, so it suits a subset of applications, as noted earlier: streams of independent tasks (a server) and data-parallel work (signal/image processing).
• It is important to also have a solution for code that is difficult to parallelize: single-thread performance.
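The arithmetic behind the Restoring Performance slide can be written down directly. The sketch below uses only the slide's ratios (10x delay, 8x energy, ten-way parallelism); the absolute baseline values are arbitrary. One reading worth making explicit: with execution time restored, total cluster power falls by the same 8x as energy, and the ~80x figure matches the power of each individual NTC core.

```python
# Bookkeeping sketch of the "restoring performance" argument: at NTC each
# operation is ~10x slower but ~8x cheaper in energy, so splitting the N
# operations across 10 cores restores execution time while total energy stays
# ~8x lower.  Only the ratios come from the slide; the baseline numbers are
# arbitrary placeholders.

N_OPS              = 1e9    # operations in the computation      (arbitrary)
BASE_TIME_PER_OP   = 1e-9   # seconds per op at nominal Vdd      (arbitrary)
BASE_ENERGY_PER_OP = 1e-9   # joules per op at nominal Vdd       (arbitrary)

DELAY_FACTOR  = 10          # NTC ops are 10x slower             (slide)
ENERGY_FACTOR = 8           # NTC ops use 8x less energy         (slide)
N_CORES       = 10          # break the work into N/10 subtasks  (slide)

def run(cores, time_per_op, energy_per_op):
    """Return (execution time, total energy, total power, per-core power)."""
    exec_time = (N_OPS / cores) * time_per_op    # perfectly parallel split
    energy    = N_OPS * energy_per_op            # operation count unchanged
    power     = energy / exec_time
    return exec_time, energy, power, power / cores

base = run(1, BASE_TIME_PER_OP, BASE_ENERGY_PER_OP)
ntc  = run(N_CORES, BASE_TIME_PER_OP * DELAY_FACTOR,
           BASE_ENERGY_PER_OP / ENERGY_FACTOR)

for name, (t, e, p, ppc) in (("1 core @ nominal", base), ("10 cores @ NTC", ntc)):
    print(f"{name:17s} time={t:5.2f}s  energy={e:6.3f}J  "
          f"power={p:6.3f}W  per-core={ppc:7.4f}W")
```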
Slide 13 - Interesting Consequences: SRAM
• SRAM has a lower activity rate than logic, so its VDD for minimum-energy operation (Vmin) is higher.
• Logic naturally operates at a lower Vmin than SRAM, and runs slower there.
[Figure: normalized delay and total energy versus VDD for logic and SRAM, showing the dynamic and leakage components.]

Slide 14 - NTC: Opportunities and Challenges
Opportunities:
• New architectures.
• Optimize processes to gain back some of the 10X delay.
• 3D integration, with fewer thermal restrictions.
Challenges:
• Low-voltage memory: new SRAM designs; robustness analysis at near-threshold.
• Variation: Razor and other in-situ delay-monitoring techniques; adaptive body biasing.
• Performance loss: many-core designs to improve parallelism; core boosting to improve single-thread performance.

Slide 15 - Proposed Parallel Architecture
• Clusters of cores share a cache/SRAM through a level converter: the cores run at (fcore, Vddcore, Vthcore) while the shared cache/SRAM runs at (k·fcore, Vddmem, Vthmem), and the clusters attach to a 2nd-level memory.
[Figure: a conventional cluster with cores and cache all at (f0, Vdd0, Vth0), versus clusters 1…n of the proposed organization, each with cores 1…k behind a level-converted shared cache/SRAM, attached to 2nd-level memory.]
1. R. Dreslinski, B. Zhai, T. Mudge, D. Blaauw, and D. Sylvester. An Energy Efficient Parallel Architecture Using Near Threshold Operation. 16th Int. Conf. on Parallel Architectures and Compilation Techniques (PACT), Romania, Sep. 2007, pp. 175-188.
2. B. Zhai, R. Dreslinski, D. Blaauw, T. Mudge, and D. Sylvester. Energy Efficient Near-threshold Chip Multi-processing. Int. Symp. on Low Power Electronics and Design (ISLPED), Aug. 2007, pp. 32-37.

Slide 16 - Cluster Results
• Cholesky benchmark at 230 MHz-equivalent performance; the baseline is a single CPU at 233 MHz with L1 and L2.
• NTC 4-core (one core per L1): 53% average power savings over the baseline.
• Clustered NTC (multiple cores per L1; 3 cores per cluster, 2 clusters): 74% average savings over the baseline.
[Figure: normalized power for the single CPU, NTC 4-core, and clustered NTC configurations.]

Slide 17 - New NTC Architectures
• Recall that SRAM is run at a higher VDD than the cores with little energy penalty, so the caches operate faster than the cores.
• This enables clustered architectures in which multiple cores share an L1; the L1 is operated fast enough to satisfy all core requests in one cycle, so each core sees what looks like a private, single-cycle L1.
• Advantages (leading to lower power): clustered sharing; less coherence/snoop traffic.
• Drawbacks (increased power): cores conflict and evict each other's L1 data (more misses); an additional bus/interconnect from the cores to the L1 (not as tightly coupled).
[Figure: clusters of cores sharing an L1, attached through a bus/switched network to the next-level memory.]

Slide 18 - Digression: Chip Makers' Response
• Chip makers exchanged frequency for cores: multi-/many-cores.
• Risky behavior: "if we build it, they will come" is predicated on the solution to a tough problem, parallelizing software.
• Multi-cores have only been successes in throughput environments (servers), heterogeneous environments (SoCs), and data-parallel applications.
• Parallel processing is application specific; that's OK: treat parallel machines as attached processors. This has been true in SoCs for some time, with control-plane / data-plane separation.
[Figure: performance versus frequency, and performance versus number of cores.]

Slide 19 - Measured Thread-Level Parallelism (TLP)
• Caveat: desktop applications.
• G. Blake, R. Dreslinski, T. Mudge (University of Michigan), and K. Flautner (ARM). Evolution of Thread-Level Parallelism in Desktop Applications. ISCA 2010, to appear.

Slide 20 - Single-Thread Performance: Boosting
• Baseline: 4 cores at 15 MHz (650 mV); the cluster cache runs at 4x the core frequency, 60 MHz (700 mV), and is pipelined.
• Boosted (4x): turn some cores off and speed up the rest; 1 core at 60 MHz (850 mV) with the cache at 60 MHz (1 V), now un-pipelined. Faster response time, same throughput, and the remaining core sees a larger cache, hiding the longer DRAM latency.
• Overclocked (8x): increase the core voltage and frequency further; 1 core at 120 MHz (1.3 V) with the cache at 120 MHz (1.5 V), so the cache frequency must be increased. Even faster response time and increased throughput.

Slide 21 - Single-Thread Performance
• Look at turning off cores and speeding up the remaining cores to gain a faster response time (see the sketch below).
• Graph of cluster performance (not measured; intuition).
• R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge. Near Threshold Computing: Overcoming Performance Degradation from Aggressive Voltage Scaling. Workshop on Energy-Efficient Design (WEED 2009), held at the 36th Int. Symp. on Computer Architecture, Austin, TX, June 2009.
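A quick arithmetic check of the boosting modes, using the slide's 130 nm frequencies and voltages. The assumptions that cluster throughput scales with total core-Hz and that single-thread response time tracks one core's frequency are mine, for illustration; cache pipelining and DRAM latency effects are ignored.

```python
# Relative throughput and single-thread speed of the three boosting modes,
# under the simplifications described above.  Frequencies and voltages are
# the slide's 130 nm numbers; the scaling model is an assumption.

MODES = [
    # name               cores  core MHz  core V   cache MHz  cache V
    ("Baseline (NTC)",     4,      15,     0.650,     60,      0.700),
    ("Boosted (4x)",       1,      60,     0.850,     60,      1.000),
    ("Overclocked (8x)",   1,     120,     1.300,    120,      1.500),
]

base_cores, base_mhz = MODES[0][1], MODES[0][2]

print(f"{'mode':18s} {'throughput':>11s} {'single-thread':>14s}")
for name, cores, mhz, vcore, cache_mhz, vcache in MODES:
    throughput    = (cores * mhz) / (base_cores * base_mhz)  # vs. baseline cluster
    single_thread = mhz / base_mhz                           # vs. one 15 MHz core
    print(f"{name:18s} {throughput:10.1f}x {single_thread:13.1f}x")
# Boosting keeps cluster throughput at 1.0x while one thread runs 4x faster;
# overclocking doubles throughput and runs the thread 8x faster.
```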
Slide 22 - Boosting Clusters, Scaled to 22 nm
• Baseline: 4 cores at 140 MHz; the cluster cache runs at 4x the core frequency and is pipelined.
• Boosted: turn some cores off and speed up the rest; 1 core at 600 MHz with the cache at 600 MHz, now un-pipelined. Faster response time, same throughput, and the core sees a larger cache, hiding the longer DRAM latency.
• Overclocked (8x): boost the core voltage and frequency further; 1 core at 1.2 GHz with the cache at 1.2 GHz, so the cache frequency must be increased. Even faster response time and increased throughput.

Slide 24 - Technologies for Reducing Power
• Near-threshold operation
• 3D die stacking
• Replacing DRAM with Flash memory

Slide 25 - A Closer Look at Wafer-Level Stacking
• Preparing a TSV (through-silicon via), the "super-contact".
[Figure: cross-section showing oxide, silicon, dielectric (SiO2/SiN), gate poly, STI (shallow trench isolation), W (tungsten contact and via), Al (M1-M5), and Cu (M6, top metal).]
• Source: Bob Patti, CTO, Tezzaron Semiconductor.

Slide 26 - Next, Stack the Second Wafer and Thin It
• FF: face-to-face bond.
• Source: Bob Patti, CTO, Tezzaron Semiconductor.

Slide 27 - Then Stack a Third Wafer
• FB: face-to-back bond; the 3rd wafer goes on the 2nd wafer, with the 1st wafer acting as the controller.
• Source: Bob Patti, CTO, Tezzaron Semiconductor.

Slide 28 - Finally, Flip, Thin, and Add Pads
• The completed stack: 1st wafer (controller), 2nd wafer, 3rd wafer.
• Source: Bob Patti, CTO, Tezzaron Semiconductor.

Slide 29 - Characteristics
• Very high-bandwidth, low-latency, low-power buses are possible: 10,000 vias per sq mm.
• Electrical characteristics: ~1 fF and < 1 Ω; no I/O pads are needed for inter-stack connections, which keeps power low.
• Consider a memory stack: DDR3 costs ~40 mW per pin, so 1024 data pins ➞ 40 W and 4096 data pins ➞ 160 W, while die-on-wafer connections cost ~24 µW per pin (see the sketch below).
• Pros and cons: 3D interconnect failure < 0.1 ppm; heat of 1 W per sq mm; KGD (known good die) may be a problem for the foundry; different processes can be combined (DRAM / logic / analog / non-volatile memories), e.g. DRAM can split its sense amps and drivers from the memory cells.
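The memory-stack bus-power comparison on the Characteristics slide is just a per-pin multiplication; the sketch below spells it out using the slide's ~40 mW/pin (DDR3) and ~24 µW/pin (die-on-wafer) figures. Treating total interface power as pins times power-per-pin is the obvious simplification.

```python
# Interface power for the slide's two pin counts, for conventional DDR3-style
# I/O versus die-on-wafer 3D connections.  Per-pin figures are the slide's.

POWER_PER_PIN = {
    "DDR3 off-chip I/O":   40e-3,   # watts per pin (slide)
    "die-on-wafer 3D via": 24e-6,   # watts per pin (slide)
}

for pins in (1024, 4096):
    print(f"{pins} data pins:")
    for tech, per_pin in POWER_PER_PIN.items():
        total = pins * per_pin
        label = f"{total:6.1f} W " if total >= 1 else f"{total * 1e3:6.1f} mW"
        print(f"  {tech:20s} ~{label}")
# 1024 pins: ~41 W of DDR3 I/O power versus ~25 mW through the 3D vias;
# 4096 pins: ~164 W versus ~98 mW -- the slide rounds these to 40 W and 160 W.
```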
Slide 30 - Centip3De: a 3D NTC Project
• Centip3De design: 130 nm, 7-layer 3D-stacked chip; 128 ARM M3 cores; 1.92 GOPS at 130 mW; taped out Q1 2010.
[Figure: the layer stack, with face-to-face bonded logic layer pairs (A/B), DRAM sense/logic with bond routing, and face-to-face bonded DRAM layers.]

Slide 31 - Stacking the Die
• Cluster configuration: 4 ARM M3 cores at 15 MHz (650 mV); a 1 kB instruction cache at 60 MHz (700 mV); an 8 kB data cache at 60 MHz (700 mV); the cores connect via 3D to the caches on the other layer.
• System configuration: 2-wafer = 16 clusters (64 cores); 4-wafer = 32 clusters (128 cores); DDR3 controller.
• Estimated performance (Raytrace): 1.6 GOPS (0.8 GOPS on 2-wafer); 110 mW (65 mW on 2-wafer); 14.8 GOPS/W.
• Fair metric: Centip3De achieves 24 GOPS/W without DRAM.

Slide 32 - Design Scaling and Power Breakdowns
• Centip3De (130 nm): 1.9 GOPS (3.8 GOPS in boost); max 1 IPC per core; 128 cores at 15 MHz; 130 mW; 14.8 GOPS/W (5.5 in boost).
• Scaled NTC system (22 nm): ~600 GOPS (~1k GOPS in boost); max 1 IPC per core; 4,608 cores at 140 MHz; ~3 W; ~200 GOPS/W.
[Figure: power breakdowns (mW) across cores, I-caches, D-caches, and DRAM in NTC mode and boosted mode, raytracing benchmark.]

Slide 33 - Technologies for Reducing Power
• Near-threshold operation
• 3D die stacking
• Replacing DRAM with Flash memory

Slide 34 - FIN: Thanks

Slide 35 - Background: NAND Flash Overview
• Dual-mode SLC/MLC Flash bank organization.
• Single Level Cell (SLC): 1 bit/cell; 10^5 erases per block; 25 µs read, 200 µs write.
• Multi Level Cell (MLC): 2 bits/cell; 10^4 erases per block; 50 µs read, 680 µs write.
• The addressable read/write unit is the page: 2048 bytes plus 64 'spare' bytes (2112 bytes for an SLC page, 4224 bytes for the corresponding pair of MLC pages).
• Erases happen a block at a time: 64 SLC or 128 MLC pages.
• Technology: less than 60 nm.

Slide 36 - Reducing Memory Power

         Area/bit (µm²)   $/Gb   Active power   Idle power   Read latency   Write latency   Erase latency
  DRAM   0.015            3      495 mW         15 mW        55 ns          55 ns           N/A
  NAND   0.005            0.25   50 mW          6 µW         25 µs          200 µs          1.5 ms
  PCM    0.068            ?      ?              6 µW         55 ns          150 ns          N/A

• Notes: NAND Flash cost assumes 2-bit-per-cell MLC; DRAM is a 2 Gbit DDR3-1333 x8 chip; Flash power numbers are for a 2 Gbit SLC x8 chip; area is from the ITRS Roadmap 2009.
• Flash is denser than DRAM and cheaper than DRAM.
• Flash is good for idle-power optimization: 1000x less power than DRAM.
• DRAM is still required for acceptable access latencies; Flash is not so good for low-access-latency usage models.
• Flash "wears out": 10,000 / 100,000 write/erase cycles (MLC / SLC).

Slide 37 - A Case for Flash as a Secondary Disk Cache
• Many server workloads use a large working set (100s of MB to 10s of GB and even more), which is cached in main memory to maintain high throughput, so a large portion of DRAM serves as a disk cache.
• Many server applications are more read intensive than write intensive.
• Flash memory consumes orders of magnitude less idle power than DRAM.
• Use DRAM for recent and frequently accessed content, and Flash for less recent and infrequently accessed content.
• Client requests follow a spatially and temporally Zipf-like distribution, e.g. 90% of client requests go to 20% of the files.

Slide 38 - A Case for Flash as a Secondary Disk Cache (cont.)
• SPECweb99, MP4/MP8/MP12: network bandwidth (Mbps) versus the access latency of the disk cache serving 80% of the files (12 µs to 1600 µs).
• An access latency of 100s of microseconds can be tolerated.
• T. Kgil and T. Mudge. FlashCache: A NAND Flash Memory File Cache for Low Power Web Servers. Proc. Int. Conf. on Compilers, Architecture, and Synthesis for Embedded Systems (CASES'06), Seoul, S. Korea, Oct. 2006, pp. 103-112.

Slide 39 - Overall Architecture
• Baseline without FlashCache: processors with 1 GB of DRAM serving as generic main memory plus the primary disk cache, in front of the hard disk drive.
• FlashCache architecture: 128 MB of DRAM (main memory and primary disk cache) plus 1 GB of Flash as a secondary disk cache, with a Flash controller, DMA, and tables used to manage the Flash memory (FCHT, FBST, FPST, FGST), in front of the hard disk drive (a toy sketch of the idea follows below).
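To illustrate why a small DRAM tier in front of a large Flash tier works for Zipf-like request streams, here is a toy two-level LRU cache model. It is not the published FlashCache design (the FCHT/FBST/FPST/FGST tables, wear management, and so on are not modeled); the capacities, file count, and Zipf exponent are all assumptions chosen only to show the effect.

```python
# Toy two-level disk-cache model: a small DRAM tier in front of a larger
# Flash tier, with the disk behind both, driven by a Zipf-like request
# stream.  Everything here is an illustrative assumption, not the FlashCache
# implementation.

import random
from bisect import bisect
from itertools import accumulate
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()
    def access(self, key):
        """Return True on a hit; on a miss, insert key (evicting the LRU)."""
        if key in self.data:
            self.data.move_to_end(key)
            return True
        self.data[key] = None
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)
        return False

random.seed(0)
N_FILES, N_REQUESTS = 10_000, 50_000
cum = list(accumulate(1.0 / (i + 1) for i in range(N_FILES)))  # Zipf-like CDF

dram, flash = LRUCache(100), LRUCache(1000)   # DRAM tier ~1/10 the Flash tier
served = {"DRAM": 0, "Flash": 0, "Disk": 0}
for _ in range(N_REQUESTS):
    f = bisect(cum, random.random() * cum[-1])    # draw a popularity-skewed file
    if dram.access(f):
        served["DRAM"] += 1                       # DRAM-speed hit
    elif flash.access(f):
        served["Flash"] += 1                      # tens-of-microseconds hit, tolerable
    else:
        served["Disk"] += 1                       # millisecond-class miss
print({tier: f"{count / N_REQUESTS:.1%}" for tier, count in served.items()})
```

In this toy setup most requests are absorbed by the two cache tiers, which is the property the FlashCache argument relies on.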
Slide 40 - Overall Network Performance (Mbps)
• SPECweb99, MP4/MP8/MP12 network bandwidth for DRAM + Flash configurations (DRAM 32/64/128/256/512 MB each with 1 GB of Flash) versus 1 GB of DRAM alone.
• 128 MB DRAM + 1 GB NAND Flash performs as well as 1 GB DRAM while requiring only about 1/3 the die area (SLC Flash assumed).

Slide 41 - Overall Main Memory Power
• SPECweb99; read, write, and idle power.
• DDR2 1 GB, active: 2.5 W; DDR2 1 GB, powerdown: 1.6 W; DDR2 128 MB + Flash 1 GB: 0.6 W.
• Flash memory consumes much less idle power than DRAM.

Slide 42 - Concluding Remarks on Flash-for-DRAM
• DRAM clockless refresh reduces idle power.
• Flash density continues to grow: the Intel-Micron JV announced 25 nm flash, an 8 GB die of 167 sq mm at 2 bits per cell; 3 bits/cell is coming soon.
• PCRAM appears to be an interesting future alternative.
• I predict single-level storage using some form of NV memory, with disks replacing tape for archival storage.

Slide 43 - Cluster Size and Boosting
• 2-die stack; fixed die size; fixed amount of cache per core; Raytrace algorithm.
• Analysis: 4-core clusters are 27% more energy efficient than 1-core clusters.
• Boosting: the 4-core version achieves 81% more GOPS than the 1-core version (larger cache).
[Figure: GOPS, boosted GOPS, and GOPS/W versus cores per cluster.]

Slide 44 - System Architecture
[Figure: two layers (A and B), each with eight 4-core clusters attached to a layer hub; B-B interfaces, system control, clock, system communication and memory forwarding, JTAG, and DRAM.]

Slide 45 - Cluster Architecture
• Four M3 cores with 15 MHz core clocks at 0, 90, 180, and 270 degree phase offsets (see the sketch below).
• Cluster I-cache: 1 kB, 4-way; cluster D-cache: 8 kB, 4-way; 60 MHz cache clock, generated by the cluster clock generator along with the core clocks.
• AMBA-like buses: 32-bit to the cores, 128-bit to DRAM; cluster MMIO, reset control, etc.; JTAG in and out; system communication.
• 3D integration connects layer 1 and layer 2.
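The cluster clocking above (15 MHz cores with 0/90/180/270 degree phase offsets sharing a 60 MHz cache) can be illustrated with a small scheduling sketch. The assumption that every core accesses the cache on every core cycle is a worst case added for illustration; the slot assignment is one plausible reading of the phase offsets.

```python
# Scheduling sketch: four 15 MHz cores with staggered clock phases share a
# 60 MHz cache, so each core's access lands in its own cache cycle and every
# core appears to have a private, single-cycle L1.

CORES = 4
CACHE_CYCLES_PER_CORE_CYCLE = 4          # 60 MHz cache / 15 MHz cores

def cache_slot(core_id, core_cycle):
    """Cache cycle in which core_id's access issued in core_cycle is served."""
    # The 0/90/180/270-degree phase offsets stagger the cores by one cache
    # cycle each, so core i always owns cache slot i within a core cycle.
    return core_cycle * CACHE_CYCLES_PER_CORE_CYCLE + core_id

for core_cycle in range(3):              # a few core cycles
    slots = [cache_slot(c, core_cycle) for c in range(CORES)]
    assert len(set(slots)) == CORES      # no two cores collide on a cache slot
    print(f"core cycle {core_cycle}: cache slots {slots}")
# Each core owns a distinct 60 MHz slot, which is why the shared cache can
# satisfy all four cores' requests within one core cycle.
```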
Slide 46 - Abstract
With power and cooling becoming an increasingly costly part of the operating cost of a server, the old trend of striving for higher performance with little regard for power is over. Emerging semiconductor process technologies, multicore architectures, and new interconnect technology provide an avenue for future servers to become low power, compact, and possibly mobile. In our talk we examine three techniques for achieving low power: 1) near-threshold operation; 2) 3D die stacking; and 3) replacing DRAM with Flash memory.

Slide 48 - Solutions?
• Reduce Vdd: "Near-Threshold Computing". Drawbacks: slower and less reliable operation. But parallelism suits some computations (more as time goes by), and robustness techniques such as Razor provide in-situ monitoring.
• Cool chips (again): interesting developments in microfluidics.
• Devices that operate at a lower Vdd without performance loss.

Slide 49 - Low-Voltage Robustness
• VDD scaling reduces SRAM robustness; robustness can be maintained through device sizing and VTH selection.
• Robustness is measured using importance sampling.
• In the NTC range the 6T cell is smaller.
[Figure: logic-rule bitcell area (µm²) versus VDD (mV) for 6T and 8T cells, with and without VTH selection.]

Slide 50 - SRAM Designs
• HD-SRAM (half-differential): differential write, single-ended read, asymmetric sizing. Read + write margin (VTH): HD µ/σ = 12.1 / 1.16 versus 6T µ/σ = 11.0 / 1.35 (a rough reading of these numbers appears in the sketch after the last slide).
• Crosshairs: skew VDD in the column and GND in the row to target failing cells, with no bitcell changes; the skew hurts some cells.
[Figure: distribution of read + write margin across chips for the half-differential and 6T designs, and row/column error rate versus VDD and GND skew (0-100 mV) at 1.1 V and 0.7 V.]

Slide 51 - Evolution of Subthreshold Designs
• Subliminal 1 (2006): 0.13 µm CMOS processor; used to investigate the existence of Vmin; 2.60 µW/MHz.
• Subliminal 2 (2007): 0.13 µm CMOS; used to investigate process variation; 3.5 µW/MHz.
• Phoenix 1 (2008): 0.18 µm CMOS; used to investigate sleep current; 2.8 µW/MHz.
• Phoenix 2 (2010): 0.18 µm CMOS; commercial ARM M3 core; used to investigate energy harvesting and power management; 37.4 µW/MHz.
• Unpublished results (ISSCC 2010); do not disclose.
[Die photos with dimensions for each design.]
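Returning to the HD-SRAM versus 6T margin numbers on the SRAM Designs slide: as a rough sanity check (my assumption, not the talk's methodology), treat the read + write margin as Gaussian with the quoted mean and sigma and count a bitcell as failing when its margin drops below zero. The resulting tail probabilities also suggest why importance sampling is used, since they are far too small to estimate with plain Monte Carlo.

```python
# Crude Gaussian-tail reading of the quoted margin statistics.  The Gaussian
# assumption and the fail-at-zero criterion are illustrative only; the slide's
# actual robustness numbers come from importance sampling.

import math

def gaussian_fail_prob(mu, sigma):
    """P(margin < 0) for a Normal(mu, sigma) margin."""
    z = mu / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

cells = {"HD-SRAM": (12.1, 1.16), "6T": (11.0, 1.35)}
for name, (mu, sigma) in cells.items():
    p = gaussian_fail_prob(mu, sigma)
    print(f"{name:8s} mu/sigma = {mu / sigma:5.2f}  estimated P(fail) ~ {p:.1e}")
# Under this crude model the HD-SRAM cell's larger mu/sigma ratio translates
# into a markedly lower bitcell failure probability than the 6T cell.
```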