Challenges for High Performance Processors
Hiroshi Nakamura, Research Center for Advanced Science and Technology, The University of Tokyo
France-Japan PAAP Workshop, November 2, 2007

What's the challenge?
- Our primary goal is performance.
- How? Increase the number and/or operating frequency of the functional units, AND supply those functional units with sufficient data (bandwidth).
- Problems:
  - Memory Wall: system performance is limited by poor memory performance.
  - Power Wall: power consumption is approaching the cooling limit.

Memory Wall Problem
- [Figure: relative performance of CPU and DRAM, 1980-2010.] CPU performance improves by about 55% per year, DRAM performance by only about 7% per year, so the gap keeps widening.

Example of the Memory Wall: a 2 GHz Pentium 4 running a[i] = b[i] + c[i]
- [Figure: performance in MFLOPS versus vector length, from 10 to 1,000,000.] While the vectors fit in the L1 or L2 cache, performance stays high; once every access misses in the cache, performance drops to about 1/6, even with a non-blocking cache and out-of-order issue.
- The cause is the lack of effective memory throughput (a minimal measurement sketch follows).
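The drop in the Pentium 4 curve can be reproduced by sweeping the vector length of the triad loop. The sketch below is a minimal reconstruction of such a measurement in plain C, not the original benchmark: the repetition counts, the use of clock() for timing, and the guard against dead-code elimination are illustrative assumptions.

```c
/* Minimal sketch of an a[i] = b[i] + c[i] sweep.  Short vectors fit in
 * L1/L2 and run fast; long vectors miss in the cache and expose the lack
 * of effective memory throughput. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double triad_mflops(size_t n)
{
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    if (!a || !b || !c) { fprintf(stderr, "out of memory\n"); exit(1); }

    for (size_t i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

    /* Keep the total work roughly constant so each length runs long
     * enough to be timed with clock(). */
    long reps = 100000000L / (long)n + 1;

    clock_t t0 = clock();
    for (long r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + c[i];            /* 1 FLOP, 3 memory accesses */
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

    volatile double sink = a[n - 1];       /* discourage dead-code elimination */
    (void)sink;

    double mflops = (double)n * (double)reps / sec / 1e6;
    free(a); free(b); free(c);
    return mflops;
}

int main(void)
{
    size_t lengths[] = {10, 100, 1000, 10000, 100000, 1000000};
    for (size_t i = 0; i < sizeof lengths / sizeof lengths[0]; i++)
        printf("n = %8zu : %8.1f MFLOPS\n", lengths[i], triad_mflops(lengths[i]));
    return 0;
}
```

Compile without aggressive optimization (or check the generated code) so the loop is not simplified away; the absolute numbers depend on the machine, but the shape of the curve should match the slide.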
Recap: the Memory Wall Problem
- There is a growing gap between processor and memory speed; Itanium 2 (Montecito) carries a huge L3 cache (12 MB x 2) to compensate.
- In high performance computing (HPC), performance is limited by the memory system: the long access latency of main memory and its lack of throughput.
- Making full use of wide-bandwidth local (on-chip) memory is therefore indispensable. On-chip memory is a valuable resource and is not large enough for HPC working sets, so data locality must be exploited.

Does cache work well in HPC?
- It works well in many cases, but it is not the best choice for HPC.
- Data placement and replacement are decided by hardware, so unfortunate line conflicts occur even though most data accesses are regular; for example, data used only once flushes out other useful data.
- The off-chip transfer size (the cache line) is fixed: for consecutive data a larger transfer size would be preferable, while for non-consecutive data a large line transfer moves unnecessary data and wastes bandwidth.
- Most HPC applications exhibit regularity in their data accesses, and the cache does not always take advantage of it.

SCIMA: Software Controlled Integrated Memory Architecture [Kondo, ICCD 2000]
(joint work with Prof. Boku of the University of Tsukuba and others)
- SCIMA adds an addressable SCM (Software Controllable Memory) beside the ordinary cache.
- The SCM occupies part of the logical address space and has no inclusion relation with the cache.
- SCM and cache capacity are reconfigurable at the granularity of a cache way.
- [Figure: overview of SCIMA — ALU/FPU and registers backed by on-chip SCM and cache, a network interface adapter (NIA), and off-chip DRAM.]

Data transfer instructions
- load/store: move data between registers and the SCM or cache, as usual.
- page-load/page-store (new): move data between the SCM and off-chip memory at large granularity, which reduces latency stalls and widens the effective bandwidth; block-stride transfers avoid moving unnecessary data and make more effective use of the on-chip memory.

Strategy of software control
- The SCM must be controlled by software. Arrays are classified into six groups by their consecutiveness (consecutive, strided, irregular) and their reusability:

                   not reusable                        reusable
  consecutive  (1) use SCM as a stream buffer     (4) reserve SCM for reused data
  strided      (2) use SCM as a stream buffer     (5) reserve SCM for reused data
  irregular    (3) do not use SCM                 (6) reserve SCM for reused data

- First, apply (1) and (2): allocate a small stream buffer in the SCM. Second, apply (4), (5), and (6): allocate the rest of the SCM for reused data.
- A prototype semi-automatic compiler exists; users specify hints on the reusability of the data arrays (see the sketch below).
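As an illustration of this strategy, the following C-style sketch shows how a kernel touching one streamed array and one reused array might be mapped onto the SCM. Everything here is hypothetical and written for this text: the names scima_page_load, the SCM layout, and the software fallback are assumptions, not the actual SCIMA instructions or the semi-automatic compiler's interface.

```c
/* Hypothetical sketch of the SCM usage strategy: group (1) streamed data
 * and group (4) reused data.  scima_page_load stands in for the real
 * page-load instruction and is modelled with memcpy so the sketch runs
 * anywhere. */
#include <stdio.h>
#include <stddef.h>
#include <string.h>

#define SCM_ELEMS        (48 * 1024 / sizeof(double))  /* assumed 48 KB SCM   */
#define STREAM_BUF_ELEMS 512                           /* small stream buffer */

/* Model the SCM as plain arrays; on SCIMA these regions would sit in the
 * SCM part of the logical address space. */
static double scm_stream[STREAM_BUF_ELEMS];              /* group (1) */
static double scm_reused[SCM_ELEMS - STREAM_BUF_ELEMS];  /* group (4) */

/* Software stand-in for page-load: one large-granularity transfer from
 * off-chip memory into the SCM. */
static void scima_page_load(double *scm_dst, const double *mem_src, size_t elems)
{
    memcpy(scm_dst, mem_src, elems * sizeof *scm_dst);
}

/* y[i] += a[i] * x[i % m]: a[] is consecutive and used once (group 1),
 * x[] is reused throughout the kernel (group 4, m must fit in scm_reused). */
void kernel(double *y, const double *a, const double *x, size_t n, size_t m)
{
    /* Second step of the strategy: keep the reused array resident in SCM. */
    scima_page_load(scm_reused, x, m);

    /* First step: stream a[] through the small SCM buffer, chunk by chunk. */
    for (size_t base = 0; base < n; base += STREAM_BUF_ELEMS) {
        size_t chunk = n - base < STREAM_BUF_ELEMS ? n - base : STREAM_BUF_ELEMS;
        scima_page_load(scm_stream, a + base, chunk);
        for (size_t i = 0; i < chunk; i++)
            y[base + i] += scm_stream[i] * scm_reused[(base + i) % m];
    }
}

int main(void)
{
    enum { N = 4096, M = 1024 };
    static double y[N], a[N], x[M];
    for (size_t i = 0; i < N; i++) a[i] = 1.0;
    for (size_t i = 0; i < M; i++) x[i] = 2.0;

    kernel(y, a, x, N, M);

    double sum = 0.0;
    for (size_t i = 0; i < N; i++) sum += y[i];
    printf("checksum = %.1f\n", sum);   /* expected: 4096 * 2 = 8192.0 */
    return 0;
}
```

In the real system the classification into groups and the SCM allocation would come from the programmer's reusability hints and the semi-automatic compiler, not from hand-written code like this.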
Results: memory traffic
- Benchmarks: CG, FT, and QCD.
- Assumptions: four ways in total; line size 32 B or 128 B. Cache-only model: 64 KB, 4-way cache, no SCM. SCIMA model: 16 KB, 1-way cache plus 48 KB of SCM.
- [Figure: memory traffic normalized to the cache-only model, split into cache misses and page-load/page-store.] Unnecessary memory traffic is suppressed: SCIMA reduces memory traffic by 1% to 61% because data reusability is fully exploited.

Results: performance
- Execution time is broken down into CPU busy time, latency stall (time lost to memory latency), and throughput stall (time lost to lack of throughput).
- Assumptions: load/store latency of 2 cycles, bus throughput of 4 B/cycle, memory latency of 40 cycles.
- [Figure: normalized execution time for CG, FT, and QCD.] SCIMA is 1.3 to 2.5 times faster than the cache-only model: latency stalls shrink thanks to large-granularity transfers, and throughput stalls shrink because unnecessary data transfer is suppressed.

Power Wall
- Next focus: the power consumption of processors. Is there room for power reduction, and if so, how do we reduce it?
- Heat density keeps rising; the Itanium, for example, dissipates 130 W.

Observation (1): Moore's Law
- The number of transistors doubles every 18 months.

Observation (2): frequency
- Frequency doubles every 3 years. Combined with transistor count doubling every 18 months, the number of switching events on a chip grows by a factor of 8 every 3 years.

Observation (3): performance
- Switching grows 8x every 3 years, yet effective performance grows only about 4x every 3 years ("microprocessor performance improved 55% per year", J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann).
- The unnecessary switching, and hence the opportunity for power reduction, therefore doubles every 3 years.

Evidence for the observation [Zyuban, ISLPED 2000]
- [Figure: access energy per instruction (nJ) versus issue width (4 to 12), broken down into rename map table, bypass mechanism, load/store window, issue window, register file, functional units, and flushed versus committed instructions.]
- Energy per instruction rises as ILP is exploited for higher performance: it stays flat in the functional units but grows in the issue window and register file, and the energy spent on instructions flushed by incorrect prediction also grows. This is a waste of power.

Registers
- Register files consume a lot of power; roughly speaking, power is proportional to (number of registers) x (number of ports), and high-performance wide-issue superscalar processors need more registers and more read/write ports.
- Open question: in HPC, what is the best way to feed many functional units (or accelerators) from the perspective of register-file design? Scalar registers with SIMD operations, vector registers with vector operations, or something else?
- Personal impression: vector registers are accessed in a well-organized fashion, so the number of ports can easily be reduced by sub-banking. But can vector operations make good use of local on-chip memory? (At least, traditional vector processors never could.)

Dual core helps...
- Rule of thumb: a 1% voltage reduction cuts power by about 3%, and a 1% frequency reduction costs about 0.66% of performance.
- In the same process technology, compare one core at nominal voltage and frequency (area 1, power 1, performance 1) with two cores at 15% lower voltage and frequency (area 2): each core burns roughly half the power (15% x 3% ≈ 45% less) but loses only about 10% of its performance (15% x 0.66%), so the pair stays within roughly the original power budget and delivers about 1.8x the performance.

Multi-core helps more...
- [Figure: one large core versus four small cores sharing a cache.] Replacing one large core with four small cores, each at a quarter of the power and half the performance, keeps the total power while raising throughput, with no need for wider instruction issue.
- Multi-core is power efficient and allows better power and thermal management.

The leakage problem
- [Figure from IEEE Computer Magazine / [Borkar, MICRO 2005]: projected power and power density of a 10 mm die from 90 nm down to 16 nm, separating active power from sub-threshold and gate-oxide (SiO2) leakage; leakage grows rapidly with each generation.]
- How do we attack the leakage problem?

Introduction of our research: "Innovative Power Control for Ultra Low-Power and High-Performance System LSIs"
- A 5-year project started in October 2006, supported by JST (Japan Science and Technology Agency) as a CREST (Core Research for Evolutional Science and Technology) program.
- Objective: drastic power reduction of high-performance system LSIs by innovative power control through tight cooperation across design levels, including circuit, architecture, and system software.
- Members: Prof. H. Nakamura (U. Tokyo), architecture and compiler [leader]; Prof. M. Namiki (Tokyo Univ. of Agriculture and Technology), OS; Prof. H. Amano (Keio Univ.), architecture and front-end design; Prof. K. Usami (Shibaura Institute of Technology), circuit and back-end design.

How to reduce leakage: power gating
- We focus on power gating to reduce leakage: insert a power switch (PS) in the path between VDD and GND (here between the logic gates and GND, creating a virtual GND), and turn the switch off while the logic sleeps.

Run-time power gating (RTPG)
- A sleep-control circuit turns the power switches of individual circuit blocks on and off at run time.
- Coarse grain: e.g., a Renesas mobile processor with independent power domains for the baseband module, the MPEG module, and so on.
- Fine grain (our target): power gating within a module.

Fine-grain run-time power gating
- Longer sleep time is preferable: the leakage savings must outweigh the power penalty of waking up. Evaluation on a real chip had not been reported before.
- Test vehicle: a 32-bit x 32-bit multiplier. One or both operands are often less than 16 bits wide, so the circuit portions that compute the upper bits of the product need not operate and their leakage is wasted. By detecting zeros in the upper 16 bits of the operands, the unused part of the internal multiplier array is power gated.

Test chip "Pinnacle": real measurements
- Technology: STARC 90 nm CMOS. Multiplier core area: 0.544 x 0.378 mm2, about 15,000 cells. Design time: 4.5 months by three master's students, one bachelor's student, and one faculty member.
- [Figure: measured power dissipation (mW) at 25 C, 85 C, and 125 C for Sequence 1 (no sleep, FG-RTPG not applied), Sequence 2 (domain H sleeps), and Sequence 3 (domains H and M sleep, FG-RTPG applied).] The chip exhibits good power reduction.
- Current status: designing a pipelined microprocessor with FG-RTPG, and a compiler (instruction scheduler) that increases sleep time.

Low-power Linux scheduler based on statistical modeling
- Co-optimization of system software and architecture. Objective: a process scheduler that reduces power consumption by applying DVFS (dynamic voltage and frequency scaling) to each process while satisfying its performance constraint.
- How do we find the lowest frequency that still satisfies the performance constraint? It depends on the hardware and on program characteristics, and the performance ratio differs from the frequency ratio, so the answer is hard to find directly. Instead, the relationship is modeled by statistical analysis of hardware events; a sketch of the idea follows.
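The sketch below illustrates only the frequency-selection step. It assumes a deliberately simple model in which a process's execution time splits into a CPU-bound part that scales with 1/f and a memory-bound part that does not, with the split taken from hardware event counters; this linear model, the frequency table, and the function names are assumptions made for this text, not the project's actual statistical model or any Linux interface.

```c
/* Pick the lowest frequency whose predicted relative performance stays
 * above a threshold.  The model (cpu-bound fraction scales with 1/f,
 * memory-bound fraction does not) is a simplifying assumption. */
#include <stdio.h>

/* Available frequency steps in MHz, highest first (illustrative values). */
static const double freqs_mhz[] = {2000.0, 1800.0, 1600.0, 1400.0, 1200.0, 1000.0, 800.0};
static const int nfreqs = sizeof freqs_mhz / sizeof freqs_mhz[0];

/* Predicted performance at frequency f relative to running at f_max.
 * cpu_frac is the fraction of time the process is CPU bound at f_max,
 * e.g. estimated from hardware events such as stall-cycle counters. */
static double predicted_perf(double cpu_frac, double f, double f_max)
{
    double mem_frac = 1.0 - cpu_frac;                    /* insensitive to f */
    double time_rel = cpu_frac * (f_max / f) + mem_frac; /* time vs. f_max   */
    return 1.0 / time_rel;                               /* performance ratio */
}

/* Lowest frequency whose predicted performance meets the threshold. */
static double pick_frequency(double cpu_frac, double threshold)
{
    double best = freqs_mhz[0];
    for (int i = 0; i < nfreqs; i++)
        if (predicted_perf(cpu_frac, freqs_mhz[i], freqs_mhz[0]) >= threshold)
            best = freqs_mhz[i];                          /* keep lowering f */
    return best;
}

int main(void)
{
    /* A memory-bound process (like mcf or swim) tolerates a much lower
     * frequency than a CPU-bound one for the same performance threshold. */
    printf("cpu-bound (0.9), threshold 0.9: %.0f MHz\n", pick_frequency(0.9, 0.9));
    printf("mem-bound (0.3), threshold 0.9: %.0f MHz\n", pick_frequency(0.3, 0.9));
    return 0;
}
```

In the real scheduler the prediction would come from the statistically fitted model of hardware events measured per process, and the chosen frequency would be applied to that process through DVFS.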
Evaluation result (May 8, 2007)
- Platform: Pentium M 760 (2.00 GHz max, 533 MHz FSB). Workloads: mcf, bzip2, swim, mgrid, and matrix kernels of size 50, 600, and 1000.
- [Figure: relative performance versus the specified performance threshold, 0.4 to 1.0.] Performance stays within the specified threshold in all cases except mgrid, and tracks the threshold (the black dotted line) to within 3% to 7%, so an accurate model is obtained.
- A Linux scheduler using this model has been developed.

Summary
- The challenges for high-performance processors are the Memory Wall and the Power Wall.
- One solution to the memory wall: make good use of on-chip memory with software controllability.
- Solutions to the power wall: many cores will relax the problem, but leakage current is becoming a major issue, so new research and new approaches are required; our project "Innovative Power Control for Ultra Low-Power and High-Performance System LSIs" was introduced.