Multi-Core Parallelism for Low-Power Design

Vishwani D. Agrawal
James J. Danaher Professor
Department of Electrical and Computer Engineering
Auburn University
http://www.eng.auburn.edu/~vagrawal
[email protected]
2/8/06 D&T Seminar

Power Consumption of VLSI Chips

Why is it a concern?

SIA Roadmap for Processors (1999)

Year                      1999   2002   2005   2008   2011   2014
Feature size (nm)          180    130    100     70     50     35
Logic transistors/cm2     6.2M    18M    39M    84M   180M   390M
Clock (GHz)               1.25    2.1    3.5    6.0   10.0   16.9
Chip size (mm2)            340    430    520    620    750    900
Power supply (V)           1.8    1.5    1.2    0.9    0.6    0.5
High-perf. power (W)        90    130    160    170    175    183

Source: http://www.semichips.org

ISSCC, Feb. 2001, Keynote

Patrick P. Gelsinger
Senior Vice President, General Manager
Digital Enterprise Group
Intel Corp.

“Ten years from now, microprocessors will run at 10GHz to 30GHz and be capable of processing 1 trillion operations per second -- about the same number of calculations that the world's fastest supercomputer can perform now.

“Unfortunately, if nothing changes these chips will produce as much heat, for their proportional size, as a nuclear reactor. . . .”

VLSI Chip Power Density

[Figure: power density (W/cm2, log scale from 1 to 10000) versus year (1970 to 2010). Processors plotted: 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, P6. Reference levels: Hot Plate, Nuclear Reactor, Rocket Nozzle, Sun’s Surface. Source: Intel]

Power Dissipation in CMOS Logic (0.25µ)

$P_{total}(0 \to 1) = C_L V_{DD}^2 + t_{sc} V_{DD} I_{peak} + V_{DD} I_{leakage}$

Share of each term, in order: 75%, 20%, 5%

Low-Power Datapath Architecture

• Lower supply voltage
  – This slows down circuit speed
  – Use parallel computing to gain the speed back
• Works well when threshold voltage is also lowered.
• About 60% reduction in power obtainable.
• Reference: A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Boston: Kluwer Academic Publishers (Now Springer), 1995.

A Reference Datapath

[Figure: Input → Register → Combinational logic → Register → Output, clocked by CK; total switched capacitance Cref]

Supply voltage = $V_{ref}$
Total capacitance switched per cycle = $C_{ref}$
Clock frequency = $f$
Power consumption: $P_{ref} = C_{ref} V_{ref}^2 f$

A Parallel Architecture

[Figure: the input feeds N copies of the combinational logic (Comb. Logic Copy 1, Copy 2, ..., Copy N) through registers clocked at f/N; an N to 1 multiplexer selects among the copies and drives the output register, clocked at f by CK. A multiphase clock generator supplies the f/N clocks and the mux control.]

N = degree of parallelism
Each copy processes every Nth input and operates at reduced voltage.
Supply voltage: $V_N \le V_1 = V_{ref}$

Control Signals, N = 4

[Figure: timing diagram showing CK and Phase 1, Phase 2, Phase 3, Phase 4]

Power

$P_N = P_{proc} + P_{overhead}$

$P_{proc} = N (C_{inreg} + C_{comb}) V_N^2 \frac{f}{N} + C_{outreg} V_N^2 f = (C_{inreg} + C_{comb} + C_{outreg}) V_N^2 f = C_{ref} V_N^2 f$

$P_{overhead} = C_{overhead} V_N^2 f \approx \delta C_{ref} (N - 1) V_N^2 f$

$P_N = [1 + \delta (N - 1)] C_{ref} V_N^2 f$

$\frac{P_N}{P_1} = [1 + \delta (N - 1)] \frac{V_N^2}{V_{ref}^2}$

Voltage vs. Speed

Delay of a gate: $T \approx \frac{C_L V_{ref}}{I} = \frac{C_L V_{ref}}{k (W/L) (V_{ref} - V_t)^2}$

where
I is saturation current
k is a technology parameter
W/L is width to length ratio of transistor
Vt is threshold voltage

[Figure: normalized gate delay T (0.0 to 4.0) versus supply voltage for 1.2μ CMOS. Marked points: N = 1 at Vref = 5V, N = 2 at V2 = 2.9V, N = 3 at V3; Vt is marked on the voltage axis.]

Voltage reduction slows down as we get closer to Vt.

Increasing Multiprocessing

[Figure: PN/P1 (0.0 to 1.0) versus N (1 to 12) for 1.2μ CMOS, Vref = 5V. Curves: Vt = 0.8V, Vt = 0.4V, Vt = 0V (extreme case).]

Extreme Cases: Vt = 0

Delay, $T \propto 1/V_{ref}$

For N processing elements, delay = NT → $V_N = V_{ref}/N$

$\frac{P_N}{P_1} = [1 + \delta (N - 1)] \frac{1}{N^2}$ → 1/N

For negligible overhead, δ → 0: $\frac{P_N}{P_1} \approx \frac{1}{N^2}$

For Vt > 0, power reduction is less and there will be an optimum value of N.
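
The slides give the gate-delay model and the power ratio PN/P1 but do not spell out how VN is found when Vt > 0. The Python sketch below is one way to do it, under stated assumptions: it normalizes the delay model to T(V) = V/(V – Vt)^2, requires T(VN) = N·T(Vref) so that each copy may run N times slower, and solves the resulting quadratic for the root above Vt. The function names and the default δ = 0.1 are illustrative choices, not values from these slides; Vref = 5V and Vt = 0.8V are taken from the plotted curves. The results need not match the plotted curves exactly.

```python
import math


def scaled_voltage(n, v_ref=5.0, v_t=0.8):
    """Supply voltage at which gate delay is n times the delay at v_ref.

    Assumes the normalized delay model T(V) = V / (V - v_t)**2 and solves
    T(V) = n * T(v_ref), i.e. k*(V - v_t)**2 = V, for the root above v_t.
    """
    k = n * v_ref / (v_ref - v_t) ** 2
    # Quadratic k*V^2 - (2*k*v_t + 1)*V + k*v_t^2 = 0; take the larger root.
    return ((2 * k * v_t + 1) + math.sqrt(4 * k * v_t + 1)) / (2 * k)


def power_ratio(n, delta=0.1, v_ref=5.0, v_t=0.8):
    """P_N / P_1 = [1 + delta*(N - 1)] * (V_N / V_ref)^2."""
    v_n = scaled_voltage(n, v_ref, v_t)
    return (1 + delta * (n - 1)) * (v_n / v_ref) ** 2


def best_n(max_n=12, delta=0.1, v_ref=5.0, v_t=0.8):
    """Degree of parallelism in 1..max_n with the lowest power ratio."""
    return min(range(1, max_n + 1),
               key=lambda n: power_ratio(n, delta, v_ref, v_t))


if __name__ == "__main__":
    for n in range(1, 13):
        print(n, round(scaled_voltage(n), 2), round(power_ratio(n), 3))
    print("lowest-power N in 1..12:", best_n())
```

With v_t=0.0 the solver reduces to VN = Vref/N, and with delta=0.0 as well the ratio becomes 1/N^2, which agrees with the extreme case above.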

Example: Multiplier Core

• Specification:
  • 200MHz Clock
  • 15W dissipation @ 5V
  • Low voltage operation, VDD ≥ 1.5 volts
  • Relative clock rate = $\frac{(V_{DD} - 0.5)^2}{20.25}$
• Problem:
  • Integrate multiplier core on a SOC
  • Power budget for multiplier ~ 5W

A Multicore Design

[Figure: five multiplier cores (Multiplier Core 1, Multiplier Core 2, ..., Multiplier Core 5) with registers clocked at 40MHz, a 5 to 1 mux, and input and output registers at 200MHz (CK); a multiphase clock generator supplies the 40MHz clocks and the mux control.]

Core clock frequency = 200/N, N should divide 200.

How Many Cores?

• For N cores:
  • Clock frequency = 200/N MHz
  • Supply voltage, $V_{DDN} = 0.5 + (20.25/N)^{1/2}$ volts
  • Assuming 10% overhead per core, power dissipation = $15 \, [1 + 0.1 (N - 1)] \left( \frac{V_{DDN}}{5} \right)^2$ watts

Design Tradeoffs

Number of cores N   Clock (MHz)   Core supply VDDN (volts)   Total power (watts)
1                   200           5.00                       15.0
2                   100           3.68                       8.94
4                   50            2.75                       5.90
5                   40            2.51                       5.29
8                   25            2.10                       4.50

Power Reduction in Processors

• Just about everything is used.
• Hardware methods:
  • Voltage reduction for dynamic power
  • Dual-threshold devices for leakage reduction
  • Clock gating, frequency reduction
  • Sleep mode
• Architecture:
  • Instruction set
  • Hardware organization
• Software methods

Parallel Architecture

Single processor, input and output at f:
Capacitance = C, Voltage = V, Frequency = f, Power = $CV^2f$

Two processors in parallel, each at f/2, input and output at f:
Capacitance = 2.2C, Voltage = 0.6V, Frequency = 0.5f, Power = $0.396\,CV^2f$

Pipeline Architecture

Single processor between input and output registers, clocked at f:
Capacitance = C, Voltage = V, Frequency = f, Power = $CV^2f$

Two ½ Proc. stages separated by a register, clocked at f:
Capacitance = 1.2C, Voltage = 0.6V, Frequency = f, Power = $0.432\,CV^2f$

Approximate Trend

              n-parallel proc.   n-stage pipeline proc.
Capacitance   nC                 C
Voltage       V/n                V/n
Frequency     f/n                f
Power         $CV^2f/n^2$        $CV^2f/n^2$
Chip area     n times            10-20% increase

G. K.
Yeap, Practical Low Power Digital VLSI Design, Boston: Kluwer Academic Publishers, 1998.

Multicore Processors

[Figure: performance based on SPECint2000 and SPECfp2000 benchmarks versus year (2000, 2004, 2008), with curves for single core and multicore. Source: Computer, May 2005, p. 12]

Multicore Processors

• D. Geer, “Chip Makers Turn to Multicore Processors,” Computer, vol. 38, no. 5, pp. 11-13, May 2005.
• A. Jerraya, H. Tenhunen and W. Wolf, “Multiprocessor Systems-on-Chips,” Computer, vol. 38, no. 7, pp. 36-40, July 2005; this special issue contains three more articles on multicore processors.
• S. K. Moore, “Winner Multimedia Monster – Cell’s Nine Processors Make It a Supercomputer on a Chip,” IEEE Spectrum, vol. 43, no. 1, pp. 20-23, January 2006.

Cell - Cell Broadband Engine Architecture

Nine-processor chip: 192 Gflops

[Photo, L to R: Atsushi Kameyama, Toshiba; James Kahle, IBM; Masakazu Suzoki, Sony. © IEEE Spectrum, January 2006]

Cell’s Nine-Processor Chip

Eight Identical Processors
f = 5.6GHz (max)
44.8 Gflops

[Figure © IEEE Spectrum, January 2006]

Amdahl’s Law

[Figure: execution time normalized from 0 to 1, divided into a serial fraction S and a parallelizable fraction P = 1 – S]

$\text{Speedup} = \frac{1}{S + (1 - S)/N}$

where N = number of parallel processors

Example:
S = 0.6, N = 10, Speedup = 1.56
S = 0.6, N = ∞, Speedup = 1.67

Gene Amdahl, “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities,” AFIPS Conference Proceedings, (30), pp. 483-485, 1967.

Question

• Can we find a multi-processing law
  – for power reduction, or
  – for performance per watt?
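
The two Amdahl’s Law examples given earlier (S = 0.6 with N = 10 and with N = ∞) can be reproduced with a few lines of Python. This is a minimal sketch of the speedup formula only; the function name amdahl_speedup is an illustrative choice, and S is taken to be the serial fraction of the run time, as the formula implies.

```python
import math


def amdahl_speedup(serial_fraction, n_processors):
    """Speedup = 1 / (S + (1 - S)/N) for serial fraction S and N processors."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    # Dividing by math.inf gives 0.0, so N = infinity needs no special case.
    denominator = serial_fraction + (1.0 - serial_fraction) / n_processors
    return math.inf if denominator == 0.0 else 1.0 / denominator


if __name__ == "__main__":
    print(f"{amdahl_speedup(0.6, 10):.2f}")        # prints 1.56
    print(f"{amdahl_speedup(0.6, math.inf):.2f}")  # prints 1.67
```

With S = 0.6, no number of processors brings the speedup above 1/S ≈ 1.67.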