Download Pentium® 4 - Webcourse

Low-power computer architecture
Dr. Avi Mendelson

Disclaimer
  - No Intel proprietary information is disclosed.
  - Every future estimate or projection is only speculation.
  - Responsibility for all opinions and conclusions falls on the author only.
  - That does not mean you cannot trust them…

© Dr. Avi Mendelson

Agenda
  - The power crisis
  - Power consumption
  - Power density and thermal limitations
  - General solutions and directions

Moore's law
  "Doubling the number of transistors on a manufactured die every year" - Gordon Moore, Intel Corporation
  [Figure: transistors per die (10^2 to 10^9, log scale) vs. year (1970 to 2000). Memory: 1K through 256M. Microprocessors: 4004, 8080, 8086, 80286, i386™, i486™, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4. Source: Intel]

In the Last 25 Years Life was Easy (*)
  - Doubling of transistor density every 30 months
  - Increasing die sizes, allowed by:
      - Increasing wafer size
      - Process technology moving from "black art" to "manufacturing science"
  => Doubling of transistors every 18 months

  Old arch        mm (linear)   New arch        mm (linear)   Ratio (area)
  i386C           6.5           i486            11.5          3.1
  i486C           9.5           Pentium®        17            3.2
  Pentium®        12.2          Pentium® Pro    17.3          2.1
  Pentium® III    10.3          Next Gen        ?             2-3
  [The values of the slide's "Tech" column are missing from the source.]

  Implications (in the same technology):
    1. A new arch takes ~2-3X the die area of the last arch
    2. It provides 1.5-1.7X the integer performance of the last arch
  (*) Source: Fred Pollack, Micro-32

Suddenly, the power monster appears in every market segment

Processor Power Evolution
  [Figure: max power (watts, log scale 1 to 100) by process generation for i386, i486, Pentium®, Pentium® w/MMX tech., Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4, then "?". The process labels on the x-axis are not recoverable.]
  - Traditionally: a new generation always increases power
  - Compactions: higher performance at lower power
  - Used to be "one size fits all": start with high power and shrink to mobile

The power crisis: power consumption
  [Figure. Source: CoolChips, Micro-32]

Power challenges per segment
  Servers
    - Power-related system cost drivers: thermal cost, delivery cost, form factor
    - Price drivers: performance, perf/inch^3
    - Optimization point: max performance @ thermal constraint
  Desktops
    - Power-related system cost drivers: thermal cost, delivery cost
    - Price drivers: performance, noise, perf/$$
    - Optimization point: max performance @ thermal constraint
  Mobile
    - Power-related system cost drivers: thermal cost, delivery cost, form factor, battery size
    - Price drivers: performance, noise, perf/kg, battery life
    - Optimization point: max performance @ thermal constraint, max battery life
  Handhelds
    - Power-related system cost drivers: form factor, battery size, battery cost
    - Price drivers: performance, battery life
    - Optimization point: max battery life, max perf/power to meet the application's need

Power & Energy
  Power
    - Dynamic power: consumed by transistors during switching.
        P = a C V^2 f, the work done per time unit (watts)
        (a: activity, C: capacitance, V: voltage, f: frequency)
    - Static power (leakage): consumed by all "inactive" transistors; it depends on temperature and voltage.
    - Power-aware architectures -> aim to reduce peak power
  Energy
    - The power consumed over some period of time.
    - Energy-aware architectures -> aim to reduce average power consumption

Power Evolution (Theoretical)
  [Figure: leakage power and active power (watts, 0 to 250) by process generation]
  - For a 15 mm/side die (225 mm^2)
  - Assumes a 2X frequency increase each generation
  - Future process numbers are estimated

Why high power matters
  Power limitations
    - Higher power -> higher current
        - Cannot exceed platform power delivery constraints
    - Higher power -> higher temperature
        - Cannot exceed the thermal constraints (e.g., Tj < 100°C)
        - Increases leakage
        - The heat must be controlled in order to avoid electromigration and other "chemical" reactions of the silicon
  Energy
    - Affects battery life.
        - Consumer devices: the processor may consume most of the energy
        - Mobile computers (laptops): the system (display, disk, cooling, energy supplier, etc.) consumes most of the energy
    - Affects the cost of electricity

Power Density
  [Figure: power density (watts/cm^2, log scale 1 to 1000) by process generation for i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4, against reference levels: hot plate, nuclear reactor, rocket nozzle]
  * "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies", Fred Pollack, Intel Corp., Micro-32 conference keynote, 1999.

Why do power and power density increase over time?

How do we keep up with Moore's Law?
  - Every 18 months, on average, we introduce a new process
  - The new process shrinks the dimensions of the transistors by 0.7 (ideal shrink)
  - As a result, on the same die area, we can have more transistors, each of them running at a higher frequency
  - One may mistakenly think that this is the reason for the increase in power and power density.

Scaling theory, 1 of 2
  (All values are scale factors relative to the previous process.)
  - Width = W = 0.7, Length = L = 0.7, tox = 0.7
      -> Lateral and vertical dimensions reduce by 30%
  - Area cap = Ca = (0.7 x 0.7) / 0.7 = 0.7, Fringing cap = Cf = 0.7, Total cap = C = 0.7
      -> Capacitance (area and fringing) reduces by 30%
  - Die area = X x Y = 0.7 x 0.7 = 0.7^2
      -> Die area reduces by 50%

Scaling theory, 2 of 2
  - Cap / Transistor = 0.7 / 1 = 0.7
      -> Capacitance per transistor reduces by 30%
  - Cap / Area = 0.7 / (0.7 x 0.7) = 1 / 0.7
      -> Capacitance per unit area increases by 43%
  - Vdd = 0.7, Vt = 0.7, I = (W / tox) x (Vdd - Vt) = (0.7 / 0.7) x 0.7 = 0.7
  - T = (C x Vdd) / I = (0.7 x 0.7) / 0.7 = 0.7
  - Power = C x V^2 x f = (0.7 x 0.7^2) / 0.7 = 0.7^2
      -> Delay reduces by 30%, power reduces by 50%

Ideal Scenarios...
                        Ideal "Shrink"    Ideal new arch
  Architecture          Same              New
  #Xistors              Same              2X
  Die size              0.5X              1X
  Frequency             1.5X              1.5X
  Power                 0.5X              1X
  IPC (instr./cycle)    1X                2X
  Performance           1.5X              3X
  Power density         1X                1X

Process Technologies: Reality
  But in reality:
    - The new process is not ideal anymore
    - New designs squeeze frequency to 2X per process
    - New designs use more transistors (2X-3X to get 1.5X-1.7X perf)
  So, with every new process and architecture generation:
    - Power goes up about 2X
    - Power density goes up 30%~80%
  This is bad, and it will get worse in future process generations:
    - Voltage (Vdd) will scale down less
    - Leakage is going through the roof

Die increases in order to maintain performance boost
  [Figure: die sizes across silicon process technologies (1.5µ, 1.0µ, 0.8µ, 0.6µ, 0.35µ, 0.25µ, 0.18µ, 0.13µ) for the Intel386™ DX, Intel486™ DX, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4 processors]

Put it all together: power and power density are a real threat to Moore's law
  - Complex algorithms lead to denser power:
      - Dense random logic
  - Timing pressure leads to faster/bigger/power-hungrier gates
  - Designers put together units that communicate with each other. This creates "regions" with high activity factors -> hot spots.
  - Power is not distributed evenly over the chip. A failure can happen if a single point reaches the max power point.
  - Many modern processors are power limited

Some implications
  - We can't build microprocessors with ever-increasing power density and die sizes
  - The constraint is power, not manufacturability
  - The design of any future microprocessor should take power into consideration.
  - We need to distinguish between different aspects of power:
      - Power delivery
      - Max power (Tj)
      - Power density: hot spots
      - Energy: static + dynamic
  - Power- and energy-aware design should take care of each of these aspects
  - One size does not fit all anymore

General solutions and directions
  - Assume that one size does not fit all.
  - Different segments may need different solutions (although many of them share the same principle of operation).

Embedded systems vs. Laptops
  Embedded systems
    - Most of the power is consumed by the CPU
    - Usually not thermally limited
    - What we really care about is battery life and meeting the timing limitations
    - In real-time systems we can take advantage of known "deadlines"
  Laptops (mobile systems)
    - We are thermally limited
    - We cannot use deadlines (most of the time)
    - We need to optimize for max battery life and max performance in a given power envelope

How to extend battery life: Voltage Scaling
  - Within a given voltage range, a higher voltage allows a higher frequency
  - Used for trading power against frequency, either:
      - Statically, at manufacturing time
      - Dynamically, at run time (e.g., Intel's SpeedStep® Technology)
  - The actual range depends on the specific design and process technology
  - Examples*:
      - Intel® XScale™ processors run from 0.75V (150MHz/50mW) to 1.65V (800MHz/900mW)
      - Intel mobile Pentium® III processor sells from 1.1V (600MHz) to 1.7V (1GHz)
  [Figure: XScale processor frequency (MHz) and power (mW), 0 to 1000, vs. voltage (0.5V to 1.9V)]
  * Source: Intel Corp. (http://developer.intel.com)

Voltage Scaling (cont.)
  - Huge effect on dynamic power: a 20% frequency reduction allows a 20% voltage reduction, which gives
      - About 35% energy reduction
        (a C V^2 becomes a C (0.8 V)^2 = 0.64 a C V^2)
      - About 50% power reduction
        (a C V^2 f becomes a C (0.8 V)^2 (0.8 f) = 0.51 a C V^2 f)
  - Even more impressive if we recall:
      - A 20% frequency hit costs only a 10%-15% performance hit*
  - Voltage scaling can be used to trade performance for power
  - Reduce the power consumption when performance needs can be relaxed,
    e.g., if deadlines are known and we have enough "dead time", we can extend the execution time in exchange for lowering the voltage
  - BUT it has technology limitations
  * Depends mainly on the core-to-bus frequency ratio and the cache size.

How to extend battery life: Energy Efficiency
  - Energy per task is
      - Proportional to the number of processed instructions per task
      - Proportional to the average energy consumed per instruction
  - "Energy per (retired) instruction" = b*W, where
      - b: ratio of the total to the retired number of processed instructions
      - W: average energy spent in processing an instruction
  - Both figures deteriorate with every new microarchitecture
      - Since speculation increases and complexity grows
  - In that respect, high-performance modern microarchitectures are less energy-efficient

Improving Hot Spots: Clustering
  - Build your system as a clustered architecture (e.g., Alpha)
  - Design your system so that when all clusters are active the system exceeds the max power allowed
  - Most of the time, not all the clusters are active
  - "Smart scheduling" will spread the thermal hot spots among different clusters
  - In VLIW-based architectures, compilers can help

Alpha hot spots
  [Figure: Area 30%, Freq. 50%, Power 67%. Source: CoolChips-99]

Power Complexity Metrics
  - Power ~ a C V^2 f
  - Metrics: suppose we introduce a new feature that consumes x extra power and gains y performance:
      1. Power/Perf (~ Energy)
           - For battery life and energy bills
           - Assumes the same technology (same C) and the same voltage
      2. Power/Perf^2 (~ Energy*Delay)
           - For a given power envelope, without voltage scaling
           - Balances performance and power needs
      3. Power/Perf^3 (~ Energy*Delay^2)
           - For a given power envelope, with voltage scaling
           - Assumes that we can (1) trade frequency against voltage, and (2) lower the voltage as much as we wish

E*D product (lower is better)
  - E = energy / instruction = Power * sec / instruction = Watt / MIPS
  - D = sec / instruction = 1 / MIPS
  - E*D ~ Watt / MIPS^2
  [Figure: delay, energy (pJ), and ExD plotted against Vdd (volts, 0 to 3)]

Leakage control
  - Leakage depends on technology, area, voltage and temperature
  - High temperature -> high leakage -> high power -> higher temperature
  - Leakage will be very significant in future microarchitectures
  - Large caches contribute to performance but may increase the power due to leakage
  - Larger caches: better performance, but higher leakage -> slower clock -> lower performance
  - Leakage makes the major difference between clock gating and deep sleep modes (where power is disconnected)

Design for power: Out-Of-Order Execution
  - OOO architecture was found to be very efficient in masking the effect of L1 cache misses
  - Aggressive OOO and wider machines require more registers and memory ports
  - It consumes a lot of power
  - Can we slow down the access to the cache and let the OOO solve the performance problem?
  - Can we simplify the OOO mechanisms, assuming that the memory subsystem limits the performance?
  - How aggressive should we be with speculation (branch prediction, value prediction, etc.)?

Pentium Pro Power Breakdown
  - Actual computation: less than 25%!
  - What can be done:
      - Trace cache
      - Many low-level improvements
  Breakdown:
    Fetch          14%
    Decode         14%
    RAT             4%
    MOB             4%
    ROB             7%
    RS              5%
    Int Exec        6%
    FP Exec         5%
    Data $          7%
    External Bus    6%
    Clock           5%
    Misc           23%

SMT
  - A single-CPU µArch augmented to look like 2 or more CPUs to the software
  - Adds ~10% logic to the CPU (Alpha experience)
  - Average power increases by <10%
  - Can increase the performance of two threads by 20-50% relative to running the same applications sequentially
  - Looks like a good tradeoff between power and performance

MT - Implications on power
  - The area and the power consumption of register files and memory elements within the processor increase significantly due to aggressive out-of-order and aggressive SMT (Alpha, CoolChips '99)
  - Increases the power at the hot spots; not a fit for thermally limited segments (where performance is needed)
  - May better tolerate cache misses, so power-aware caches can be used
  - Hot spots may force us to use more aggressive clustering

Questions?
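
Backup: checking the arithmetic
  The numbers on the "Voltage Scaling (cont.)" and "Scaling theory" slides can be reproduced with a few lines of Python. This is a minimal sketch and not part of the original deck: the function names (dynamic_power, voltage_scaling, ideal_shrink) are made up for illustration, and it assumes that voltage can be lowered in step with frequency and that the shrink is ideal (linear factor 0.7).

```python
"""Back-of-the-envelope checks for the dynamic-power and scaling slides."""


def dynamic_power(a, c, v, f):
    """Dynamic power P = a * C * V^2 * f, in watts.

    a: activity factor, c: switched capacitance (farads),
    v: supply voltage (volts), f: clock frequency (hertz).
    """
    return a * c * v ** 2 * f


def voltage_scaling(scale):
    """Relative energy and power when V and f are both multiplied by `scale`.

    Assumption: the voltage can be lowered in step with the frequency.
    """
    energy = scale ** 2  # energy per operation ~ a * C * V^2
    power = scale ** 3   # power ~ a * C * V^2 * f
    return energy, power


def ideal_shrink(s=0.7):
    """Scale factors for one ideal process shrink with linear factor `s`."""
    cap = s                          # capacitance per transistor
    vdd = s                          # supply voltage
    current = (s / s) * s            # I ~ (W / tox) * (Vdd - Vt)
    delay = cap * vdd / current      # T = C * Vdd / I
    freq = 1.0 / delay
    power = cap * vdd ** 2 * freq    # power per transistor
    area = s * s                     # area per transistor
    return {
        "delay": delay,
        "frequency": freq,
        "power": power,
        "area": area,
        "power_density": power / area,
    }


if __name__ == "__main__":
    energy, power = voltage_scaling(0.8)
    print(f"20% lower V and f: energy x{energy:.2f}, power x{power:.2f}")
    for name, value in ideal_shrink(0.7).items():
        print(f"ideal shrink {name}: x{value:.2f}")
```

  With a scale of 0.8 the sketch reproduces the 0.64 (energy) and 0.51 (power) factors, and for an ideal 0.7 shrink the per-transistor power and area both halve, so power density stays at 1X. The ideal frequency gain comes out as 1/0.7, about 1.43X, which the slides round to 1.5X.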
