
MAMAS – Computer Architecture
Pentium® M Processor
Based on "The Intel® Pentium® M Processor: Microarchitecture and Performance", Intel Technology Journal Q2/2003
http://developer.intel.com/technology/itj/
Dr. Lihu Rappoport
Lihu Rappoport, 12/2004

Intel® Centrino™ Mobile Technology
• Comprised of
  – Pentium® M processor
  – Mobile chipset
  – Wireless network connection
• Enables
  – Integrated wireless LAN capability
  – Highest mobile performance
  – Extended battery life
  – Thinner, lighter designs
Figure: platform components. Labels: Intel® Pentium® M Processor; Intel® 855 Chipset Family; ICH4-M; Intel® Pro/Wireless 2100 Network Connection.

The Intel Pentium® M processor
• Intel’s first microprocessor designed specifically for mobility
  – Achieve best performance at given power and thermal constraints
      • Different power/performance tradeoffs than a traditional high-performance processor
  – Achieve longest battery life
• Power dissipation
  – Power generates heat
  – Transistors must be kept within their allowed operating temperature range
      • Heat has to be dissipated in a cost-effective manner
  – Limit the processor’s peak power consumption
      • Applies both to desktops and mobile computers
      • Mobile computer’s smaller form-factor and lighter weight decrease the mobile processor’s power budget
• Battery life
  – Batteries are designed to support a certain Watts × Hours
  – Higher average power → shorter battery life
  – Limits the processor’s average power consumption
  – Crucial factor for mobile computers, but less relevant for desktop computers

Pentium® M
                Banias                 Dothan
Transistors     77M                    140M
Process         130 nm                 90 nm
Die size        84 mm²                 85 mm²
Peak power      24.5 watts             21 watts
Frequency       1.7 GHz                2.1 GHz
L1 cache        32KB I$ + 32KB D$      32KB I$ + 32KB D$
L2 cache        1MB                    2MB

Dothan Die
Figure: Dothan die photo, with dimensions labeled 12.5 mm and 6.6 mm.

Higher Performance vs.
Longer Battery Life
• Processor average power is less than 10% of platform power
  – The majority of power in the platform is consumed by other components: LCD, hard disk, memory and others
  – The processor reduces power in periods of low processor activity
  – The processor enters lower power states in idle periods
• Even an ideal processor can extend battery life by 11% at most!
  (with the CPU at about 10% of platform power, removing it entirely gives 1/0.9 ≈ 1.11)
• Decision:
  – Optimize for performance when active
  – Optimize for battery life when idle
Figure: platform power breakdown (pie chart). Labels and values in extracted order: Intel® LAN, Fan, DVD, ICH: 2%, 2%, 2%, 3%; CLK 5%; Display (panel + inverter) 33%; HDD 8%; GFX 8%; Misc. 8%; CPU 10%; Intel® MCH, Power Supply: 10%, 9%.
Source: 2004 Extended Battery Life Technologies, Don J Nguyen, Intel Developer Forum, Spring 2003

Static Power
• The power consumed by a processor consists of
  – Active power: used to switch transistors
  – Static power: leakage of transistors under voltage
• Static power is a function of
  – Number of transistors and their type
  – Operating voltage
  – Die temperature
• Leakage is growing dramatically
  – Reaching 20% in current process technology, and growing
• Pentium® M reduces static power consumption
  – The L2 cache is built with low-leaking transistors
      • L2 is 2/3 of the die transistors
      • Low-leaking transistors are slower, increasing cache access latency
      • The significant power saved justifies the small performance loss
  – Enhanced SpeedStep® technology
      • Reduces voltage (and temperature), hence leakage, when processor activity is low

Active Power
• Power is consumed when capacitance is charged/discharged
  – Changing 0→1 or 1→0
  – The capacitance can be on transistor gates and on wires
• $P = \alpha C V^2 f$
  – α: activity, C: capacitance, V: voltage, f: frequency
  – Measured in watts
• Higher power → higher current and higher temperature
  – Peak power cannot exceed the thermal constraints
• Power density
  – Measured in watts/cm²
  – Denser power is harder to cool
  – Increases with every process technology
    generation → higher power at a smaller die size

Energy & Average Power
• Energy = total of all switching energy and leakage waste
  – Measured in either joules or watt × hours
• Average power = Total energy / Total time
  – Including low-activity and idle time
• Typical figures (leading-edge processors)
  – Average power: 1W-3W
  – Peak power: 20W-100W

Optimize for Performance
• Goal: maximize performance at given thermal constraints
  – Approximated by: maximizing performance at a given power budget
• Processor power at a given voltage V0 and frequency f0
      $P_0 = \alpha C V_0^2 f_0$
• Frequency approximated as linearly proportional to voltage
      $f_0 = K_f \times V_0$
• Leads to a cubic dependency of power on the voltage
      $P_0 = \alpha C K_f V_0^3$
• The test: “A micro-architectural feature that gains performance or saves power should be better than simply using voltage/frequency scaling”
• It can be shown that the right performance/power tradeoff is: 1% more performance for less than 3% more power is a gain!
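The 1%-for-3% rule above can be sketched from the cubic model. This is a minimal illustration, assuming performance scales linearly with frequency and frequency linearly with voltage, as the slide approximates; the function names `dvfs_power_ratio` and `beats_dvfs` are hypothetical and not from the source.

```python
def dvfs_power_ratio(perf_ratio: float) -> float:
    """Power ratio needed to reach perf_ratio by voltage/frequency scaling alone.

    Assumes performance ~ f, f = Kf * V and P = alpha * C * V^2 * f,
    so power grows with the cube of the performance ratio.
    """
    return perf_ratio ** 3


def beats_dvfs(perf_gain_pct: float, power_cost_pct: float) -> bool:
    """True if a feature changes power by less than plain scaling would.

    Negative arguments describe a performance loss or a power saving.
    """
    perf_ratio = 1.0 + perf_gain_pct / 100.0
    power_ratio = 1.0 + power_cost_pct / 100.0
    return power_ratio < dvfs_power_ratio(perf_ratio)


if __name__ == "__main__":
    # A feature giving 1% performance for 2% power passes the test;
    # one costing 4% power does not.
    print(beats_dvfs(1.0, 2.0), beats_dvfs(1.0, 4.0))
```

The same check covers power-saving features: a change that loses 1% performance is worthwhile under this model only if it saves more than about 3% power.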
“Less is More”
• Fewer instructions per task
  – Advanced branch prediction reduces the number of wrong-path instructions executed
      • Branch predictor logic consumes power, but the gain is still positive
  – SSE instructions reduce the number of instructions architecturally
• Fewer uops per instruction
  – Uop fusion
  – Dedicated stack engine
• Fewer transistor switches per micro-op
  – Efficient bus
  – Various lower-level optimizations
• Less energy per transistor switch
  – Enhanced SpeedStep® technology
Power-awareness top to bottom

Loop predictor
• Pentium® M employs best-in-class branch prediction
  – Bimodal predictor, global predictor, loop detector
  – Indirect branch predictor
• Loop predictor: analyzes branches for loop behavior
  – Moving in one direction (taken or not taken) a fixed number of times
  – Ended with a single movement in the opposite direction
• When such a branch is detected
  – A set of counters is allocated
  – The loop is predicted completely accurately
  – Also for larger iteration counts than captured by global or local predictors
Figure: loop predictor datapath. Labels: Count, Limit, Prediction, +1, =, 0.

Indirect Branch Predictor
• The target of indirect branches is data dependent
  – Some indirect branches still have a single target at run time
  – Some have many targets
      • E.g., a case statement in a Java byte-code interpreter
• Indirect branches are heavily used in object-oriented code (C++, Java)
  → they became a growing source of branch mispredictions
• An indirect branch is resolved at execution → high misprediction penalty
• A dedicated indirect branch target predictor (iTA)
  – Chooses targets based on a global history
  – Similar to the global conditional branch predictor
• Initially an indirect branch is allocated only in the target array (TA)
• If the target of an indirect branch is mispredicted by the TA
  – Allocate an entry in the iTA corresponding to the global history leading to this instance
    of the indirect branch
  – Monotonic indirect branches are still predicted by the TA
  – Data-dependent indirect branches allocate as many targets as needed

Indirect Branch Predictor (cont.)
• Prediction from the iTA is used if
  – The TA indicates an indirect branch
  – The iTA hits for the current global history
• An iTA hit by itself does not qualify a branch as indirect
Figure: predictor block diagram. Labels: Branch IP; Target Array (hit, indirect branch, target); Global history; Indirect Target Predictor (hit, target); MUX; Predicted Target.

Dedicated Stack Engine
• IA32 has HW-assisted stack management instructions
  – Push: ESP -= src_size; MEM[ESP] ← src;
  – Pop:  dst ← MEM[ESP]; ESP += src_size;
  – Call: ESP -= 4; MEM[ESP] ← EIP; EIP ← addr;
  – Ret:  EIP ← MEM[ESP]; ESP += 4;
• Sequences of such instructions are quite common
  – E.g., PUSHing a set of operands and then using a CALL for a function call
• An additional uop updates the ESP register
  – This uop adds an immediate value to, or subtracts it from, the ESP register

Dedicated Stack Engine
• Pentium® M uses dedicated logic near the decoders to update ESP
• The programmer’s view of ESP (ESPP) is represented by
  – ESPO: a historic ESP living in the out-of-order execution core
  – ESPD: a delta maintained in the front end
      ESPP := ESPO + ESPD
• When a sequence of PUSHes and POPs is encountered
  – The accumulated delta value is passed across the decoders and updates ESPD
  – The ESPD value is patched into the address syllable of stack-referencing uops
      → the AGU can calculate the proper memory location referenced by ESPP

Dedicated Stack Engine
• ESPD lives in the front end → its calculations are speculative
  – Need to be able to recover the ESPD and ESPO values in case of a flush
  – A dedicated table saves the ESPD value for every instruction
  – ESPO is maintained by the OOO core like any other
    general-purpose register
  – ESPP can be recovered for all instructions
      • Either pre- or post-execution
      • This allows for handling faults or traps as defined in IA32
• The architectural value of ESP may be needed in the OOO core
  – E.g., when ESP is used in an address syllable, or: “XOR ESP,3”
  – Decode logic inserts a sync uop that carries out the ESPP calculation
  – Following a sync uop, ESPD is cleared → the architectural value is now coherent
  – A sync is not generated when the ESPD register is zero
      • Continued usage of ESP as a general-purpose register has no ill effects

Dedicated Stack Engine Benefits
• Dependencies on ESP are removed
  – The ESPO value used for scheduling in the out-of-order machine is not changed during a sequence of stack operations
  – The stack operations can be executed in parallel
• ESPD updates are done using a small dedicated adder
  – Freeing the general execution units to work on other uops
      • Effectively increasing execution bandwidth
  – Saves power: dedicated adders take less power than execution units
• ESP-update uops are eliminated from the out-of-order machine
  – Typically eliminates 5% of the uops (including the ESP sync uops)
      • Effectively increases decode bandwidth → this is the major performance gain
      • Effectively increases ROB and RS size
  – Saves power: eliminated uops don’t toggle bits throughout the machine
      • Energy per instruction decreases

Uop Fusion
• Out-of-order implementations of IA32 break instructions into uops
  – A conventional uop consists of a single operation operating on two sources
• The Instruction Decoder breaks an instruction into multiple uops
  – whenever the instruction operates on more than two sources, or
  – when the nature of the operation requires a sequence of operations
• Splitting the instruction into multiple uops also takes its toll
  – The increased number of uops creates pressure on resources with limited bandwidth
    (rename, retire) or limited capacity (ROB, RS)
  – Instructions that are decoded into >1 uop can only be decoded by decoder 0
  – Delivering more uops through the system increases the energy required to complete a given instruction sequence
• Pentium® M features uop fusion
  – The Instruction Decoder fuses two uops into one uop
  – The fused uop is seen as 1 uop in allocation, dispatch, and retirement
  – Fused uops are executed as non-fused operations
      • Maintains the benefits of non-fused behavior
  – Reduces performance and energy cost while maintaining the OOOE benefit
      • Provides an effectively wider decoder, allocation, and retirement

Uop Fusion (cont.)
• The different domains in which the uop is fused and un-fused
  – The instruction is decoded into a single fused uop by the decoder
  – The fused uop is allocated, renamed, and issued into a single entry in the ROB and RS
      • Each RS entry can accommodate up to three source operands
Figure: pipeline domains. Fused-uops domain: Decode, Alloc / RAT, RS, ROB. Un-fused-uops domain: Exe. Units.
• When dispatching to the execution units
  – The dispatcher controls the execution of each portion of the fused uop
      • according to the readiness of its sources
  – Each portion is treated as if it occupied the whole entry for itself
      • Executed in the same way as a non-fused uop

Fused Store
• A store instruction is decoded as two independent uops
  – store-address: calculates the address of the store
  – store-data: stores the data into the Store Data buffer
      • The actual write to memory is done when the store retires
• Separating store-data and store-address is important for memory disambiguation
  – Allows store-address to dispatch earlier, even before the stored data is known
  – Address conflicts are resolved earlier → opens the memory pipeline for other loads
• store-data and store-address can be issued to execution units in parallel
  – Store-address is dispatched to the AGU when its sources (base and index registers) are ready
  – Store-data is dispatched to the store data buffer unit independently, when its source operand is available
• A fused store can retire only after both operations complete
Figure: fused store flow. Decoded and renamed fused store uop; dispatch Store Data (save faults in register file) and dispatch Store Address (save faults in register file); retire values when both operations have completed.

Fused Load-Op
• A load-op (read-modify) instruction consists of two uops
  – Read the operand from an address in memory
  – Calculate the result based on the 1st operand and a register operand (and write the result to a register)
• A load-op instruction may have up to 3 register operands
  – It must be implemented by two uops
• The two operations are inherently serial
  – The Op cannot start until the Load completes
Figure: fused load-op flow, first steps: decode and rename the load-op instruction into a fused uop; dispatch Load, save faults in register file.
• The load and the op are issued serially to the relevant execution units
  – The load is dispatched when
    its sources (base and index registers) are ready
  – The op can be dispatched only after the load completes and the other operand is ready
• A fused load-op uop can retire only after both operations complete
Figure: fused load-op flow, final steps: dispatch Op, save faults in register file; retire values when both operations have completed.

Uop Fusion – Best of all Worlds
Figure (two builds) for the instruction: add eax, dword ptr data
  – Without fusion: the Decoder and the Scheduler each carry separate LD and OP uops; LD goes to the Cache, OP to the ALU
  – With fusion: the Decoder and the Scheduler each carry a single LD + OP uop; LD and OP still execute as independent uops (OOO/superscalar execution) in the Cache and the ALU
• Micro-op fusion enables effective machine utilization
• Achieving >10% micro-op reduction

Uop Fusion Performance
• Uop fusion reduces the number of uops handled by the OOO logic by >10%
  – Increases performance by effectively widening issue, rename, and retire
• The biggest boost is obtained during bursts of memory operations
  – All decoders can decode instructions (instead of only decoder 0)
  – Practically widens the processor decode, allocation, and retirement bandwidth by a factor of three
• The typical performance increase from uop fusion
  – Integer code: 5%, most of it from store fusion
  – FP code: 9%, equally from the two types of fused uops
• Delivering fewer uops through the processor decreases the energy required to complete a given instruction sequence
  – The same task is accomplished by processing fewer uops
• The net power reduction is positive
  – More power is saved than is added for the uop fusion logic

Idle Periods Prediction
• Predict idle periods and instruct units to reduce power
  – Either by shutting off their clocks or by disabling parts of their logic
  – Resume operations seamlessly with no performance penalty
• Power predictor example: the Allocate stall predictor
  – Whenever the ROB is full, the Allocator stalls the pipeline
  – The Allocator cannot tell if the ROB will remain full on the next cycle
      • Needs to re-evaluate the stall condition every cycle
  – It turns out that in many cases when the ROB is full, it stays so for very long periods
  – The predictor collects information from the ROB and other units
      • To predict the nature of the next cycle
      • Instructs the Allocator to continue stalling and shut off its clocks

Execution Units Stacking
• Identify and activate only the parts of the processor needed for a specific operation
  – EUs attached to an execution port share the same source bus wires
  – Drive only the wires that belong to the target EU
  – EUs are divided into a few segments (stacks)
      • Special logic controls the data flow to each stack according to its actual destination

Early identification of EU width
• IA32 processors operate on data types with different widths
  – Integer operations, operating on 32 bits (the most common)
  – Floating-point operations, operating on 80 bits
  – Multimedia operations, operating on 64 bits or 128 bits
• Toggling a wider bus and reading from a bigger register file consumes more power than is actually required
• Integer operations are identified in advance
  – Narrower buses to and from the EU during dispatch and write-back
  – Renaming logic unused for integer operations is not activated
• Effectively transforms the processor into a 32-bit machine
  – Utilize only the resources needed for integer operations while operating on integers

Backup

Performance/Power Tradeoff Zones
Figure: tradeoff-zone chart. Horizontal axis: performance, from -30% (performance loss) to 30% (performance gain). Vertical axis: power, from -60% (power gain) to 100% (power loss). Lines: constrained-performance breakeven line; energy breakeven line. Zones: wrong trade-off zone; energy loss, constrained perf gain; energy loss, constrained perf loss; energy gain, constrained perf loss; energy gain, constrained perf gain.

The Pentium® M Bus
• Power saving is achieved by protocol and circuit methods
• The bus supports a 100MHz bus clock with a data rate of 400M transfers/sec
• It is a latched bus with an in-order queue of 8 pipelined transactions
• The bus is optimized for a mobile-processor environment
  – Supports only uni-processor
      • A mobile system’s power budget cannot support dual processors anyway
  – Only 32 address bits, which cover 4GB of physical address space
• The bus saves power aggressively when idle
  – It controls its input buffers’ sense amplifiers that sample the activity on the bus
  – When the bus is idle, the sense amplifiers are disabled and do not consume any power
  – When the bus is active and address and data are driven on the bus, the input buffers are enabled in advance to ensure all information is captured with no delay
• Data Bus Power Control Signal (DPWR#)
  – Driven by the 855PM chipset whenever data are transferred to the processor
  – DPWR# is used to dynamically enable the processor’s 64-bit data bus input sense amplifiers and their related controls (~80 signals) only when data are transferred to the bus
• BPRI Control
  – This is a method to achieve the DPWR# functionality for the address bus
  – BPRI# is asserted whenever the 855PM chipset attempts to drive the bus
      • Used to dynamically enable the 32-bit address bus input sense amplifiers and their related controls (~40 signals) only when a transaction is issued to the bus

The Pentium® M Bus
• Low Vtt:
  – The processor’s I/O buffers work at a low voltage of 1.05V (Vtt).
  – The low Vtt is an essential element to reduce the bus power.
  – Operating at low Vtt introduces a new set of problems
      • The I/O buffer is working at the low linear point, which affects the buffer’s characteristics.
  – The bus includes a special Resistor Compensation (RCOMP) method to adjust the buffer strength dynamically during run time
  – Accommodates the impacts of temperature, voltage drift, and bus topology
  – At any thermal and power state the bus has full impedance termination
  – It has split power planes that allow setting the I/O operating voltage to a fixed value of 1.05V even though the core may be operating at a higher Enhanced Intel SpeedStep™ technology operating point.
• PSI: Power Status Indicator
  – Driven by the processor to control the current consumption of the Voltage Regulator when the processor operates at a low power state
  – Reduces the overall platform power (not just the processor power!)

Enhanced SpeedStep™ Technology
• The “Basic” SpeedStep™ Technology had
  – 2 operating points
  – Non-transparent switch
• The “Enhanced” version provides
  – Multiple voltage/frequency operating points. The Pentium M processor 1.6GHz operating range:
      • From 600MHz @ 0.956V to 1.6GHz @ 1.484V
  – Transparent switch
  – Frequent switches
• Benefits
  – Higher power efficiency
      • 2.7X lower frequency → 2X performance loss → >2X energy gain
  – Outstanding battery life
  – Excellent thermal management
Figure: “Voltage, Frequency, Power”. Frequency (GHz, axis 0.0 to 4.0) and typical power (Watts, axis 0 to 20) plotted against voltage (Volt, axis 0.8 to 1.6). Annotations: 2.7X on frequency, 6.1X on power, efficiency ratio = 2.3.

Voltage, Power, Frequency
• Transistors switch faster at higher voltage → higher voltage enables higher frequency
• Maximum frequency grows about linearly with voltage,
  within a given voltage range Vmin-Vmax.
  – V < Vmin → transistors won’t switch.
  – V > Vmax → the device may burn.
• “The cube law”: $P \sim kV^3$ (or ~1% V = 3% P)
• Implications
  – Can save energy/power when performance is not a factor
Figure: XScale processor frequency (MHz) and power (mWatt) vs. voltage; vertical axis 0 to 1000, voltage axis 0.5 to 1.9.
Source: Intel Corp. (http://developer.intel.com)
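As a rough check on the voltage/frequency discussion, the active-power relation P = αCV²f can be applied to the two Enhanced SpeedStep operating points quoted in these slides (600MHz @ 0.956V and 1.6GHz @ 1.484V). This is a minimal sketch, assuming α and C are the same at both points and ignoring leakage, so its results need not match the chart's typical-power figures; the function names are hypothetical.

```python
def active_power_ratio(v_hi: float, f_hi: float, v_lo: float, f_lo: float) -> float:
    """Ratio P_hi / P_lo under P = alpha * C * V^2 * f.

    alpha and C are assumed identical at both operating points, so they cancel.
    """
    return (v_hi / v_lo) ** 2 * (f_hi / f_lo)


def energy_per_cycle_ratio(v_hi: float, v_lo: float) -> float:
    """Ratio of active energy per clock cycle, which scales with V^2 only."""
    return (v_hi / v_lo) ** 2


# Operating points quoted in the slides for the 1.6GHz Pentium M (volts, GHz).
V_LO, F_LO = 0.956, 0.6
V_HI, F_HI = 1.484, 1.6

freq_ratio = F_HI / F_LO
power_ratio = active_power_ratio(V_HI, F_HI, V_LO, F_LO)
energy_ratio = energy_per_cycle_ratio(V_HI, V_LO)

if __name__ == "__main__":
    print(f"frequency ratio:        {freq_ratio:.2f}")
    print(f"active power ratio:     {power_ratio:.2f}")
    print(f"energy per cycle ratio: {energy_ratio:.2f}")
```

Under these assumptions the active power ratio comes out near 6.4 and the energy-per-cycle ratio near 2.4, the same order as the 6.1X and 2.3 annotated on the SpeedStep chart.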
