MAMAS – Computer Architecture
Pentium® M Processor
Based on
The Intel® Pentium® M Processor:
Microarchitecture and Performance
Intel Technology Journal Q2/2003
http://developer.intel.com/technology/itj/
Dr. Lihu Rappoport
Lihu Rappoport, 12/2004
Intel Pentium® M processor
Intel® Centrino™ Mobile Technology
• Comprised of
  – Pentium® M processor
  – Mobile chipset
  – Wireless network connection
• Enables
  – Integrated wireless LAN capability
  – Highest mobile performance
  – Extended battery life
  – Thinner, lighter designs
[Figure: Centrino platform components: Intel® Pentium® M Processor, Intel® 855 Chipset Family (with ICH4-M), Intel® Pro/Wireless 2100 Network Connection]

The Intel Pentium® M processor
• Intel’s first microprocessor designed specifically for mobility
  – Achieve best performance at given power and thermal constraints
      · Different power/perf tradeoffs than a traditional high-performance processor
  – Achieve longest battery life
• Power dissipation
  – Power generates heat
  – Transistors must be kept within their allowed operating temperature range
      · Heat has to be dissipated in a cost-effective manner
  – Limit the processor’s peak power consumption
      · Applies both to desktops and mobile computers
      · Mobile computer’s smaller form-factor and lighter weight decrease the mobile processor’s power budget
• Battery life
  – Batteries are designed to support a certain Watts × Hours
  – Higher average power → shorter battery life
  – Limits the processor’s average power consumption
  – Crucial factor for mobile computers, but less relevant for desktop computers

Pentium® M     Banias               Dothan
Transistors    77M                  140M
Process        130nm                90nm
Die size       84 mm²               85 mm²
Peak power     24.5 watts           21 watts
Freq           1.7 GHz              2.1 GHz
L1 cache       32KB I$ + 32KB D$    32KB I$ + 32KB D$
L2 cache       1MB                  2MB

Dothan Die
[Figure: Dothan die photo, 12.5 mm × 6.6 mm]

Higher Performance vs. Longer Battery Life
• Processor average power is <10% of platform power
  – The majority of power in the platform is consumed by other components: LCD, hard disk, memory and others
  – The processor reduces power in periods of low processor activity
  – The processor enters lower power states in idle periods
• Even an ideal processor can extend battery life by 11% at most!
• Decision:
  – Optimize for performance when active
  – Optimize for battery life when idle
[Figure: platform power breakdown (pie chart). Display (panel + inverter) 33%; CPU 10%; HDD 8%; GFX 8%; Misc. 8%; CLK 5%; Intel® MCH and power supply about 10% each (10% and 9%); ICH, DVD, fan and LAN 2-3% each. Source: 2004 Extended Battery Life Technologies, Don J Nguyen, Intel Developer Forum, Spring 2003]

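Worked example (a sketch, not part of the original slides): the 11% bound follows from treating battery life as a fixed energy budget divided by average platform power. The function name and the "power halved" case are illustrative assumptions; only the 10% CPU share comes from the slide.

```python
def battery_life_gain(cpu_share: float, cpu_power_cut: float = 1.0) -> float:
    """Relative battery-life gain when the CPU's average power is cut.

    cpu_share     -- CPU fraction of average platform power (0.10 on the slide)
    cpu_power_cut -- fraction of CPU power removed (1.0 = ideal zero-power CPU)
    """
    if not 0.0 <= cpu_share < 1.0:
        raise ValueError("cpu_share must be in [0, 1)")
    if not 0.0 <= cpu_power_cut <= 1.0:
        raise ValueError("cpu_power_cut must be in [0, 1]")
    remaining_power = 1.0 - cpu_share * cpu_power_cut
    # Battery life = fixed energy / average power, so it scales as 1 / power.
    return 1.0 / remaining_power - 1.0


if __name__ == "__main__":
    print(f"ideal CPU:        +{battery_life_gain(0.10):.1%}")
    print(f"CPU power halved: +{battery_life_gain(0.10, 0.5):.1%}")
```

With a 10% share, removing all CPU power leaves 90% of the platform power, so battery life grows by 1/0.9 - 1, about 11%.
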
Static Power
• The power consumed by a processor consists of
  – Active power: used to switch transistors
  – Static power: leakage of transistors under voltage
• Static power is a function of
  – Number of transistors and their type
  – Operating voltage
  – Die temperature
• Leakage is growing dramatically
  – Reaching 20% in current process technology, and growing
• Pentium® M reduces static power consumption
  – The L2 cache is built with low-leaking transistors
      · L2 is 2/3 of the die transistors
      · Low-leaking transistors are slower, increasing cache access latency
      · The significant power saved justifies the small performance loss
  – Enhanced SpeedStep® technology
      · Reduces voltage (and temperature), hence leakage, when processor activity is low

Active Power
• Power is consumed when capacitance is charged/discharged
  – Changing 0→1 or 1→0
  – The capacitance can be on transistor gates and on wires
• Power = αCV²f
  – α: activity, C: capacitance, V: voltage, f: frequency
  – Measured in watts
• Higher power → higher current and higher temperature
  – Peak power cannot exceed the thermal constraints
• Power density
  – Measured in watts/cm²
  – Denser power is harder to cool
  – Increases with every process technology generation
      · Higher power @ smaller die size

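Worked example (a sketch, not part of the original slides): the αCV²f relation evaluated numerically. The activity factor and capacitance below are hypothetical values picked only to give a plausible magnitude; the 24.5 W and 84 mm² figures are the Banias numbers from the comparison table.

```python
def active_power(alpha: float, capacitance_f: float,
                 voltage_v: float, freq_hz: float) -> float:
    """Switching power in watts: P = alpha * C * V^2 * f."""
    return alpha * capacitance_f * voltage_v ** 2 * freq_hz


def power_density(power_w: float, die_area_cm2: float) -> float:
    """Power density in watts per square centimetre."""
    return power_w / die_area_cm2


# Hypothetical activity factor and switched capacitance (assumptions).
ALPHA, CAP_F = 0.2, 50e-9

base = active_power(ALPHA, CAP_F, 1.4, 1.6e9)
half_freq = active_power(ALPHA, CAP_F, 1.4, 0.8e9)    # same voltage, half frequency
low_volt = active_power(ALPHA, CAP_F, 1.26, 1.6e9)    # 10% lower voltage

# Banias figures from the table: 24.5 W peak on an 84 mm^2 (0.84 cm^2) die.
banias_density = power_density(24.5, 0.84)

if __name__ == "__main__":
    print(f"half frequency: {half_freq / base:.2f}x power")
    print(f"-10% voltage:   {low_volt / base:.2f}x power")
    print(f"Banias peak density: {banias_density:.1f} W/cm^2")
```

Power is linear in frequency but quadratic in voltage, which is why a 10% voltage reduction alone removes about 19% of the switching power.
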
Energy & Average Power
• Energy = total of all switching energy and leakage waste
  – Measured in either joules or watt × hours
• Average power = Total energy / Total time
  – Including low-activity and idle time
• Typical figures (leading-edge processors)
  – Average power: 1W-3W
  – Peak power: 20W-100W

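Worked example (a sketch, not part of the original slides): average power as total energy over total time. The two-phase workload (a short 20 W burst followed by a long 1 W near-idle period) is an invented illustration chosen to fall within the typical figures listed above.

```python
def total_energy_joules(phases):
    """phases: iterable of (power_watts, duration_seconds) pairs."""
    return sum(power * seconds for power, seconds in phases)


def average_power(phases):
    """Average power in watts = total energy / total time."""
    phases = list(phases)
    total_time = sum(seconds for _, seconds in phases)
    if total_time <= 0:
        raise ValueError("total duration must be positive")
    return total_energy_joules(phases) / total_time


def joules_to_watt_hours(joules: float) -> float:
    """1 Wh = 3600 J."""
    return joules / 3600.0


# Hypothetical minute of use: 3 s at 20 W, then 57 s at 1 W.
workload = [(20.0, 3.0), (1.0, 57.0)]
avg_w = average_power(workload)
energy_wh = joules_to_watt_hours(total_energy_joules(workload))

if __name__ == "__main__":
    print(f"average power: {avg_w:.2f} W, energy: {energy_wh:.4f} Wh")
```

Even with a peak of 20 W, the average stays near 2 W because the processor spends most of the time in a low-power state.
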
Optimize for Performance
• Goal: maximize performance at given thermal constraints
  – Approximated by: maximizing performance at a given power budget
• Processor power at a given voltage V₀ and frequency f₀
      P₀ = αC·V₀²·f₀
• Frequency approximated as linearly proportional to voltage
      f₀ = K_f × V₀
• Leads to cubic dependency of power on the voltage
      P₀ = αC·K_f·V₀³
• The test
      “A micro-architectural feature that gains performance or saves power should be better than simply using voltage/frequency scaling”
• It can be shown that the right performance/power tradeoff is:
      1% more performance for less than 3% more power is a gain!

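Worked example (a sketch, not part of the original slides): the 1%-for-3% rule as a check against plain voltage/frequency scaling. It assumes, beyond what the slide states, that performance scales linearly with frequency; the function names are illustrative.

```python
def power_cost_of_vf_scaling(perf_gain: float) -> float:
    """Relative power increase needed to gain perf_gain by V/f scaling alone.

    Assumes performance ~ f and f ~ V, so power ~ V^2 * f ~ f^3.
    """
    return (1.0 + perf_gain) ** 3 - 1.0


def beats_vf_scaling(perf_gain: float, power_cost: float) -> bool:
    """True if a feature is a better deal than simply scaling V/f.

    Negative values mean a performance loss or a power saving.
    """
    return power_cost < power_cost_of_vf_scaling(perf_gain)


if __name__ == "__main__":
    print(f"1% perf via V/f scaling costs {power_cost_of_vf_scaling(0.01):.2%} power")
    print(beats_vf_scaling(0.01, 0.02))    # +1% perf for +2% power
    print(beats_vf_scaling(0.01, 0.05))    # +1% perf for +5% power
    print(beats_vf_scaling(-0.01, -0.04))  # -1% perf for -4% power
```

The same test covers power-saving features: losing 1% performance is acceptable only if it saves more than the roughly 3% that lowering voltage and frequency would save anyway.
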
“Less is More”
• Fewer instructions per task
  – Advanced branch prediction reduces the number of wrong instructions executed
      · Branch predictor logic consumes power, but the gain is still positive
  – SSE instructions reduce the number of instructions architecturally
• Fewer uops per instruction
  – Uop fusion
  – Dedicated stack engine
• Fewer transistor switches per micro-op
  – Efficient bus
  – Various lower-level optimizations
• Less energy per transistor switch
  – Enhanced SpeedStep® technology
Power-awareness top to bottom

Loop predictor
• Pentium® M employs best-in-class branch prediction
  – Bimodal predictor, global predictor, loop detector
  – Indirect branch predictor
• Loop predictor: analyzes branches for loop behavior
  – Moving in one direction (taken or not taken) a fixed number of times
  – Ended with a single movement in the opposite direction
• When such a branch is detected
  – A set of counters is allocated
  – The loop is predicted completely accurately
  – Also for larger iteration counts than captured by global or local predictors
[Figure: loop predictor logic. Labels: Count, Limit, Prediction, +1, =, 0]

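A minimal sketch of the count/limit idea (not the actual Pentium® M logic): one counter tracks iterations in the current run, and a learned limit marks where the branch flips direction. The class, its fields, and the learning policy (copy the count into the limit on every exit) are assumptions made for illustration; the sketch also assumes the branch has already been identified as a loop branch.

```python
class LoopPredictor:
    """Toy loop predictor for a single branch."""

    def __init__(self, loop_direction: bool = True):
        self.loop_direction = loop_direction  # direction while the loop iterates
        self.limit = None                     # learned iteration count
        self.count = 0                        # iterations seen in the current run

    def predict(self) -> bool:
        # Predict the exit only when the learned limit has been reached.
        if self.limit is not None and self.count == self.limit:
            return not self.loop_direction
        return self.loop_direction

    def update(self, taken: bool) -> None:
        if taken == self.loop_direction:
            self.count += 1
        else:
            self.limit = self.count  # learn (or refresh) the trip count
            self.count = 0


def count_mispredictions(predictor, trip_count: int, executions: int) -> int:
    """Run a loop of trip_count iterations `executions` times."""
    misses = 0
    for _ in range(executions):
        for taken in [True] * trip_count + [False]:
            if predictor.predict() != taken:
                misses += 1
            predictor.update(taken)
    return misses


if __name__ == "__main__":
    print(count_mispredictions(LoopPredictor(), trip_count=100, executions=5))
```

In this model only the first loop exit is mispredicted; once the limit is learned, the exit of a 100-iteration loop is predicted correctly, which a short global or local history could not capture.
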
Indirect Branch Predictor
• The target of an indirect branch is data dependent
  – Some indirect branches still have a single target at run time
  – Some have many targets
      · E.g., a case statement in a Java byte-code interpreter
• Indirect branches are heavily used in object-oriented code (C++, Java)
      · They have become a growing source of branch mispredictions
• An indirect branch is resolved at execution → high misprediction penalty
• A dedicated indirect branch target predictor (iTA)
  – Chooses targets based on a global history
  – Similar to a global conditional branch predictor
• Initially, an indirect branch is allocated only in the target array (TA)
• If the target of an indirect branch is mispredicted by the TA
  – Allocate an entry in the iTA corresponding to the global history leading to this instance of the indirect branch
  – Monotonic indirect branches are still predicted by the TA
  – Data-dependent indirect branches allocate as many targets as needed

Indirect Branch Predictor (cont.)
• Prediction from the iTA is used if
  – TA indicates an indirect branch
  – iTA hits for the current global history
• An iTA hit by itself does not qualify a branch as indirect
[Figure: indirect branch prediction datapath. Labels: Branch IP, Global history, Target Array (hit, indirect branch, target), Indirect Target Predictor (hit, target), MUX, Predicted Target]

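A minimal sketch of the TA/iTA selection and allocation policy described on these two slides (not the actual hardware). Dictionaries stand in for the arrays, the global history is a shift register of conditional-branch outcomes with an assumed width of 4 bits, and the TA simply keeps the first target it saw; all of these are illustrative assumptions.

```python
class IndirectPredictor:
    """Toy target array (TA) plus history-indexed indirect target array (iTA)."""

    def __init__(self, history_bits: int = 4):
        self.ta = {}    # branch IP -> target
        self.ita = {}   # (branch IP, global history) -> target
        self.history = 0
        self.mask = (1 << history_bits) - 1

    def record_branch(self, taken: bool) -> None:
        """Shift a conditional-branch outcome into the global history."""
        self.history = ((self.history << 1) | int(taken)) & self.mask

    def predict(self, ip: int):
        if ip not in self.ta:          # the TA must identify the branch first
            return None
        # Use the iTA only on a hit for the current history, else the TA target.
        return self.ita.get((ip, self.history), self.ta[ip])

    def update(self, ip: int, target: int) -> bool:
        """Train with the resolved target; return True if it was predicted."""
        predicted = self.predict(ip)
        if ip not in self.ta:
            self.ta[ip] = target                      # first allocate in the TA only
        elif predicted != target:
            self.ita[(ip, self.history)] = target     # allocate on a misprediction
        return predicted == target


def run_correlated(n: int = 20):
    """Indirect branch whose target follows the preceding conditional branch."""
    p = IndirectPredictor()
    hits = 0
    for i in range(n):
        taken = (i % 2 == 0)
        p.record_branch(taken)
        hits += p.update(0x400, 0xA0 if taken else 0xB0)
    return hits, len(p.ita)


if __name__ == "__main__":
    print(run_correlated())
```

A branch that always goes to the same target never leaves the TA, while the data-dependent branch above allocates one iTA entry per history pattern that the TA target got wrong.
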
Dedicated Stack Engine
• IA32 has HW-assisted stack management instructions
  – Push: ESP -= src_size;   MEM[ESP] ← src;
  – Pop:  dst ← MEM[ESP];    ESP += src_size;
  – Call: ESP -= 4;          MEM[ESP] ← EIP;    EIP ← addr;
  – Ret:  EIP ← MEM[ESP];    ESP += 4;
• Sequences of such instructions are quite common
  – E.g., PUSHing a set of operands and then CALLing a function
• An additional uop updates the ESP register
  – This uop adds an immediate value to, or subtracts it from, the ESP register

Dedicated Stack Engine
• Pentium® M uses dedicated logic near the decoders to update ESP
• The programmer’s view of ESP (ESP_P) is represented by
  – ESP_O: a historic ESP living in the out-of-order execution core
  – ESP_D: a delta maintained in the front end
      ESP_P := ESP_O + ESP_D
• When a sequence of PUSHes and POPs is encountered
  – The accumulated delta value is passed across the decoders and updates ESP_D
  – The ESP_D value is patched into the address syllable of stack-referencing uops
      · The AGU can then calculate the proper memory location referenced by ESP_P

Dedicated Stack Engine
• ESP_D lives in the front end → its calculations are speculative
  – Need to be able to recover the ESP_D and ESP_O values in case of a flush
  – A dedicated table saves the ESP_D value for every instruction
  – ESP_O is maintained by the OOO core like any other general-purpose register
  – ESP_P can be recovered for all instructions
      · Either pre- or post-execution
      · This allows for handling faults or traps as defined in IA32
• The architectural value of ESP may be needed in the OOO core
  – E.g., when ESP is used in an address syllable, or: “XOR ESP,3”
  – Decode logic inserts a sync uop that carries out the ESP_P calculation
  – Following a sync uop, ESP_D is cleared
      · The architectural value is now coherent
  – A sync is not generated when the ESP_D register is zero
      · Continued usage of ESP as a general-purpose register has no ill effects

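A minimal sketch of the bookkeeping described on the stack engine slides (not the actual hardware): the front end accumulates a delta, stack uops carry that delta as a patched displacement, and a sync uop folds the delta into the core's ESP only when an instruction needs the architectural value. The class, method names, and the uop tuples are assumptions for illustration; flush recovery is not modeled.

```python
class StackEngine:
    """Toy model: programmer-visible ESP = esp_o (OOO core) + esp_d (front end)."""

    def __init__(self, esp: int):
        self.esp_o = esp   # value held in the out-of-order core
        self.esp_d = 0     # delta maintained near the decoders
        self.uops = []     # uops delivered to the out-of-order core

    @property
    def esp_p(self) -> int:
        return self.esp_o + self.esp_d

    def push(self, size: int = 4) -> None:
        self.esp_d -= size
        # Store address = esp_o + patched delta; no separate ESP-update uop.
        self.uops.append(("store", self.esp_d))

    def pop(self, size: int = 4) -> None:
        self.uops.append(("load", self.esp_d))
        self.esp_d += size

    def use_esp(self) -> None:
        """An instruction needs the architectural ESP in the OOO core."""
        if self.esp_d != 0:
            self.uops.append(("sync", self.esp_d))  # carries out esp_o + esp_d
            self.esp_o += self.esp_d
            self.esp_d = 0
        self.uops.append(("alu_on_esp", 0))


engine = StackEngine(esp=0x1000)
for _ in range(3):
    engine.push()
engine.pop()
engine.use_esp()   # delta is non-zero: a sync uop is inserted
engine.use_esp()   # delta is zero: no sync uop

sync_count = sum(1 for kind, _ in engine.uops if kind == "sync")
```

In this run the four stack operations produce four memory uops and a single sync uop, instead of four memory uops plus four ESP-update uops.
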
Dedicated Stack Engine Benefits
• Dependencies on ESP are removed
  – The ESP_O value used for scheduling in the out-of-order machine is not changed during a sequence of stack operations
  – The stack operations can be executed in parallel
• ESP_D updates are done using a small dedicated adder
  – Freeing the general execution units to work on other uops
      · Effectively increasing execution bandwidth
  – Saves power: dedicated adders take less power than execution units
• ESP-update uops are eliminated from the out-of-order machine
  – Typically eliminates 5% of the uops (including the ESP sync uops)
      · Effectively increases decode bandwidth → this is the major performance gain
      · Effectively increases ROB and RS size
  – Saves power: eliminated uops don’t toggle bits throughout the machine
      · Energy per instruction decreases

Uop Fusion
• Out-of-order implementations of IA32 break instructions into uops
  – A conventional uop consists of a single operation operating on two sources
• The Instruction Decoder breaks an instruction into multiple uops
  – whenever the instruction operates on more than two sources, or
  – when the nature of the operation requires a sequence of operations
• Splitting the instruction into multiple uops also takes its toll
  – The increased number of uops creates pressure on resources with limited bandwidth (rename, retire) or limited capacity (ROB, RS)
  – Instructions that are decoded into >1 uop can only be decoded by decoder 0
  – Delivering more uops through the system increases the energy required to complete a given instruction sequence
• Pentium® M features uop fusion
  – The Instruction Decoder fuses two uops into one uop
  – The fused uop is seen as 1 uop in allocation, dispatch, and retirement
  – Fused uops are executed as non-fused operations
      · Maintains the benefits of non-fused behavior
  – Reduces performance and energy cost while maintaining the OOOE benefit
• Provides an effectively wider decoder, allocation, and retirement

Uop Fusion (cont.)
• The different domains in which the uop is fused and un-fused
  – The instruction is decoded into a single fused uop by the decoder
  – The fused uop is allocated, renamed, and issued into a single entry in the ROB and RS
      · Each RS entry can accommodate up to three source operands
[Figure: fused uops domain: Decode, Alloc / RAT, RS, ROB; un-fused uops domain: Exe. Units]
• When dispatching to the execution units
  – The dispatcher controls the execution of each portion of the fused uop
      · According to the readiness of its sources
  – Each portion is treated as if it occupied the whole entry for itself
      · Executed in the same way as a non-fused uop

Fused Store
• A store instruction is decoded as two independent uops
  – store-address: calculates the address of the store
  – store-data: stores the data into the Store Data buffer
      · The actual write to memory is done when the store retires
• Separating store-data and store-address is important for memory disambiguation
  – Allows store-address to dispatch earlier, even before the stored data is known
  – Address conflicts are resolved earlier → opens the memory pipeline for other loads
• store-data and store-address can be issued to execution units in parallel
  – Store-address is dispatched to the AGU when its sources (base and index registers) are ready
  – Store-data is dispatched to the store data buffer unit independently, when its source operand is available
• A fused store can retire only after both operations complete
[Figure: fused store flow. Decoded and renamed fused store uop; dispatch store-data (save faults in register file) and dispatch store-address (save faults in register file); retire values when both operations completed]

Fused Load-Op
• A load-op (read-modify) instruction consists of two uops
  – Reads the operand from an address in memory
  – Calculates the result based on the 1st operand and a register operand (and writes the result to a register)
• A load-op instruction may have up to 3 register operands
  – It must be implemented by two uops
• The two operations are inherently serial
  – The op cannot start until the load completes
• The load and the op are issued serially to the relevant execution units
  – The load is dispatched when its sources (base and index registers) are ready
  – The op can be dispatched only after the load completes and the other operand is ready
• A fused load-op uop can retire only after both operations complete
[Figure: fused load-op flow. Decode and rename load-op instruction into fused uop; dispatch load (save faults in register file); dispatch op (save faults in register file); retire values when both operations completed]

Uop Fusion – Best of all Worlds
[Figure: "add eax, dword ptr data" without fusion. The decoder emits separate LD and OP uops, the scheduler holds LD and OP separately, and they go to the cache and the ALU]
[Figure: the same instruction with micro-op fusion. The decoder emits one fused LD + OP uop, the scheduler holds one LD + OP entry, and LD and OP go to the cache and the ALU as independent uops (OOO/superscalar execution)]
• Micro-op fusion enables effective machine utilization
• Achieving >10% micro-op reduction

Uop Fusion Performance
• Uop fusion reduces the number of uops handled by the OOO logic by >10%
  – Increases performance by effectively widening issue, rename, and retire
• The biggest boost is obtained during bursts of memory operations
  – All decoders can decode instructions (instead of only decoder 0)
  – Practically widens the processor decode, allocation, and retirement bandwidth by a factor of three
• The typical performance increase from uop fusion
  – Integer code: 5%, most of it from store fusion
  – FP code: 9%, equally from the two types of fused uops
• Delivering fewer uops through the processor decreases the energy required to complete a given instruction sequence
  – The same task is accomplished by processing fewer uops
• The net power reduction is positive
  – More power is saved than is added by the uop fusion logic

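A minimal sketch of the "factor of three" claim (not a model of the real decoder): it assumes three in-order decoders, of which only decoder 0 accepts instructions that decode into more than one uop, as stated on the Uop Fusion slide. The three-decoder count is inferred from the "factor of three" wording, and the function is an illustrative assumption.

```python
def decode_cycles(uops_per_instr, n_decoders: int = 3) -> int:
    """Cycles needed to decode an in-order instruction stream.

    Decoder 0 accepts any instruction; the other decoders accept only
    instructions that decode into exactly one uop.
    """
    cycles = 0
    i = 0
    n = len(uops_per_instr)
    while i < n:
        cycles += 1
        i += 1                      # decoder 0 takes the next instruction
        slots = n_decoders - 1
        while slots > 0 and i < n and uops_per_instr[i] == 1:
            i += 1                  # a simple decoder takes a 1-uop instruction
            slots -= 1
    return cycles


# A burst of six stores: two uops each without fusion, one fused uop each with it.
unfused_burst = [2] * 6
fused_burst = [1] * 6

if __name__ == "__main__":
    print(decode_cycles(unfused_burst), sum(unfused_burst))  # cycles, uops
    print(decode_cycles(fused_burst), sum(fused_burst))
```

Without fusion every store must wait for decoder 0, so the burst decodes at one instruction per cycle; with fusion all three decoders are usable and the uop count is halved.
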
Idle Periods Prediction
• Predict idle periods and instruct units to reduce power
  – Either by shutting off their clocks or by disabling parts of their logic
  – Resume operations seamlessly with no performance penalty
• Power predictor example: the Allocate stall predictor
  – Whenever the ROB is full, the Allocator stalls the pipeline
  – The Allocator cannot tell if the ROB will remain full on the next cycle
      · It needs to re-evaluate the stall condition every cycle
  – It turns out that in many cases when the ROB is full, it stays so for very long periods
  – The predictor collects information from the ROB and other units
      · To predict the nature of the next cycle
      · To instruct the Allocator to continue stalling and shut off its clocks

Execution Units Stacking
• Identify and activate only the parts of the processor needed for a specific operation
  – EUs attached to an execution port share the same source bus wires
  – Drive only the wires that belong to the target EU
  – EUs are divided into a few segments (stacks)
      · Special logic controls the data flow to each stack according to its actual destination

Early identification of EU width
• IA32 processors operate on data types with different widths
  – Integer operations, operating on 32 bits (the most common)
  – Floating-point operations, operating on 80 bits
  – Multimedia operations, operating on 64 bits or 128 bits
• Toggling a wider bus and reading from a bigger register file consumes more power than is actually required
• Integer operations are identified in advance
  – Narrower buses are used to and from the EU during dispatch and write-back
  – Renaming logic unused for integer operations is not activated
• Effectively transforms the processor into a 32-bit machine
  – Utilizes only the resources needed for integer operations while operating on integers

Backup
Performance/Power Tradeoff Zones
[Figure: performance/power tradeoff zones. Axes: performance change (ticks from -30%, performance loss, to +30%, performance gain) and power change (ticks from -60% to +100%, power gain vs. power loss). An energy breakeven line and a constrained-performance breakeven line divide the plane into zones: wrong trade-off zone; energy loss with constrained perf gain; energy loss with constrained perf loss; energy gain with constrained perf loss; energy gain with constrained perf gain]

The Pentium® M Bus
• Power saving is achieved by protocol and circuit methods
• The bus supports a 100MHz bus clock with a data rate of 400M transfers/sec
• It is a latched bus with an in-order queue of 8 pipelined transactions
• The bus is optimized for a mobile-processor environment
  – Supports only uni-processor
      · Mobile systems’ power budget cannot support dual processors anyway
  – Only 32 address bits, which cover 4GB of physical address space
• The bus saves power aggressively when idle
  – It controls its input buffers’ sense amplifiers, which sample the activity on the bus
  – When the bus is idle, the sense amplifiers are disabled and do not consume any power
  – When the bus is active and address and data are driven on the bus, the input buffers are enabled in advance to ensure all information is captured with no delay
• Data Bus Power Control Signal (DPWR#)
  – Driven by the 855PM chipset whenever data are transferred to the processor
  – DPWR# is used to dynamically enable the processor’s 64-bit data bus input sense amplifiers and their related controls (~80 signals) only when data are transferred to the bus
• BPRI Control
  – This is a method to achieve the DPWR# functionality for the address bus
  – BPRI# is asserted whenever the 855PM chipset attempts to drive the bus
      · Used to dynamically enable the 32-bit address bus input sense amplifiers and their related controls (~40 signals) only when a transaction is issued to the bus

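Worked example (a sketch, not part of the original slides): the arithmetic behind the bus figures quoted above. All inputs come from the slide; the constant names are illustrative, and the bandwidth is a theoretical peak that ignores protocol overhead.

```python
BUS_CLOCK_HZ = 100e6        # 100MHz bus clock
TRANSFERS_PER_SEC = 400e6   # 400M transfers/sec
DATA_BUS_BITS = 64          # 64-bit data bus
ADDRESS_BITS = 32           # 32 address bits

# Data transfers per bus clock cycle.
transfers_per_clock = TRANSFERS_PER_SEC / BUS_CLOCK_HZ

# Theoretical peak data bandwidth in bytes per second.
peak_bandwidth_bytes = TRANSFERS_PER_SEC * DATA_BUS_BITS / 8

# Size of the physical address space in bytes.
address_space_bytes = 2 ** ADDRESS_BITS

if __name__ == "__main__":
    print(f"{transfers_per_clock:.0f} transfers per clock")
    print(f"{peak_bandwidth_bytes / 1e9:.1f} GB/s peak")
    print(f"{address_space_bytes // 2**30} GB addressable")
```
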
The Pentium® M Bus
• Low Vtt:
  – The processor’s I/O buffers work at a low voltage of 1.05V (Vtt)
  – The low Vtt is an essential element in reducing the bus power
  – Operating at low Vtt introduces a new set of problems
      · The I/O buffer works at the low linear point, which affects the buffer’s characteristics
  – The bus includes a special Resistor Compensation (RCOMP) method to adjust the buffer strength dynamically during run time
  – Accommodates the impacts of temperature, voltage drift, and bus topology
  – At any thermal and power state the bus has full impedance termination
  – It has split power planes that allow setting the I/O operating voltage to a fixed value of 1.05V even though the core may be operating at a higher Enhanced Intel SpeedStep™ technology operating point
• PSI: Power Status Indicator
  – Driven by the processor to control the current consumption of the Voltage Regulator when the processor operates at a low power state
  – Reduces the overall platform power (not just the processor power!)

Enhanced SpeedStep™ Technology
• The “Basic” SpeedStep™ Technology had
  – 2 operating points
  – Non-transparent switch
• The “Enhanced” version provides
  – Multiple voltage/frequency operating points. The Pentium M processor 1.6GHz operation ranges:
      · From 600MHz @ 0.956V
      · To 1.6GHz @ 1.484V
  – Transparent switch
  – Frequent switches
• Benefits
  – Higher power efficiency
      · 2.7X lower frequency → 2X performance loss → >2X energy gain
  – Outstanding battery life
  – Excellent thermal mgmt.
[Figure: “Voltage, Frequency, Power”. Frequency (GHz) and typical power (watts) plotted against voltage (volts, 0.8 to 1.6). Annotations: 2.7X, 6.1X, efficiency ratio = 2.3]

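Worked example (a sketch, not part of the original slides): comparing the two quoted operating points with the active-power relation P ∝ V²f. It assumes constant activity and capacitance and ignores leakage, so the result is only an estimate and need not match the measured ratios annotated in the chart.

```python
LOW = (600e6, 0.956)    # (frequency in Hz, voltage in V): lowest operating point
HIGH = (1.6e9, 1.484)   # highest operating point


def relative_active_power(freq_hz: float, voltage_v: float) -> float:
    """Active power up to the constant factor alpha * C (assumed unchanged)."""
    return voltage_v ** 2 * freq_hz


freq_ratio = HIGH[0] / LOW[0]
power_ratio = relative_active_power(*HIGH) / relative_active_power(*LOW)
# Energy per clock cycle scales as power / frequency, i.e. as V^2.
energy_per_cycle_ratio = power_ratio / freq_ratio

if __name__ == "__main__":
    print(f"frequency ratio:        {freq_ratio:.2f}x")
    print(f"active power ratio:     {power_ratio:.2f}x")
    print(f"energy per cycle ratio: {energy_per_cycle_ratio:.2f}x")
```

Under these assumptions the low operating point runs about 2.7 times slower but spends well under half the energy per cycle, which is the "higher power efficiency" benefit listed above.
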

Voltage, Power, Frequency
• A transistor switches faster at higher voltage → higher voltage enables higher frequency
• Maximum frequency grows about linearly with voltage
  ...within a given voltage range Vmin-Vmax
  – V < Vmin → transistors won’t switch
  – V > Vmax → the device may burn
• “The cube law”: P ≈ kV³ (or ~1% V = 3% P)
• Implications
  – Can save energy/power when performance is not a factor
[Figure: XScale processor frequency (MHz) and power (mWatt) vs. voltage (0.5 V to 1.9 V) *]
* Source: Intel Corp. (http://developer.intel.com)