45nm Next Generation
Intel® Core™
Microarchitecture (Penryn)
HOT CHIPS 2007
Varghese George
Principal Engineer, Intel Corp
Legal Disclaimer
Today’s presentation may contain forward-looking statements. All statements made that are not historical facts are subject
to a number of risks and uncertainties, and actual results may differ materially. Please refer to our most recent Earnings
Release and our most recent Form 10-Q or 10-K filing available on our website for more information on the risk factors that
could cause actual results to differ.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE,
EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED
BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH
PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED
WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES
RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT,
COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life
saving, or life sustaining applications.
Intel may make changes to specifications and product descriptions at any time, without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined."
Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising
from future changes to them.
The Intel® Core™ Microarchitecture, Intel® Pentium, Intel® Pentium II, Intel® Pentium III, Intel® Pentium 4, Intel® Pentium
Pro, Intel® Pentium D, Intel® Pentium M , Itanium®, Xeon® may contain design defects or errors known as errata which
may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product
order.
This document contains information on products in the design phase of development. Do not finalize a design with this
information. Revised information will be published when the product is available. Verify with your local sales office that you
have the latest datasheet before finalizing a design.
All dates specified are target dates, are provided for planning purposes only and are subject to change.
All products, dates, and figures specified are preliminary based on current expectations, provided for planning purposes
only, and are subject to change without notice.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and
other countries.
*Other names and brands are the property of their respective owners.
Copyright © 2007, Intel Corporation
Outline
Next Generation 45nm Process Technology benefits
Overview of Penryn Architecture and New Features
Highlights of New Microarchitecture changes
New Power Management Features in Penryn
• Deep Power Down Technology
• Enhanced Dynamic Acceleration Technology
• CC3 in Servers
Summary
Next generation 45 nm Hi-K metal gate process
technology: Continuing Moore’s Law
65 nm
Intel® Core™ 2 Duo
143 mm2
4-wide Dynamic Execution
Advanced smart cache: 4M L2
Smart Memory access
Advanced digital media boost
Intelligent power capability
45 nm
Penryn family die (dual core)
107 mm2
Same good stuff plus:
- 6M L2 cache
- 47 new SSE4 instructions
- Micro-arch enhancements for further IPC perf
- Deep Power Down Technology*
- Enhanced Dynamic Acceleration Technology*
Penryn advantages with this next generation process:
• Lower transistor switching power and leakage current => reduced idle power
=> longer battery life, quieter systems
• Up to 2x improvement in transistor density: Significant area scalability to
support additional capabilities within the existing power envelope
• Higher core and bus frequency for more performance in the same power
envelope
* Mobile only features
Enhanced Intel® Core™ Microarchitecture
Intel Core Microarchitecture feature: New with the Penryn Family
• Intel® Wide Dynamic Execution: Fast Radix-16 Divider; Faster OS Primitive Support; Enhanced Intel Virtualization Technology
• Intel® Advanced Smart Cache: Larger Caches (up to 12MB); 24 Way Set Associativity
• Intel® Smart Memory Access: Improved Store Forwarding; Higher bus speeds
• Intel® Advanced Digital Media Boost: Intel SSE4 instructions; Super Shuffle Engine
• Intel® Intelligent Power Capability: Deep Power Down Technology; Enhanced Intel Dynamic Acceleration Tech
Increased Performance and Energy Efficiency Across Applications
Legend: Covered briefly in this presentation / Focus of this presentation
New Intel® SSE4 Instructions
Intel® SSE (1999): 70 instructions; single-precision vectors, streaming operations
Intel® SSE2 (2000): 144 instructions; double-precision vectors, 128-bit vector integer
Intel® SSE3 (2004): 13 instructions; complex arithmetic
Intel® SSSE3 (2006): 32 instructions; decode
Intel® SSE4 (2007): 47 instructions; video accelerators, graphics building blocks, coprocessor accelerators
• Penryn supports 47 new instructions
• Largest set of new instructions introduced since 2000
• This continues a trend set by SSE, SSE2, SSE3, and SSSE3
• Addresses some popular requests for key functionality
• Features to improve the compiler’s ability to efficiently vectorize code
• Video Encode acceleration functions
• Floating-point dot product operation (3D content)
• Streaming load for high b/w to WC memory (imaging, GPU-CPU sharing)
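As an illustration of the floating-point dot product listed above, the following is a minimal sketch of the lane-level behavior of the SSE4.1 packed single-precision dot product (DPPS, exposed in C as the `_mm_dp_ps` intrinsic). The function name `dpps` and the list-of-four representation are assumptions made for this sketch, and Python floats are double precision, so rounding differs from the hardware.

```python
def dpps(a, b, imm8):
    """Emulate the lane selection of a 4-lane packed dot product.

    a, b : sequences of four floats (one 128-bit register each).
    imm8 : bits 7..4 choose which lane products enter the sum,
           bits 3..0 choose which result lanes receive the sum.
    """
    # Multiply only the lanes enabled by the high nibble of imm8.
    products = [a[i] * b[i] if imm8 & (1 << (4 + i)) else 0.0 for i in range(4)]
    # Pairwise summation: (lane0 + lane1) + (lane2 + lane3).
    total = (products[0] + products[1]) + (products[2] + products[3])
    # Write the sum to the lanes enabled by the low nibble, zero elsewhere.
    return [total if imm8 & (1 << i) else 0.0 for i in range(4)]


# 3-component dot product (lanes 0..2), result placed in lane 0 only.
result = dpps([1.0, 2.0, 3.0, 0.0], [4.0, 5.0, 6.0, 0.0], 0x71)
print(result)  # [32.0, 0.0, 0.0, 0.0]
```

The mask `0x71` is a common choice for 3D content: it ignores the fourth component and leaves the scalar result in the lowest lane.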
Super Shuffle Engine
• Shuffle operations required for SSE data formatting operations:
- Unpacking
- Packing
- Align concatenated sources
- Wide shifts
- Insertion and extraction
- Setup for horizontal arithmetic functions
• Penryn super shuffle engine performs 128 bit operation in a single cycle
• No software changes required
• Doubles shuffle throughput
[Figure: input data (2 x 16 bytes: src1 and src2, bytes B0 to B15 each) passes through the Super Shuffle engine and logic ops to produce dest, the output data (16 bytes, B0 to B15)]
New Radix-16 divider architecture
[Figure: divider block diagram with Dividend and Divisor inputs, Pre-processing, quotient-select logic (QSL) and a hybrid 68b adder]
• Leading edge divider performance
• >50% speed-up over previous generations on average
• Innovative radix-16 based architecture
• Double pumped: processes 2x2 bits per cycle
• Hybrid 68-bit CSA/CPA based for faster pipelined quotient select
• Optimized square root
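To make the radix-16 idea concrete, here is a minimal sketch of a digit-recurrence integer divide that retires 4 quotient bits (one hexadecimal digit) per iteration. It is a plain restoring-style illustration, not the Penryn design: the quotient-select logic, the hybrid CSA/CPA adders and the double pumping are not modeled, and the function name `divide_radix16` and the `width` parameter are assumptions of this sketch.

```python
def divide_radix16(dividend: int, divisor: int, width: int = 32):
    """Unsigned division producing one radix-16 quotient digit per step."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    if width % 4 != 0:
        raise ValueError("width must be a multiple of 4 bits")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend does not fit in width bits")

    quotient = 0
    remainder = 0
    # Walk the dividend from the most significant nibble to the least.
    for shift in range(width - 4, -1, -4):
        # Bring down the next 4 dividend bits.
        remainder = (remainder << 4) | ((dividend >> shift) & 0xF)
        # Stand-in for quotient selection: because remainder < 16 * divisor
        # here, the digit is always in the range 0..15.
        digit = remainder // divisor
        remainder -= digit * divisor
        quotient = (quotient << 4) | digit
    return quotient, remainder


print(divide_radix16(1000, 7))  # (142, 6)
```

A 32-bit divide takes 8 iterations in this form, versus 32 iterations for a one-bit-per-step (radix-2) recurrence, which is the basic motivation for a higher radix.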
[Figure, continued: a second QSL and hybrid 68b adder, <<2 shifts, Post-processing, and the Quotient/remainder output]
[Chart: Maximum Processor Instruction Latency (lower is better), in clocks, previous generation vs. Penryn, for 32-bit int divide, SP divide, DP divide, EP divide, SP square root, DP square root and EP square root]
Streaming Load Instruction
• 16 byte Aligned load instruction on WC (write combining) memory
• Improves read bandwidth from WC mem by reading cache-line size quantities
[Figure: WC memory read paths on existing platforms vs. with streaming loads, showing the 45nm Next Generation Core™ processor, L2 cache, bus, DDR3 memory and graphics (GFX) memory. Labels: "Temp Buffer inside", "L1 Data Cache"; 1st to 4th accesses (A to D) as data is used in the processor core.]
• Existing platforms: high BW writes, but low BW reads: 800 MB/s (partial reads)
• With streaming loads: fast streaming reads, up to 6.4 GB/s (cache line reads)
Streaming Load is ~8X faster reading from WC Memory
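The ~8X figure follows directly from the two bandwidth numbers on the slide. A short worked calculation, assuming a 64-byte cache line (the line size is an assumption of this sketch; the slide only says "cache-line size quantities"):

```python
# Bandwidth figures quoted on the slide, in MB/s.
partial_read_mb_s = 800      # existing platforms: partial reads from WC memory
streaming_read_mb_s = 6400   # with streaming loads: cache-line reads (6.4 GB/s)

speedup = streaming_read_mb_s / partial_read_mb_s

# Assumed 64-byte cache line; each streaming load moves 16 bytes.
CACHE_LINE_BYTES = 64
LOAD_BYTES = 16
loads_per_line = CACHE_LINE_BYTES // LOAD_BYTES

print(speedup)         # 8.0
print(loads_per_line)  # 4
```

Four 16-byte loads per line is consistent with the 1st to 4th accesses shown in the figure.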
Power Management Overview
Penryn builds on the capabilities in the Core™ microarchitecture
• Fine grained clock gating during runtime: Applied all across the chip
• Sleep transistors: In large arrays such as L2 cache to reduce leakage power
• C-states: Various low power states during CPU idle
• P-states: Multiple performance states for run-time energy efficiency
• Thermal Monitor: Self-managed thermal control based on multiple thermal sensors
• Digital Thermal Sensor: For software access to CPU temperature
• Intelligent Voltage Regulator control: To optimize VR efficiency during low loads
• PECI interface: For side-band access to temperature for fan control
Key New Mobile Power/Thermal Features in Penryn
• Deep Power Down Technology (DPD): A radically new and innovative
idle power management state
• Enhanced Dynamic Acceleration Technology (EDAT): A deterministic
and (TDP) power-neutral performance boost for Single Threaded apps
Deep Power Down Technology Overview
Concept: When idle, required CPU state is saved in a power-preserved domain on
chip and then power is shut off
Exit: Upon a break event, Power/Clocks are turned on, State is restored and
execution resumes seamlessly from where it left off transparent to OS / applications
• Can occur a few hundred times per second (between keystrokes)
Flexible C-States to Select Idle Power Level vs. Responsiveness
[Table: active state C0, C1, C3, C4 and the Deep Power Down technology state (new for Penryn), compared by core voltage*, core clock, PLL, L1 caches, L2 cache, wakeup time* and idle power*. Entries range from active through flushed and partial flush to off.]
State in DPD:
• Vcc: << Retention
• Core clk: OFF
• PLL: OFF
• Caches: OFF
Key benefits of DPD
• Lowest Leakage CPU state you can get to!
• Agnostic to min Vcc state retention issues
* Rough approximation
DPD enables reaching lower limit of CPU idle power of 0 W
Penryn DPD Implementation Overview
4 Major parts:
MICROCODE:
• Does state save and restore
• Manages core synchronization
• Supports Auto-demote policy
PM UNIT (“hardware”):
• Coordinates cores at hardware level
• Manages the DPD Power up sequence
• Manages entry/exit protocol with platform
STATE STORAGE:
• 8KB (per core)
• Powered from I/O Vcc (VccP)
• ECC protected
STATE DEFINITION:
• What to include?
• Criteria: “Software seamless”
• Inclusions:
- All Architectural state
- Most micro-architectural state
• Exclusions:
- Temp registers used by ucode
- Some others on a case by case basis
[Figure: 8KB SRAMs powered from VccP]
Deep Power Down Technology Entry/Exit
• S/W instruction initiates processor DPD entry
• CPU does rest of sequencing with chipset/platform
• Protocol with chipset to block snoops due to DMA traffic (no CPU wakeup required) while in DPD state
• Exit initiated by a break event in platform (interrupt) through the chipset
• CPU sequences through rest of exit sequence, including driving VID to VRM, internal hardware reset, state restore and execution resumption
• Small amount of logic is left powered up by VccP during DPD
[Figure: CPU supplied by Vcc and VccP; VRM with VID control; FSB I/F carrying I/O requests to the chipset]
Entry/exit sequence:
1. OS: decides to idle the processors
2. OS: execute MWAIT instruction
3. CPU: shrink L2; save uarch state
4. CPU: signal chipset to enter DPD
5. C/S: stops CPU clk, blk I/O
6. CPU: VRM dn
7. Deep power down state
8. Interrupt break event
9. C/S: signals CPU wakeup, VRM ramp, bclk
10. CPU: internal RESET, PLL relock etc
11. CPU: restore arch/uarch state, expand L2
12. CPU: continue execution to next instr.
Interrupt Rate Sensitivity of Average Power
Deeper C-states have non-trivial energy cost for transitions
Today, OS makes C-state decisions based on latency, not energy cost
• Latency could be ok, but energy cost could be significantly different
Too frequent transitions into deep C-states could result in net energy loss
• Worse battery life
Penryn implements an “Autodemote” policy in the CPU to address this issue
[Chart: Example of average power on MM02 vs. interrupt rate (0.5ms, 1ms, 2ms, 3ms, 4ms, 5ms, 10ms, 15ms, 20ms, 50ms, limit), with C4 and with Deep Power Down technology. Deep Power Down crossover point: ~3-4ms]
• Heuristics detect when the energy cost of DPD state entry and exit is
potentially > savings in DPD state
• Demotes such cases to shallower
C-state (C4), although the OS
requested DPD
• Silicon results prove that this was a
good decision!
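The crossover point can be reasoned about with a simple energy model. The sketch below is an assumption-laden illustration, not Penryn's actual heuristic (which the slides do not describe): each idle state is given a residency power and a fixed entry-plus-exit energy cost, and the deeper state is demoted when the predicted idle time is shorter than the break-even duration. All names and numeric values are hypothetical, chosen only so that the break-even lands near the ~3-4 ms crossover shown in the chart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleState:
    name: str
    idle_power_w: float         # power while resident in the state (hypothetical)
    transition_energy_j: float  # entry + exit energy cost (hypothetical)


def break_even_s(shallow: IdleState, deep: IdleState) -> float:
    """Idle duration above which `deep` costs less total energy than `shallow`.

    Solves E_deep + P_deep * t = E_shallow + P_shallow * t for t.
    """
    extra_cost_j = deep.transition_energy_j - shallow.transition_energy_j
    saving_rate_w = shallow.idle_power_w - deep.idle_power_w
    if saving_rate_w <= 0:
        return float("inf")  # the deeper state never pays off
    return max(0.0, extra_cost_j / saving_rate_w)


def autodemote(requested: IdleState, fallback: IdleState,
               predicted_idle_s: float) -> IdleState:
    """Grant the requested deep state only if it is expected to save energy."""
    if predicted_idle_s < break_even_s(fallback, requested):
        return fallback
    return requested


# Hypothetical numbers for illustration only.
C4 = IdleState("C4", idle_power_w=0.30, transition_energy_j=50e-6)
DPD = IdleState("DPD", idle_power_w=0.20, transition_energy_j=400e-6)

print(autodemote(DPD, C4, predicted_idle_s=0.001).name)  # C4
print(autodemote(DPD, C4, predicted_idle_s=0.010).name)  # DPD
```

With these values the break-even is about 3.5 ms: interrupts arriving every 1 ms lead to demotion to C4, while a 10 ms idle period keeps the OS-requested DPD state.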
Source: Intel
DPD Results (Average Power)
[Chart: Average Power (MM05* - Office Productivity) vs. leakage current for Merom (Core™ 2 Duo processor), Penryn with DPD disabled, and Penryn with DPD enabled; annotated reductions of 27% and 44%]
Actual results measured in Penryn silicon: up to 44% reduction in average power
Source: Intel Corp, June 2007**
• 27% to 44% (based on the leakage of the part) reduction on Average Power as
measured by Mobile Mark – Office Productivity benchmark due to DPD feature
• Significant improvement compared to previous generation (Merom)
• Measured exit latency for DPD state: ~150-200 µs => in expected range
**Config: Data from a sample distribution of parts with various leakage
measured on tester, under respective product POR Vcc, temperature etc
MM05 – BAPCO* Mobile Mark* 2005 benchmark
Enhanced Dynamic Acceleration Technology (EDAT)
Concept: In multi-core CPUs, use the power headroom of idle core to boost
performance of the non-idle core
How it works:
When one core enters an idle power C-state (CC3 or deeper) AND
OS requests a higher performance state on the running core,
The non-idle core is boosted up to a higher voltage, higher frequency (EDAT freq)
Overall chip Power still remains within the specified Thermal Design Power (TDP)
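The entry condition above can be written as a small predicate. This is a minimal sketch under stated assumptions: the C-state labels and their depth ranking are hypothetical (only "CC3 or deeper" comes from the slide), and `edat_granted` and `within_tdp` are illustrative names, not part of any Intel interface.

```python
# Hypothetical depth ranking of core C-states; a larger number means deeper.
CSTATE_DEPTH = {"CC0": 0, "CC1": 1, "CC3": 3, "CC4": 4, "DPD": 6}


def edat_granted(os_requests_top_pstate: bool, other_core_cstate: str) -> bool:
    """EDAT applies when the OS asks for more performance on the running
    core AND the other core is idle in CC3 or a deeper C-state."""
    other_core_idle = CSTATE_DEPTH[other_core_cstate] >= CSTATE_DEPTH["CC3"]
    return os_requests_top_pstate and other_core_idle


def within_tdp(core_powers_w, tdp_w: float) -> bool:
    """Check the slide's constraint P = P0 + P1 <= TDP."""
    return sum(core_powers_w) <= tdp_w


print(edat_granted(True, "CC3"))   # True
print(edat_granted(True, "CC1"))   # False
print(edat_granted(False, "DPD"))  # False
```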
[Figure: power vs. frequency for CORE 0 and CORE 1 in two cases, "2 cores active: Guaranteed (marked) frequency" and "Single core active: EDAT freq" (core #2 at C3-DPD). Annotations: P = P0 + P1 <= P (TDP Spec); P = P0 + P1 = TDP Spec; Min Vcc; TDP Top Freq.; EDAT Bin; Leakage]
Applicability: EDAT is applicable mostly to thermally constrained platforms such as mobile where the guaranteed frequency is not max Vcc limited
EDAT provides a significant and predictable
single-threaded performance boost
EDAT Implementation Overview
[Figure: OS P-state request “P[0]” passes through clipping logic that selects between Max F/V, EDAT F/V and Guar. F/V; EDAT control logic takes the core CC-state and an EDAT disable input, with a hysteresis mechanism]
Microarchitecture
• Entry based on OS request AND other core idle
• Idle core defined as “CC3” or deeper C-state
• EDAT Freq pre-programmed in chip based on
power, reliability and other constraints
• Exit EDAT mode when Idle core wakes up
Hysteresis mechanism
• Allows short durations where 2 cores active
• Reduces perf loss for low activity wakeups
• Implemented using a few counters
• Voltage Regulator needs to provide for this
• Benefits most at high timer tick rates
OS interface
• EDAT Freq exposed as {Guaranteed + epsilon}
P-state in ACPI table: P[0]
• OS requests P[0] state if perf demand exists
• EDAT logic grants it if power headroom exists
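The slides say only that the hysteresis mechanism tolerates short periods with both cores active and is built from a few counters. One way such a counter could behave is sketched below; the class name `EdatHysteresis`, the per-tick `step` interface and the `grace_ticks` threshold are all assumptions of this sketch rather than details of the Penryn implementation.

```python
class EdatHysteresis:
    """Keep EDAT mode across brief wakeups of the otherwise idle core."""

    def __init__(self, grace_ticks: int):
        self.grace_ticks = grace_ticks  # hypothetical tolerance, in ticks
        self.awake_ticks = 0            # consecutive ticks the other core is awake
        self.in_edat = False

    def step(self, os_requests_p0: bool, other_core_idle: bool) -> bool:
        """Advance one tick and return whether EDAT mode is active."""
        if not os_requests_p0:
            # No performance demand: leave EDAT and reset the counter.
            self.in_edat = False
            self.awake_ticks = 0
        elif other_core_idle:
            # Normal entry condition: OS request AND other core idle.
            self.in_edat = True
            self.awake_ticks = 0
        elif self.in_edat:
            # Other core woke up: tolerate it for a limited number of ticks.
            self.awake_ticks += 1
            if self.awake_ticks > self.grace_ticks:
                self.in_edat = False
        return self.in_edat


h = EdatHysteresis(grace_ticks=2)
idle_pattern = [True, False, False, True, False, False, False, True]
trace = [h.step(True, idle) for idle in idle_pattern]
print(trace)  # [True, True, True, True, True, True, False, True]
```

In the trace, a two-tick wakeup does not end EDAT mode, while a three-tick wakeup does. This mirrors the stated goal of reducing performance loss for low-activity wakeups at high timer tick rates.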
EDAT Results (Performance)
[Chart: EDAT Performance on SPEC CPU2000 (Estimated). Relative performance (axis 0.96 to 1.08) for SPECint_base2000 (estimated) and SPECfp_base2000 (estimated) in four configurations: Baseline: EDAT OFF; EDAT ON, low interrupt rate; EDAT ON, high interrupt rate; EDAT ON with hysteresis, high interrupt rate]
Actual results measured in Penryn silicon: up to 7% boost in ST perf due to EDAT
Source: Intel Corp, as of June 2007*
• 5% (on SPECfp_base2000 (est*)) and 7% (on SPECint_base2000 (est*)) performance
gains due to EDAT within the same TDP power envelope!**
• High timer tick rates would have reduced the gains by a few % due to frequent wake up of
idle core and resulting exits from EDAT mode
• Intelligent hysteresis mechanism in Penryn recovers most of that loss
**Configuration used for measurement: Matanzas CRB board, Crestline chipset DDR2-667
memory, Penryn A1 silicon @ 2.4 GHz, FSB 800, OS: Windows XP SP2
*Estimated SPEC metrics based on measurements on preproduction
platforms. SPEC, SPECint and SPECfp are trademarks of SPEC. For
more information on these benchmarks, see www.spec.org
New PM feature for Servers: CC3 State
Concept:
• Snoops burn ~30% of active core power
• By avoiding snoops into idle cores, we save power
How it works:
• Today, idle cores are put into Core C1 (CC1), which is a snoopable state
• In Penryn, idle cores can be put into Core C3 (CC3), which is a non-snoopable state.
• First level caches are flushed into the L2 cache before putting cores into CC3
• This prevents cross core snoops and therefore the additional power burnt for snoops
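The effect of a non-snoopable idle state can be sketched as a filter over core states. This is a simplified illustration, not the hardware protocol: the state labels, the `SNOOPABLE` set and the function `cores_to_snoop` are assumptions made for this sketch.

```python
# States whose first-level caches may still hold data and must be snooped.
# CC3 is excluded because first-level caches are flushed into L2 on entry.
SNOOPABLE = {"C0", "CC1"}


def cores_to_snoop(core_states: dict, requester: str) -> list:
    """Return the cores that need a cross-core snoop for a request."""
    return [
        core
        for core, state in sorted(core_states.items())
        if core != requester and state in SNOOPABLE
    ]


# Idle cores parked in CC1: every idle core is still snooped.
print(cores_to_snoop({"core0": "C0", "core1": "CC1"}, requester="core0"))
# ['core1']

# Idle cores parked in CC3: no cross-core snoop is needed.
print(cores_to_snoop({"core0": "C0", "core1": "CC3"}, requester="core0"))
# []
```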
[Figure: snoop paths with idle cores in CC1 vs. idle cores in CC3 (Core0 in CC3; C1E), showing active and idle cores, Core 0 and Core 1 with their L2 caches, L2 lookup, L2 snoop, core snoop and cross-core snoop, and DRAM access through the chipset to memory]
CC3 is a simple but effective optimization for power savings
CC3 Results (Power savings)
[Chart: Power savings due to CC3 in Web Server workloads. % perf loss and % CPU pwr reduction (axis 2.0% to -18.0%) vs. % CPU utilization (axis 0.0% to 100.0%)]
Up to 16% CPU power reduction measured on web server workloads with CC3 feature
Source: Intel Corp, as of June 2007*
• Up to ~16%** CPU power reduction measured on Web Server workload with
negligible performance loss with CC3 feature enabled
• Similar power reductions seen on SPEC CPU2000 (estimated by measurement on
preproduction silicon) (single user mode), other idle power benchmarks for Server
*Configuration used for measurement: Bridgeport2, DP Clovertown 2.33 GHz, 1333 MHz
Bus, 4x2GB FB DIMM 667 MHz, OS: Linux 2.6.18.1; Workload: Specweb99; DBS OFF;
Measured by mapping CC3 as CC1 in BIOS
** Note that the power savings due to this feature will be system
configuration, OS and workload dependent.
Summary
New generation Intel® Core™ microarchitecture is leading
the way with 45 nm process technology
New microarchitecture enhancements for higher
performance and improved energy efficiency
New SSE4 instructions for improved video, imaging, and 3D
content performance
New Power management features for improved battery life,
performance and energy efficiency
Moore’s Law is alive and well with Penryn;
Innovative features with tangible benefits being
introduced for Performance and Power
Acknowledgements
To the entire Penryn team across Folsom, Santa Clara,
Costa Rica and Israel and the Technology Development
Group for developing and delivering this product to market
Legal Notices and Important Information
Regarding the performance and power measurements in this presentation
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different
processor families. See http://www.intel.com/products/processor_number for details.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel
products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers
should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information
on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/index.htm
Intel may make changes to specifications, release dates and product descriptions at any time, without notice. Intel, Pentium and the Intel logo are
trademarks or registered trademarks of the Intel Corporation or its subsidiaries in the United States and other countries.
*Other names and brands may be claimed as the property of others