Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Wireless power transfer wikipedia , lookup
Electric power system wikipedia , lookup
History of electric power transmission wikipedia , lookup
Power over Ethernet wikipedia , lookup
Electrification wikipedia , lookup
Switched-mode power supply wikipedia , lookup
Audio power wikipedia , lookup
Alternating current wikipedia , lookup
Immunity-aware programming wikipedia , lookup
Power engineering wikipedia , lookup
Magnetic-core memory wikipedia , lookup
45nm Next Generation Intel® Core™ Microarchitecture (Penryn) HOT CHIPS 2007 Varghese George Principal Engineer, Intel Corp Legal Disclaimer Today’s presentation may contain forward-looking statements. All statements made that are not historical facts are subject to a number of risks and uncertainties, and actual results may differ materially. Please refer to our most recent Earnings Release and our most recent Form 10-Q or 10-K filing available on our website for more information on the risk factors that could cause actual results to differ. INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The Intel® Core™ Microarchitecture, Intel® Pentium, Intel® Pentium II, Intel® Pentium III, Intel® Pentium 4, Intel® Pentium Pro, Intel® Pentium D, Intel® Pentium M , Itanium®, Xeon® may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. This document contains information on products in the design phase of development. Do not finalize a design with this information. Revised information will be published when the product is available. Verify with your local sales office that you have the latest datasheet before finalizing a design. All dates specified are target dates, are provided for planning purposes only and are subject to change. All products, dates, and figures specified are preliminary based on current expectations, provided for planning purposes only, and are subject to change without notice. Intel and the Intel logo is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands are the property of their respective owners. Copyright © 2007, Intel Corporation 2 Outline Next Generation 45nm Process Technology benefits Overview of Penryn Architecture and New Features Highlights of New Microarchitecture changes New Power Management Features in Penryn • Deep Power Down Technology • Enhanced Dynamic Acceleration Technology • CC3 in Servers Summary 3 Next generation 45 nm Hi-K metal gate process technology: Continuing Moore’s Law 65 nm Intel® Core™ 2 Duo 143 mm2 4-wide Dynamic Execution Advanced smart cache: 4M L2 Smart Memory access Advanced digital media boost Intelligent power capability 45 nm Penryn family die (dual core) 107 mm2 Same good stuff plus: - 6M L2 cache - 47 new SSE4 instructions - Micro-arch enhancements for further IPC perf - Deep Power Down Technology* - Enhanced Dynamic Acceleration Technology* Penryn advantages with this next generation process: • Lower transistor switching power and leakage current => reduced idle power => longer battery life, quieter systems • Up to 2x improvement in transistor density: Significant area scalability to support additional capabilities within the existing power envelope • Higher core and bus frequency for more performance in the same power envelope * Mobile only features 4 Enhanced Intel® Core™ Microarchitecture Intel Core Microarchitecture New with the Penryn Family Fast Radix-16 Divider Faster OS Primitive Support Enhanced Intel Virtualization Technology Intel® Wide Dynamic Execution Intel® Advanced Smart Cache Larger Caches: up to 12MB 24 Way Set Associativity Intel® Smart Memory Access Improved Store Forwarding Higher bus speeds Intel® Advanced Digital Media Boost Intel® Intelligent Power Capability Intel SSE4 instructions Super Shuffle Engine Deep Power Down Technology Enhanced Intel Dynamic Acceleration Tech Increased Performance and Energy Efficiency Across Applications Covered briefly in this presentation Focus of this presentation 5 New Intel® SSE4 Instructions Intel® SSE 1999 70 instructions •SinglePrecision Vectors, •streaming operations Intel® SSE2 Intel® SSE3 Intel® SSSE3 2000 2004 2006 144 instructions 13 instructions 32 instructions •Doubleprecision Vectors •Complex Arithmetic • Decode •128-bit vector integer Intel® SSE4 2007 47 instructions • Video Accelerators • Graphics building blocks • Coprocessor Accelerators • Penryn supports 47 new instructions • Largest set of new instructions introduced since 2000 • This continues a trend set by SSE, SSE2, SSE3, and SSSE3 • Addresses some popular requests for key functionality • Features to improve the compiler’ compiler’s ability to efficiently vectorize code • Video Encode acceleration functions • Floating-point dot product operation (3D content) • Streaming load for high b/w to WC memory (imaging, GPU-CPU sharing) 6 Super Shuffle Engine Input data (2 x 16 bytes) src1 src2 B0 B0 B1 B1 B2 B2 B3 B3 B4 B4 ... ... B14 B15 • Shuffle operations required for SSE data formatting operations • • • • • • B14 B15 ... ... Super Shuffle engine ... logic op logic op Unpacking Packing Align concatenated sources Wide shifts Insertion and extraction Setup for horizontal arithmetic functions • Penryn super shuffle engine performs 128 bit operation in a single cycle ... dest B0 B1 B2 B3 B4 ... B14 B15 • No software changes required Doubles shuffle throughput Output data (16 bytes) 7 New Radix-16 divider architecture Dividend Divisor • Leading edge divider performance • >50% speed-up over previous generations on average Pre-processing QSL hybrid 68b adder • Innovative radix-16 based architecture • Double pumped: processes 2x2 bits per cycle • Hybrid 68-bit CSA/CPA based for faster pipelined quotient select • Optimized square root <<2 Maximum Processor Instruction Latency (lower is better) hybrid 68b adder <<2 Clocks QSL Post-processing Quotient/remainder 32-bit int divide SP divide DP divide EP divide Previous generation DP SP square square root Penryn root EP square root 8 Streaming Load Instruction • 16 byte Aligned load instruction on WC (write combining) memory • Improves read bandwidth from WC mem by reading cache-line size quantities Temp Buffer inside L1 Data Cache C D GFX But … Low BW Reads 800 MB/s (partial reads) 45nm Next 4th Access Access 2nd Access 1st Access 3rd (Data usage in processor core) Generation Core ™ GFX Fast Streaming Reads! DDR3 B Core ™ DDR3 A High BW Writes With Streaming Loads L2 Cache DDR3 DDR3 Bus Existing Platforms Graphics Mem Up to 6.4 GB/s (cache line reads) Streaming Load is ~8X faster reading from WC Memory 9 Power Management Overview Penryn builds on the capabilities in the Core™ microarchitecture • Fine grained clock gating during runtime: Applied all across the chip • Sleep transistors: In large arrays such as L2 cache to reduce leakage power • C-states: Various low power states during CPU idle • P-states: Multiple performance states for run-time energy efficiency • Thermal Monitor: Self-managed thermal control based on multiple thermal sensors • Digital Thermal Sensor: For software access to CPU temperature • Intelligent Voltage Regulator control: To optimize VR efficiency during low loads • PECI interface: For side-band access to temperature for fan control Key New Mobile Power/Thermal Features in Penryn • Deep Power Down Technology (DPD): A radically new and innovative idle power management state • Enhanced Dynamic Acceleration Technology (EDAT): A deterministic and (TDP) power-neutral performance boost for Single Threaded apps 10 Deep Power Down Technology Overview Concept: When idle, required CPU state is saved in a power-preserved domain on chip and then power is shut off Exit: Upon a break event, Power/Clocks are turned on, State is restored and execution resumes seamlessly from where it left off transparent to OS / applications • Can occur few 100s of times per second (between keystrokes) Flexible C -States to Select Idle Power Level vs. Responsiveness Active state C0 C1 C3 C4 Deep Power Down technology state State in DPD Vcc: << Retention Core clk: OFF Core voltage* off off off off off flushed flushed off PLL L1 caches L2 cache partial flush Wakeup time* off PLL: OFF New for Penryn off off Core clock active Caches: OFF Key benefits of DPD • Lowest Leakage CPU state you can get to! • Agnostic to min Vcc state retention issues Idle power* * Rough approximation DPD enables reaching lower limit of CPU idle power of 0 W 11 Penryn DPD Implementation Overview VccP 4 Major parts: 8KB SRAMs VccP MICROCODE: • Does state save and restore • Manages core synchronization • Supports Auto-demote policy PM UNIT (“hardware”): • Coordinates cores at hardware level STATE STORAGE: • 8KB (per core) • Powered from I/O Vcc (VccP) • ECC protected STATE DEFINITION: • What to include? • Criteria: “Software seamless” • Inclusions: • All Architectural state • Most micro-architectural state • Exclusions: • Temp registers used by ucode • Some others on a case by case basis • Manages the DPD Power up sequence • Manages entry/exit protocol with platform 12 Deep Power Down Technology Entry/Exit VccP • S/W instruction initiates processor DPD entry Small amout of logic is left powered up by VccP during DPD • CPU does rest of sequencing with chipset/platform Vcc • Protocol with chipset to block snoops due to DMA traffic (no CPU wakeup required) while in DPD state VRM VID control VccP FSB I/F I/O requests Chipset OS: Decides to idle the processors • Exit initiated by an break event in platform (interrupt) through the chipset • CPU sequences through rest of exit sequence – including driving VID to VRM, internal hardware reset, state restore and execution resumption OS: execute MWAIT instruction CPU: shrink L2; save uarch state CPU: signal chipset to enter DPD C/S: stops CPU clk, blk I/O CPU: VRM dn Deep power down state CPU: continue execution to next instr. CPU: restore Arch/uarch state, Expand L2 CPU: Internal RESET, PLL relock etc C/S: signals CPU wakeup, VRM ramp, bclk Interrupt break event 13 Interrupt Rate Sensitivity of Average Power Deeper C-states have non-trivial energy cost for transitions Today, OS makes C-state decisions based on latency, not energy cost • Latency could be ok, but energy cost could be significantly different Too frequent transitions into deep C-states could result in net energy loss • Worse battery life Penryn implements a “Autodemote” policy in the CPU to address this issue Average Power Example of Average Power MM02 Deep Power Down cross over point: ~3-4ms 0.5ms 1ms 2ms 3ms 4ms 5ms 10ms 15ms 20ms 50ms Limit Interrupt Rate with C4 with Deep Power Down technology • Heuristics to detects when energy cost of DPD state entry and exit is potentially > savings in DPD state • Demotes such cases to shallower C-state (C4), although the OS requested DPD • Silicon results prove that this was a good decision! Source: Intel 14 DPD Results (Average Power) Average Power (MM05* - Office Productivity) Penryn with DPD Disabled Average Power Merom (Core (tm) 2 Duo processor) 44% Penryn with DPD Enabled 27% Actual Results measured in Penryn Silicon – Up to 44% reduction in Average Power Source: Intel Corp, June 2007** Leakage Current • 27% to 44% (based on the leakage of the part) reduction on Average Power as measured by Mobile Mark – Office Productivity benchmark due to DPD feature • Significant improvement compared to previous generation (Merom) • Measured Exit latency for DPD state: ~ 150 - 200 us => In expected range **Config: Data from a sample distribution of parts with various leakage measured on tester, under respective product POR Vcc, temperature etc MM05 – BAPCO* Mobile Mark* 2005 benchmark 15 Enhanced Dynamic Acceleration Technology (EDAT) Concept: In multi-core CPUs, use the power headroom of idle core to boost performance of the non-idle core How it works: When one core enters an idle power C-state (CC3 or deeper) AND OS requests a higher performance state on the running core, The non-idle core is boosted up to a higher voltage, higher frequency (EDAT freq) Overall chip Power still remains within the specified Thermal Design Power (TDP) 2 cores active: Guaranteed (marked) frequency Single core active: EDAT freq P = P0 + P1 <= P (TDP Spec) P = P0 + P1 = TDP Spec EDAT Bin Min Vcc TDP Top Freq. Min Vcc CORE 0 CORE 1 TDP Top Freq. Frequency TDP Top Freq. Frequency • • • • Core #2 at C3-DPD Min Vcc Applicability: EDAT is applicable mostly to thermally constrained platforms such as mobile where the guaranteed frequency is not max Vcc limited Leakage CORE 0 CORE 1 EDAT provides a significant and predictable single-threaded performance boost 16 EDAT Implementation Overview Microarchitecture OS P-state Request “P[0]” F/V Max F/V Clipping Logic EDAT F/V Max F/V Hysteresis mechanism Guar. F/V • • • • • EDAT Logic EDAT Disable Core CC-state EDAT Control Logic • Entry based on OS request AND other core idle • Idle core defined as “CC3” or deeper C-state • EDAT Freq pre-programmed in chip based on power, reliability and other constraints • Exit EDAT mode when Idle core wakes up Allows short durations where 2 cores active Reduces perf loss for low activity wakeups Implemented using a few counters Voltage Regulator needs to provide for this Benefits most at high timer tick rates OS interface Hysteresis Mechanism • EDAT Freq exposed as {Guaranteed + epsilon} P-state in ACPI table: P[0] • OS requests P[0] state if perf demand exists • EDAT logic grants it if power headroom exists 17 EDAT Results (Performance) EDAT Performance on SPEC CPU2000 (Estimated) SPECint_base2000 (estimated) Relative Performance 1.08 SPECfp_base2000 (estimated) 1.06 1.04 1.02 1.00 0.98 Actual Results measured in Penryn Silicon – Up to 7% boost in ST perf due to EDAT 0.96 Baseline: EDAT OFF Source: Intel Corp, as of June 2007* EDAT ON; Low Interrupt rate EDAT ON; High Interrupt rate EDAT ON with Hysteresis: High Interrupt rate • 5% (on SPECfp_base2000 (est*)) and 7% (on SPECint_base2000 (est*)) performance gains due to EDAT within the same TDP power envelope!** • High timer tick rates would have reduced the gains by a few % due to frequent wake up of idle core and resulting exits from EDAT mode • Intelligent hysteresis mechanism in Penryn recovers most of that loss **Configuration used for measurement: Matanzas CRB board, Crestline chipset DDR2-667 memory, Penryn A1 silicon @ 2.4 GHz, FSB 800, OS: Windows XP SP2 *Estimated SPEC metrics based on measurements on preproduction platforms. SPEC, SPECint and SPECfp are trademarks of SPEC. For more information on these benchmarks, see www.spec.org 18 New PM feature for Servers: CC3 State Concept: • Snoops burn ~30% of active core power • By avoiding snoops into idle cores, we save power How it works: • Today, idle cores are put into Core C1 (CC1), which is a snoopable state • In Penryn, idle cores can be put into Core C3 (CC3), which is a non-snoopable state. • First level caches are flushed into the L2 cache before putting cores into CC3 • This prevents cross core snoops and therefore the additional power burnt for snoops Core0 in CC3 C1E ACTIVE Cores Core 0 Core 0 L2 Cache (CC1) IDLE Cores (CC1) IDLE Cores (CC3) (CC3) IDLE IDLECores Cores(CC1) (CC3) L2 snoop Core snoop Core 0 L2 Cache L2 Lookup L2 Core snoop snoop Core 0 L2 Cache Core 1 L2 Cache L2 snoop Core 1 Core snoop cross-core snoop Core snoop Core 0 L2 Cache Core Core 1 Core 1 Core snoop snoop Core 1 DRAM access Chipset Memory CC3 is a simple but effective optimization for power savings 19 CC3 Results (Power savings) Power savings due to CC3 in Web Server workloads 2.0% 0.0% -2.0%0.0% 20.0% 40.0% 60.0% 80.0% 100.0% -4.0% -6.0% -8.0% -10.0% -12.0% -14.0% -16.0% % Perf loss % CPU pwr reduction Up to 16% CPU power reduction measured on web server workloads with CC3 feature -18.0% Source: Intel Corp, as of June 2007* % CPU utilization • Up to ~16%** CPU power reduction measured on Web Server workload with negligible performance loss with CC3 feature enabled • Similar power reductions seen on SPEC CPU2000 (estimated by measurement on preproduction silicon) (single user mode), other idle power benchmarks for Server *Configuration used for measurement: Bridgeport2, DP Clovertown 2.33 GHz, 1333 MHz Bus, 4x2GB FB DIMM 667 MHz, OS: Linux 2.6.18.1; Workload: Specweb99; DBS OFF; Measured by mapping CC3 as CC1 in BIOS ** Note that the power savings due to this feature will be system configuration, OS and workload dependent. 20 Summary New generation Intel® Core™ microarchitecture is leading the way with 45 nm process technology New microarchitecture enhancements for higher performance and improved energy efficiency New SSE4 instructions for improved video, imaging, and 3D content performance New Power management features for improved battery life, performance and energy efficiency Moore’s Law is alive and well with Penryn; Innovative features with tangible benefits being introduced for Performance and Power 21 Acknowledgements To the entire Penryn team across Folsom, Santa Clara, Costa Rica and Israel and the Technology Development Group for developing and delivering this product to market 22 Legal Notices and Important Information Regarding the performance and power measurements in this presentation Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/index.htm Intel may make changes to specifications, release dates and product descriptions at any time, without notice. Intel, Pentium and the Intel logo are trademarks or registered trademarks of the Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others 23