Computer Architecture 2010 – Advanced Topics

Pentium® M Processor

Intel's first processor designed for mobility:
– Achieve the best performance at given power and thermal constraints
– Achieve the longest battery life

                Banias               Dothan               Sandy Bridge
Transistors     77M                  140M                 995M / 506M (55M per core)
Process         130nm                90nm                 32nm
Die size        84mm²                85mm²                216mm² (4C+GT2), 131mm² (2C+GT1)
Peak power      24.5W                21W                  17–90W
Frequency       1.7GHz               2.1GHz               2.8–3.8–4.4GHz
L1 cache        32KB I$ + 32KB D$    32KB I$ + 32KB D$    32KB I$ + 32KB D$
L2 cache        1MB                  2MB                  256KB (per core) + 3–8MB L3

src: http://www.anandtech.com

Example: Sandy Bridge

Uses Moore's Law and process improvements for:
– Power/performance
– Integration: reduce communication, reduce latencies (at a cost in complexity)
More performance and efficiency via:
– SpeedStep
– Memory hierarchy
– Multi-core, multi-thread
– Out-of-order execution
– Predictors
– Multi-operand (vector) instructions
– Custom processing

src: http://www.anandtech.com

Performance per Watt

Mobile's smaller form factor decreases the power budget
– Power generates heat, which must be dissipated to keep transistors within the allowed temperature
– This limits the processor's peak power consumption
Change the target
– Old target: get maximum performance
– New target: get maximum performance within a given power envelope
Performance via frequency increase
– Power = CV²f, but increasing f also requires increasing V
– So X% more performance costs roughly 3X% more power (assuming performance is linear with frequency)
A power-efficient feature must do better than a 1:3 performance-to-power ratio
– Otherwise it is better to just increase frequency
– All Banias micro-architecture features (aimed at performance) are power efficient
The sketch below works through this cube law.
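To see where the 1:3 rule of thumb comes from, here is a minimal numerical sketch (not from the slides) assuming dynamic power P = CV²f and that supply voltage must scale roughly linearly with frequency, so power grows roughly as f³:

```c
#include <stdio.h>

int main(void) {
    /* Dynamic power model P = C * V^2 * f, with C normalized to 1.
     * Assumption: V must scale roughly linearly with f, so P ~ f^3
     * and X% more performance costs about 3X% more power. */
    double speedups[] = {1.05, 1.10, 1.20};
    for (int i = 0; i < 3; i++) {
        double f = speedups[i];    /* normalized frequency            */
        double v = f;              /* V ~ f (simplifying assumption)  */
        double power = v * v * f;  /* = f^3                           */
        printf("+%.0f%% frequency -> +%.1f%% power\n",
               (f - 1.0) * 100.0, (power - 1.0) * 100.0);
    }
    return 0;  /* e.g. +10% frequency -> +33.1% power: a 1:3.3 ratio */
}
```

Under this model, any micro-architectural feature that buys performance at better than 1:3 performance-to-power beats simply raising the clock.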
Higher Performance vs. Longer Battery Life

Processor average power is <10% of the platform power
– The processor reduces power in periods of low activity
– The processor enters lower power states in idle periods
Average power includes low-activity periods and idle time
– Typical: 1W–3W
Max power is limited by heat dissipation
– Typical: 20W–100W

[Pie chart: typical platform power breakdown – Display (panel + inverter) 33%, CPU 10%, Intel® MCH 10%, power supply 9%, HDD 8%, GFX 8%, Misc. 8%, CLK 5%, ICH 3%, Intel® LAN 2%, Fan 2%, DVD 2%]

Decision
– Optimize for performance when active
– Optimize for battery life when idle

src: http://www.anandtech.com

Higher Performance vs. Longer Battery Life (cont.)

High dynamic range
– Long periods of idle with peaks of activity
– Minimize power when idle
– Adequate performance when active
– Quick transitions
Max power limited by heat dissipation
– Typical: 3W (cell phone), 6W (tablet), 15W (small PC), 60W (mainstream PC), 150W+ (desktop)
– How can one design fit all?
Decision
– Optimize for user experience when active (adequate performance)
– Optimize for battery life when idle

src: http://www.anandtech.com

Static Power

The power consumed by a processor consists of
– Active power: used to switch transistors
– Static power: leakage of transistors under voltage
Static power is a function of
– The number of transistors and their type
– Operating voltage
– Die temperature
Leakage grows dramatically in new process technologies
The Pentium® M reduces static power consumption
– The L2 cache is built with low-leakage transistors (2/3 of the die's transistors)
  Low-leakage transistors are slower, increasing cache access latency
  The significant power saved justifies the small performance loss
– Enhanced SpeedStep® technology reduces voltage and temperature during low processor activity

Less is More

Fewer instructions per task
– Advanced branch prediction reduces the number of wrong-path instructions executed
– SSE instructions reduce the instruction count architecturally
Fewer uops per instruction
– Uop fusion
– Dedicated stack engine
Fewer transistor switches per uop
– Efficient bus
– Various lower-level optimizations
Less energy per transistor switch
– Enhanced SpeedStep® technology
Power-awareness from top to bottom

Improved Branch Predictor

The Pentium® M employs best-in-class branch prediction
– Bimodal predictor, global predictor, loop detector
– Indirect branch predictor
Reduces the number of wrong-path instructions executed
– Saves the energy spent executing wrong-path instructions
Loop predictor
– Analyzes branches for loop behavior:
  moving in one direction (taken or not-taken) a fixed number of times,
  ended by a single movement in the opposite direction
– Detects the exact loop count
– The loop is then predicted accurately

[Diagram: loop predictor – per-branch count and limit registers feeding the prediction logic]

Indirect Branch Predictor

Indirect jumps are widely used in object-oriented code (C++, Java)
Targets are data dependent
– Resolved only at execution, so the misprediction penalty is high
Initially, an indirect branch is allocated only in the target array (TA)
– If the TA mispredicts, allocate an entry in the iTA, indexed by global history
  Multiple targets can thus be allocated for a given branch
– Indirect branches with a single target are predicted by the TA alone, saving iTA space
Use the iTA only if the TA indicates an indirect branch and the iTA hits

[Diagram: the branch IP indexes the Target Array; on a TA hit that marks an indirect branch, the IP combined with the global history indexes the iTA; on an iTA hit, its target becomes the predicted target]

The two sketches below illustrate the loop detector and the two-level indirect scheme.
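First, a toy illustration of the loop-detector idea (my own sketch, not Intel's implementation): train on a loop-closing branch until its trip count is known, then predict taken exactly `limit` times followed by one not-taken:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy loop predictor for a single branch (illustrative only).
 * Learns the trip count of a loop-closing branch, then predicts
 * "taken" exactly `limit` times followed by one "not taken". */
typedef struct {
    int  count;    /* taken streak seen so far this traversal */
    int  limit;    /* learned trip count                      */
    bool trained;
} loop_pred;

static bool predict(const loop_pred *p) {
    if (!p->trained)
        return true;               /* default: predict taken         */
    return p->count < p->limit;    /* taken until the learned limit  */
}

static void update(loop_pred *p, bool taken) {
    if (taken) {
        p->count++;
    } else {
        if (!p->trained) { p->limit = p->count; p->trained = true; }
        p->count = 0;              /* loop exited: restart the streak */
    }
}

int main(void) {
    loop_pred p = {0, 0, false};
    int correct = 0, total = 0;
    for (int run = 0; run < 3; run++)     /* execute the loop 3 times */
        for (int i = 0; i <= 7; i++) {    /* branch taken 7x, then NT */
            bool actual = (i < 7);
            correct += (predict(&p) == actual);
            total++;
            update(&p, actual);
        }
    printf("%d/%d correct\n", correct, total);
    return 0;  /* 23/24: perfect once the first traversal trains it */
}
```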
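And a toy version of the two-level indirect scheme, with invented sizes and a simplistic hash: a last-target table stands in for the TA, and a history-indexed table stands in for the iTA, allocated only on a misprediction:

```c
#include <stdint.h>
#include <stdio.h>

#define ITA_SIZE 256

/* Toy two-level indirect predictor (sizes and hash are invented). */
typedef struct {
    uint64_t ta_target;            /* last-target "TA" entry        */
    uint64_t ita_target[ITA_SIZE]; /* history-indexed "iTA" entries */
    uint8_t  ita_valid[ITA_SIZE];
    uint32_t ghist;                /* global branch history         */
} ind_pred;

static uint32_t ita_index(uint64_t ip, uint32_t ghist) {
    return (uint32_t)(ip ^ ghist) % ITA_SIZE;  /* simplistic hash   */
}

static uint64_t predict(const ind_pred *p, uint64_t ip) {
    uint32_t i = ita_index(ip, p->ghist);
    /* Use the iTA only if it hits; otherwise fall back to the TA. */
    return p->ita_valid[i] ? p->ita_target[i] : p->ta_target;
}

static void update(ind_pred *p, uint64_t ip, uint64_t pred, uint64_t actual) {
    if (pred != actual) {
        /* Misprediction: allocate the correct target in the iTA
         * under the current global history. */
        uint32_t i = ita_index(ip, p->ghist);
        p->ita_target[i] = actual;
        p->ita_valid[i]  = 1;
    }
    p->ta_target = actual;                          /* TA = last target */
    p->ghist = (p->ghist << 1) ^ (uint32_t)actual;  /* fold into history */
}

int main(void) {
    ind_pred p = {0};
    uint64_t targets[] = {0x400101, 0x400202};  /* two callee addresses */
    int correct = 0;
    for (int n = 0; n < 100; n++) {             /* target alternates    */
        uint64_t actual = targets[n & 1];
        uint64_t guess  = predict(&p, 0x400000);
        correct += (guess == actual);
        update(&p, 0x400000, guess, actual);
    }
    printf("%d/100 correct\n", correct);
    return 0;  /* 98/100: after two cold misses, history disambiguates */
}
```

A branch that always jumps to one place never pollutes the iTA, which is exactly the space-saving property the slide describes.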
Dedicated Stack Engine

PUSH, POP, CALL, and RET update ESP (add or subtract an offset)
– Conventionally this requires a dedicated add uop
Track the ESP offset at the front end instead
– The instruction decoder maintains the offset in ESP_delta (+/- operand size)
– This eliminates the uops that update ESP
– The displacements of stack operations are patched accordingly
In some cases the actual ESP value is needed
– For example: add eax, esp, 3
– A sync uop is inserted before such an instruction if ESP_delta != 0:
  ESP = ESP + ESP_delta, and ESP_delta is reset to 0
ESP_delta is recovered on a jump misprediction

ESP Tracking Example

            conventional uops        with stack engine         ESP_delta
PUSH eax    ESP = ESP - 4            STORE [ESP-4], EAX        Δ = Δ - 4 = -4
            STORE [ESP], EAX
PUSH ebx    ESP = ESP - 4            STORE [ESP-8], EBX        Δ = Δ - 4 = -8
            STORE [ESP], EBX
INC eax     EAX = ADD EAX, 1         EAX = ADD EAX, 1          Δ = -8
INC esp     ESP = ADD ESP, 1         Sync! ESP = SUB ESP, 8    Δ = 0
                                     ESP = ADD ESP, 1

The sketch below simulates this bookkeeping.
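A compact simulation (illustrative only; `esp_delta` mirrors the slide's ESP_delta, the rest of the names are invented) of this front-end bookkeeping: PUSH folds its ESP update into the delta, and a sync uop is emitted only when an instruction needs the architectural ESP while the delta is nonzero:

```c
#include <stdio.h>

/* Toy model of the dedicated stack engine's front-end bookkeeping.
 * esp_delta accumulates PUSH offsets; a sync uop that applies the
 * delta to the real ESP is emitted only when an instruction needs
 * the architectural ESP value while the delta is nonzero. */
static int esp_delta = 0;

static void emit(const char *uop) { printf("  uop: %s\n", uop); }

static void sync_if_needed(void) {
    if (esp_delta != 0) {
        printf("  uop: sync  ESP = ESP + (%d)\n", esp_delta);
        esp_delta = 0;
    }
}

static void push(const char *reg) {
    esp_delta -= 4;  /* ESP update folded into the delta: no add uop */
    printf("  uop: STORE [ESP%+d], %s\n", esp_delta, reg);
}

static void uses_esp(const char *text) {
    sync_if_needed();             /* architectural ESP is required   */
    emit(text);
}

int main(void) {
    printf("PUSH eax\n"); push("EAX");
    printf("PUSH ebx\n"); push("EBX");
    printf("INC eax\n");  emit("EAX = ADD EAX, 1");  /* no sync needed */
    printf("INC esp\n");  uses_esp("ESP = ADD ESP, 1");
    return 0;
}
```

Running it reproduces the right-hand uop column of the table above, including the single sync uop before INC esp.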
Uop Fusion

The instruction decoder breaks an instruction into uops
– A conventional uop consists of a single operation on two sources
An instruction requires multiple uops when
– it operates on more than two sources, or
– the nature of the operation requires a sequence of operations
Uop fusion: in some cases the decoder fuses 2 uops into one uop
– A short field is added to the uop to support fusing of specific uop pairs
Uop fusion reduces the number of uops by 10%
– Increases performance by effectively widening the rename and retire bandwidth
– More instructions can be decoded by all decoders
The same task is accomplished by processing fewer uops
– Decreases the energy required to complete a given task

A 2-uop Load-Op

add eax, [ebp+4*esi+8]   ; a load-op with 3 register operands
Decoded into 2 uops:
– LD: tmp = load [ebp+4*esi+8]   (read data from memory)
– OP: eax = eax + tmp            (reg ← reg op data)
The LD and OP are inherently serial
– The OP is dispatched to the ALU only when the LD (in the MEU) completes

A 1-uop Load-Op

add eax, [ebp+4*esi+8]
Decoded into 1 fused uop: eax = eax + load [ebp+4*esi+8]
– The fused uop has a 3rd source; a new field in the uop holds the index register
– Increases decode bandwidth
– Increases allocation bandwidth and the effective ROB/RS size
Dispatched twice:
– The OP is dispatched after the LD completes
– The fused uop retires after both the LD and the OP complete
– Increases retire bandwidth

Enhanced SpeedStep™ Technology

The "basic" SpeedStep™ technology had
– 2 operating points
– A non-transparent switch
The "enhanced" version provides
– Multiple voltage/frequency operating points
– A transparent switch
– Frequent switches
The Pentium M 1.6GHz operating range:
– From 600MHz @ 0.956V to 1.6GHz @ 1.484V

[Chart: frequency (GHz) and typical power (W) vs. voltage (V) across the operating range – a 2.7X frequency range spans a 6.1X power range; efficiency ratio ≈ 2.3]

Benefits
– Higher power efficiency: at 2.7X lower frequency, a ~2X performance loss buys a >2X energy gain
– Outstanding battery life
– Excellent thermal management

The sketch below checks these ratios.
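A back-of-the-envelope check of these ratios from the two quoted operating points, again using the P = CV²f model (the chart's 6.1X is a measured figure; this simple model gives a slightly higher ~6.4X):

```c
#include <stdio.h>

/* Check the Enhanced SpeedStep ratios from the two quoted operating
 * points of the 1.6GHz Pentium M, using P = C*V^2*f (C normalized). */
int main(void) {
    double f_hi = 1.6, v_hi = 1.484;   /* high operating point */
    double f_lo = 0.6, v_lo = 0.956;   /* low operating point  */

    double p_hi = v_hi * v_hi * f_hi;
    double p_lo = v_lo * v_lo * f_lo;

    printf("frequency ratio: %.1fx\n", f_hi / f_lo);   /* ~2.7x */
    printf("power ratio:     %.1fx\n", p_hi / p_lo);   /* ~6.4x */

    /* Energy per unit of work scales as C*V^2: how much less energy
     * each operation costs at the low point. */
    printf("energy/op gain:  %.1fx\n",
           (v_hi * v_hi) / (v_lo * v_lo));             /* ~2.4x, i.e. >2x */
    return 0;
}
```

This is the slide's point in numbers: dropping frequency 2.7X costs roughly 2X in performance but more than 2X in energy per operation, so running slower (when performance is not needed) is a net battery win.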
Trace Cache (Pentium® 4 Processor)

Decoding several IA-32 instructions per clock at high frequency is difficult
– Instructions have variable length and many different options
– Decoding takes several pipe stages, adding to the branch misprediction penalty
Trace cache: cache the uops of previously decoded instructions
– Decoding is only needed for instructions that miss the TC
The TC is the primary (L1) instruction cache
– Holds 12K uops
– 8-way set associative with LRU replacement
The TC has its own branch predictor (trace BTB)
– Predicts branches that hit in the TC
– Directs where instruction fetching needs to go next in the TC

Traces

An instruction cache's fetch bandwidth is limited to a basic block
– It cannot provide instructions across a taken branch in the same cycle
  (a jump into a line wastes the line's head; a jump out of it wastes the tail)
The TC builds traces: program-ordered sequences of uops
– Allows the target of a branch to be included in the same TC line as the branch itself
Traces have variable length
– Broken into trace lines, six uops per trace line
– There can be many trace lines in a single trace

Hyper-Threading Technology (Pentium® 4 Processor)

Based on "Hyper-Threading Technology Architecture and Micro-architecture", Intel Technology Journal.

Thread-Level Parallelism

Multiprocessor systems have been used for many years
– There are well-known techniques to exploit multiprocessors
Software trends
– Applications consist of multiple threads or processes that can be executed in parallel on multiple processors
Thread-level parallelism (TLP): threads can come from
– the same application
– different applications running simultaneously
– operating system services
Increasing single-thread performance becomes harder
– and is less and less power efficient
Chip multiprocessing (CMP)
– Two (or more) processors are put on a single die

Multi-Threading

Multi-threading: a single processor executes multiple threads
Time-slice multithreading
– The processor switches between software threads after a fixed period
– Can effectively minimize the effects of long latencies to memory
Switch-on-event multithreading
– Switch threads on long-latency events such as cache misses
– Works well for server applications that have many cache misses
A deficiency of both time-slice MT and switch-on-event MT
– They do not cover for branch mispredictions and long dependency chains
Simultaneous multithreading (SMT)
– Multiple threads execute on a single processor simultaneously, without switching
– Makes the most effective use of processor resources:
  maximizes performance vs. transistor count and power

Hyper-Threading (HT) Technology

HT is SMT
– Makes a single processor appear as 2 logical processors (threads)
Each thread keeps its own architectural state
– General-purpose registers
– Control and machine state registers
Each thread has its own interrupt controller
– Interrupts sent to a specific logical processor are handled only by it
The OS views logical processors (threads) as physical processors
– It schedules threads to logical processors as in a multiprocessor system
From a micro-architecture perspective
– Threads share a single set of physical resources:
  caches, execution units, branch predictors, control logic, and buses

Two Important Goals

When one thread is stalled, the other thread can continue to make progress
– Independent progress is ensured by either:
  partitioning buffering queues and limiting the number of entries each thread can use, or
  duplicating buffering queues
A single active thread running on a processor with HT runs at the same speed as without HT
– Partitioned resources are recombined when only one thread is active

Front End

Each thread manages its own next-instruction pointer
Threads arbitrate for TC access every cycle (ping-pong)
– If both want to access the TC, access is granted in alternating cycles
– If one thread is stalled, the other thread gets the full TC bandwidth
TC entries are tagged with a thread ID
– Dynamically allocated as needed
– Allows one logical processor to have more entries than the other

Front End (cont.)

Branch prediction structures are either duplicated or shared
– The return stack buffer is duplicated
– Global history is tracked per thread
– The large global history array is shared; entries are tagged with a logical processor ID
Each thread has its own ITLB
Both threads share the same decoder logic
– If only one thread needs the decode logic, it gets the full decode bandwidth
– The state needed by the decoders is duplicated
The uop queue is hard partitioned
– Allows both logical processors to make independent forward progress
  regardless of front-end stalls (e.g., a TC miss) or execution stalls

Out-of-order Execution

The ROB and MOB are hard partitioned
– Enforces fairness and prevents deadlocks
The allocator ping-pongs between the threads
– A thread is selected for allocation if:
  its uop queue is not empty,
  its buffers (ROB, RS) are not full, and
  it is the thread's turn, or the other thread cannot be selected
  (the sketch below spells out this selection rule)
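The selection rule is simple enough to state in code. This is an illustrative sketch with invented names, not the actual allocator logic; the same ping-pong pattern also describes the TC fetch arbitration above:

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative ping-pong arbiter for two threads, as used conceptually
 * by the allocator: pick a thread only if it has uops and buffer room,
 * alternating turns unless the other thread cannot be selected. */
typedef struct {
    bool queue_nonempty;  /* uop queue has work       */
    bool buffers_free;    /* ROB/RS entries available */
} thread_state;

static bool eligible(const thread_state *t) {
    return t->queue_nonempty && t->buffers_free;
}

/* Returns 0 or 1 (thread to allocate for), or -1 if neither is
 * eligible. `turn` flips on an in-turn grant; a stalled thread keeps
 * its turn so it regains priority when it wakes (a design choice). */
static int arbitrate(const thread_state ts[2], int *turn) {
    int first = *turn, second = 1 - *turn;
    if (eligible(&ts[first]))  { *turn = second; return first; }
    if (eligible(&ts[second])) { return second; }
    return -1;
}

int main(void) {
    thread_state ts[2] = {{true, true}, {true, true}};
    int turn = 0;
    for (int cycle = 0; cycle < 4; cycle++) {
        if (cycle == 2) ts[1].buffers_free = false;  /* thread 1 stalls */
        int who = arbitrate(ts, &turn);
        printf("cycle %d: allocate for thread %d\n", cycle, who);
    }
    return 0;  /* cycles 0,1 alternate; cycles 2,3 both go to thread 0 */
}
```

When thread 1 stalls, thread 0 absorbs the full allocation bandwidth, which is exactly the "one thread stalled, the other makes progress" goal.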
Out-of-order Execution (cont.)

Registers are renamed to a shared physical register pool
– Physical registers store results until retirement
After allocation and renaming, uops are placed in one of 2 queues
– A memory instruction queue and a general instruction queue
The two queues are hard partitioned
– Uops are read from the queues and sent to the schedulers ping-pong style
The schedulers are oblivious to threads
– They schedule uops based on dependencies and execution resource availability,
  regardless of which thread the uops belong to
– Uops from the two threads can be dispatched in the same cycle
– To avoid deadlock and ensure fairness, the number of active entries a thread
  can have in each scheduler's queue is limited
The forwarding logic compares physical register numbers
– Results are forwarded to other uops without any thread knowledge

Out-of-order Execution (cont.)

Memory is largely thread-oblivious
– The L1 data cache, L2 cache, and L3 cache are thread oblivious
  All use physical addresses
– The DTLB is shared
  Each DTLB entry includes a thread ID as part of the tag
Retirement ping-pongs between threads
– If one thread is not ready to retire uops, all retirement bandwidth is
  dedicated to the other thread

Single-task and Multi-task Modes

MT mode (multi-task mode)
– Two active threads, with some resources partitioned as described earlier
ST mode (single-task mode)
– Two flavors: single-task thread 0 (ST0) and single-task thread 1 (ST1),
  in which only the corresponding thread is active
– Resources that were partitioned in MT mode are recombined to give the single
  active logical processor use of all of the resources
Moving the processor between modes
– MT → ST0 when thread 1 executes HALT; MT → ST1 when thread 0 executes HALT
– ST0 → low power when thread 0 executes HALT; ST1 → low power when thread 1 executes HALT
– An interrupt transitions back from low power / ST modes toward MT

Operating System and Applications

An HT processor appears to the OS and application software as 2 processors
– The OS manages logical processors as it does physical processors
The OS should implement two optimizations:
Use HALT if only one logical processor is active
– Allows the processor to transition to either the ST0 or ST1 mode
– Otherwise the OS would execute, on the idle logical processor, a sequence of
  instructions that repeatedly checks for work to do
– This so-called "idle loop" can consume significant execution resources that
  could otherwise be used by the other active logical processor
  (a sketch contrasting the two follows below)
On a multiprocessor system:
– Schedule threads to logical processors on different physical processors before
  scheduling multiple threads onto the same physical processor
– Allows software threads to use different physical resources when possible
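To make the HALT optimization concrete, here is a schematic contrast in C (illustrative only: `work_available` is a stub so the sketch runs standalone, and the privileged HLT instruction is shown as a comment):

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical scheduler query (stubbed so the sketch compiles and
 * runs standalone: "work" appears after a few polls). */
static int poll_count = 0;
static bool work_available(void) { return ++poll_count > 3; }

/* Naive idle loop: spins, consuming fetch/execute resources that the
 * other logical processor on the same physical core could have used. */
void idle_spin(void) {
    while (!work_available())
        ;  /* burns shared HT resources while doing nothing */
    printf("spin: found work after %d polls\n", poll_count);
}

/* HALT-based idle: a real OS executes the privileged HLT instruction,
 * letting the core drop to ST0/ST1 (or low-power) mode and recombine
 * the partitioned resources until an interrupt arrives. */
void idle_halt(void) {
    while (!work_available()) {
        /* __asm__ volatile ("hlt");  -- privileged; wakes on interrupt */
    }
    printf("halt: slept until work arrived\n");
}

int main(void) {
    idle_spin();
    poll_count = 0;
    idle_halt();
    return 0;
}
```

The behavioral difference does not show up in this user-mode stub; the point is architectural: HALT frees the shared pipeline for the active sibling thread, while the spin loop competes with it every cycle.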