ECE4100/6100
P6 & NetBurst Microarchitecture
1
ECE4100/6100
Why study P6 from last millennium?
A paradigm shift from the Pentium microarchitecture
A RISC core disguised as a CISC
Huge market success (and stock price)
Architected by former VLIW and RISC folks
Multiflow (a pioneer of VLIW architecture for superminicomputers)
Intel i960 (Intel’s RISC for graphics and embedded controllers)
NetBurst (P4’s microarchitecture) is based on P6
2
ECE4100/6100
P6 Basics
 One implementation of IA32 architecture
 Super-pipelined processor
 3-way superscalar
 In-order front-end and back-end
 Dynamic execution engine (restricted dataflow)
 Speculative execution
 P6 microarchitecture family processors include
Pentium Pro
Pentium II (PPro + MMX + 2x caches—16KB I/16KB D)
Pentium III (P-II + SSE + enhanced MMX, e.g. PSAD)
Celeron (without MP support)
Later P-II/P-III/Celeron all have on-die L2 cache
3
ECE4100/6100
x86 Platform Architecture
[Block diagram: x86 platform. The host processor (P6 core with L1 cache, SRAM) has a back-side bus to the L2 cache (SRAM, on-die or on-package) and a front-side bus to the chipset. The MCH connects to system memory (DRAM) and, over AGP, to the graphics processor (GPU) with its local frame buffer; the ICH handles PCI, USB, and other I/O.]
4
ECE4100/6100
Pentium III Die Map
5
EBL/BBL – External/Backside Bus logic
MOB - Memory Order Buffer
Packed FPU - Floating Point Unit for SSE
IEU - Integer Execution Unit
FAU - Floating Point Arithmetic Unit
MIU - Memory Interface Unit
DCU - Data Cache Unit (L1)
PMH - Page Miss Handler
DTLB - Data TLB
BAC - Branch Address Calculator
RAT - Register Alias Table
SIMD - Packed Floating Point unit
RS - Reservation Station
BTB - Branch Target Buffer
TAP – Test Access Port
IFU - Instruction Fetch Unit and L1 I-Cache
ID - Instruction Decode
ROB - Reorder Buffer
MS - Micro-instruction Sequencer
ECE4100/6100
ISA Enhancement (on top of Pentium)
CMOVcc / FCMOVcc r, r/m
Conditional moves (predicated move) instructions
Based on condition codes (cc); see the sketch after this slide
FCOMI/FCOMIP: compare FP stack registers and set the integer EFLAGS
RDPMC/RDTSC instructions
Uncacheable Speculative Write-Combining (USWC) —
weakly ordered memory type for graphics memory
MMX in Pentium II
SIMD integer operations
SSE in Pentium III
Prefetches (non-temporal nta + temporal t0, t1, t2), sfence
SIMD single-precision FP operations
6
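As a side note (not from the slides), the main value of CMOVcc is removing hard-to-predict branches; a compiler targeting P6 or later will typically lower a simple conditional assignment to a compare plus CMOV. A minimal C sketch (function name is illustrative):

    /* Compile with e.g. gcc -O2 -march=i686 and inspect the assembly:
     * the ternary usually becomes cmp + cmov instead of a branch. */
    int select_max(int a, int b)
    {
        return (a > b) ? a : b;   /* branch-free when lowered to CMOVcc */
    }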
ECE4100/6100
P6 Pipelining
[Pipeline diagram. In-order front end (stages 11-17, 20-22): Next IP, I-cache fetch (IFU1-IFU3), instruction length decode (ILD) and rotate, Dec1, Dec2/branch decode, IDQ, RAT, RS write; the front-end in-order boundary is followed by an RS scheduling delay. Out-of-order core: RS dispatch (stages 31-33), AGU, and execution/writeback for single-cycle and multi-cycle instruction pipelines (Exec2 ... Exec n); memory µops go through MOB write, MOB block, MOB wakeup, and MOB dispatch, then DCache1/DCache2, in blocking and non-blocking memory pipelines (stages 40-43), with writeback at stages 81-83 (81: Mem/FP WB, 82: Int WB, 83: Data WB) after a ROB scheduling delay. Retirement in-order boundary (stages 91-93): retirement pointer write, ROB read for retirement, RRF write (RET1/RET2).]
7
ECE4100/6100
P6 Microarchitecture
[Block diagram: P6 microarchitecture. Bus cluster: bus interface unit to the external bus (chip boundary). Instruction fetch cluster: instruction fetch unit and BTB/BAC (control flow). Issue cluster: two instruction decoders, microcode sequencer, register alias table, allocator. Out-of-order cluster ((restricted) data flow): reservation station, ROB & retirement RF. Memory cluster: AGU, MIU, data cache unit (L1), memory order buffer. Execution units: two IEU/JEU, FEU, MMX.]
8
ECE4100/6100
Instruction Fetching Unit
[Block diagram: instruction fetching unit. The Next PC mux selects a linear fetch address (BTB prediction, other fetch requests); the instruction cache, backed by a victim cache, a streaming buffer, and the instruction TLB (physical address), supplies data to the ILD, which generates length and prediction marks; the instruction rotator aligns bytes into the instruction buffer according to the number of bytes consumed by ID; the Branch Target Buffer is consulted in parallel.]
 IFU1: Initiate the fetch, requesting 16 bytes at a time
 IFU2: The instruction length decoder marks instruction boundaries; the BTB makes its prediction
 IFU3: Align instructions to the 3 decoders in 4-1-1 format
9
ECE4100/6100
Dynamic Branch Prediction
[Diagram: each entry of the 512-entry, 4-way (W0-W3) BTB keeps a 4-bit branch history register (BHR), speculatively updated with the new (spec) history; the BHR selects one of 16 2-bit saturating counters in the entry's pattern history table (PHT), which provides the prediction and is later updated with the branch result (Rc).]
 Similar to a 2-level PAs design (sketched after this slide)
 Associated with each BTB entry
 With a 16-entry Return Stack Buffer
 4 branch predictions per cycle (due to the 16-byte fetch per cycle)
 Static prediction is provided by the Branch Address Calculator when the BTB misses (see the next slide)
10
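A minimal C sketch of the per-entry two-level scheme described above (the 4-bit BHR and 16 two-bit counters follow the slide; everything else, including names, is an assumption):

    #include <stdint.h>

    typedef struct {
        uint8_t bhr;        /* 4-bit branch history register (speculatively updated) */
        uint8_t pht[16];    /* per-BTB-entry PHT of 2-bit saturating counters         */
    } btb_entry_t;

    /* Prediction is the MSB of the counter selected by the history. */
    static int predict(const btb_entry_t *e)
    {
        return (e->pht[e->bhr & 0xF] >> 1) & 1;
    }

    /* Speculative update: shift the predicted direction into the history. */
    static void spec_update(btb_entry_t *e, int taken)
    {
        e->bhr = ((e->bhr << 1) | (taken & 1)) & 0xF;
    }

    /* On resolution (Rc), train the counter that made the prediction. */
    static void resolve(btb_entry_t *e, uint8_t old_bhr, int taken)
    {
        uint8_t *c = &e->pht[old_bhr & 0xF];
        if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
    }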
ECE4100/6100
Static Branch Prediction
[Flowchart: static prediction applies when the BTB misses; on a BTB hit, the BTB's decision is used. For non-PC-relative branches, returns and indirect jumps are predicted taken. For PC-relative branches, unconditional branches are predicted taken; conditional branches are predicted taken if they point backwards and not taken if they point forwards. (A sketch of this heuristic follows.)]
11
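A small C sketch of the BTB-miss static heuristic in the flowchart above (the branch classification fields are assumptions):

    typedef enum { BR_COND, BR_UNCOND, BR_RETURN, BR_INDIRECT } br_kind_t;

    typedef struct {
        br_kind_t kind;
        int pc_relative;      /* 1 if the target is PC-relative            */
        int backward;         /* 1 if a PC-relative target points backward */
        int btb_hit;
        int btb_prediction;   /* valid only when btb_hit is set            */
    } branch_t;

    static int static_predict_taken(const branch_t *b)
    {
        if (b->btb_hit)           return b->btb_prediction; /* BTB's decision   */
        if (!b->pc_relative)      return 1;                 /* return, indirect */
        if (b->kind == BR_UNCOND) return 1;                 /* unconditional    */
        return b->backward;       /* conditional: backward taken, forward not   */
    }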
ECE4100/6100
X86 Instruction Decode
[Diagram: the 4-1-1 decoder. IFU3 feeds three parallel decoders: one complex decoder producing 1-4 µops, backed by the microinstruction sequencer (MS), and two simple decoders producing 1 µop each; decoded µops enter the 6-µop instruction decoder queue.]
 Decode rate depends on instruction alignment (grouping rule sketched after this slide)
 DEC1: translate x86 instructions into micro-operations (µops)
 DEC2: move decoded µops to the ID queue
 MS performs translations in either of two ways:
 Generate the entire µop sequence from the microcode ROM
 Receive the first 4 µops from the complex decoder and the rest from the microcode ROM
Next 3 inst    #Insts decoded
S,S,S          3
S,S,C          First 2
S,C,S          First 1
S,C,C          First 1
C,S,S          3
C,S,C          First 2
C,C,S          First 1
C,C,C          First 1
(S: Simple, C: Complex)
12
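The table reduces to a simple rule: only the first decoder accepts complex instructions, so a decode group ends at the first complex instruction that is not in slot 0. A sketch (function name is an assumption):

    /* Returns how many of the next 3 instructions decode this cycle,
     * matching the S/C table above. */
    static int insts_decoded_this_cycle(const int is_complex[3])
    {
        int n = 1;                        /* slot 0 takes simple or complex */
        while (n < 3 && !is_complex[n])   /* slots 1 and 2 take simple only */
            n++;
        return n;
    }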
ECE4100/6100
Allocator
The interface between in-order and out-of-order
pipelines
Allocates (sketched below):
“3-or-none” µops per cycle into the RS and ROB
“all-or-none” into the MOB (LB and SB)
Generates physical destinations (Pdst) from the ROB
and passes them to the Register Alias Table (RAT)
Stalls upon shortage of resources
13
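A minimal sketch of the allocation check implied above ("3-or-none" into RS/ROB, "all-or-none" into the MOB); the counters and µop flags are assumptions:

    typedef struct { int is_load, is_store; } uop_t;

    static int can_allocate(const uop_t u[3],
                            int rs_free, int rob_free,
                            int lb_free, int sb_free)
    {
        int need_lb = 0, need_sb = 0;
        for (int i = 0; i < 3; i++) {
            need_lb += u[i].is_load;
            need_sb += u[i].is_store;   /* STA + STD share one SB entry */
        }
        /* Either the whole group of 3 gets RS/ROB entries and every memory
         * uop gets its MOB entry, or the allocator stalls. */
        return rs_free >= 3 && rob_free >= 3 &&
               lb_free >= need_lb && sb_free >= need_sb;
    }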
ECE4100/6100
Register Alias Table (RAT)
[Diagram: the RAT. Logical sources from the in-order queue index the integer RAT array and the FP RAT array (with FP TOS adjust); each entry holds an RRF bit and a physical source pointer (PSrc), and integer/FP overrides are applied before the physical sources are passed on. The allocator supplies the physical ROB pointers for new destinations. A renaming example shows EAX, EBX, ECX, and EDX entries with their RRF bits and PSrc values, e.g. EAX mapped to ROB entry 25 and ECX to ROB entry 15; entries with the RRF bit set read from the RRF instead of the ROB.]
 Register renaming for the 8 integer registers, 8 floating-point (stack) registers, and flags: 3 µops per cycle (sketched below)
 40 80-bit physical registers embedded in the ROB (hence 6 bits to specify a PSrc)
 The RAT looks up physical ROB locations for renamed sources based on the RRF bit
14
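A sketch of a RAT lookup/update for one µop (sizes follow the slide, i.e. a 40-entry ROB and 6-bit PSrc, but the field and function names are assumptions):

    #include <stdint.h>

    typedef struct {
        uint8_t in_rrf;   /* 1: value already lives in the retirement RF */
        uint8_t psrc;     /* 6-bit ROB index when in_rrf == 0            */
    } rat_entry_t;

    static rat_entry_t rat[8];     /* EAX .. EDI */

    /* Read the current mappings for the sources, then point the destination
     * at the ROB entry (pdst) the allocator just handed out. */
    static void rename_uop(int src1, int src2, int dst, uint8_t pdst,
                           rat_entry_t *out_src1, rat_entry_t *out_src2)
    {
        *out_src1 = rat[src1];
        *out_src2 = rat[src2];
        rat[dst].in_rrf = 0;
        rat[dst].psrc   = pdst;    /* later readers see the ROB entry */
    }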
ECE4100/6100
Partial Register Width Renaming
[Diagram: partial register width renaming. The integer RAT array is split into an INT low bank (32-bit/16-bit/L accesses, 8 entries) and an INT high bank (H accesses, 4 entries); each entry carries a 2-bit size field, an RRF bit, and a 6-bit PSrc. As on the previous slide, logical sources from the in-order queue are looked up (with FP TOS adjust and int/FP overrides) and physical ROB pointers come from the allocator. An example sequence µop0-µop3 (MOV AL, MOV AH, ADD AL, ADD AH; cases (a)-(d)) illustrates the bank accesses.]
 32/16-bit accesses:
 Read from the low bank
 Write to both banks
 8-bit RAT accesses: go to the low or high bank, depending on which half (L or H) is being written
15
ECE4100/6100
Partial Stalls due to RAT
[Diagram: a write to AX followed by a read of EAX.]
Partial register stalls:
  MOVB AL, m8
  ADD  EAX, m32   ; stall
Idiom Fix (1):
  XOR  EAX, EAX
  MOVB AL, m8
  ADD  EAX, m32   ; no stall
Idiom Fix (2):
  SUB  EAX, EAX
  MOVB AL, m8
  ADD  EAX, m32   ; no stall
Partial flag stalls (1):
  CMP  EAX, EBX
  INC  ECX
  JBE  XX         ; stall
Partial flag stalls (2):
  TEST EBX, EBX
  LAHF            ; stall
 JBE reads both ZF and CF, while INC affects (ZF, OF, SF, AF, PF) but not CF
 LAHF loads the low byte of EFLAGS
 Partial register stalls: occur when a write to a smaller (e.g. 8/16-bit) register is followed by a read of a larger (e.g. 32-bit) register
 Partial flag stalls: occur when a subsequent instruction reads more flags than a prior unretired instruction touched
16
ECE4100/6100
Reservation Stations
[Diagram: the RS dispatches to five execution ports and collects results on writeback buses. Port 0 (WB bus 0): IEU0, FAdd, FMul, IMul, Div, PFMul. Port 1 (WB bus 1): IEU1, JEU, PFAdd, PFShuf. Port 2: AGU0 load address (LDA), to the MOB. Port 3: AGU1 store address (STA). Port 4: store data (STD) to the DCU. Loaded data and retired RRF data return on the writeback buses; results also go to the ROB.]
 Gateway to execution: binds up to 5 µops per cycle, at most one to each port
 20-entry µop buffer bridging the in-order and out-of-order engines
 RS fields include the µop opcode, data-valid bits, Pdst, Psrc, source data, branch prediction info, etc.
 Oldest-first FIFO scheduling when multiple µops are ready in the same cycle (sketched below)
17
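A sketch of the oldest-first ("FIFO") selection among ready µops bound to a port (the entry layout is an assumption):

    #define RS_ENTRIES 20

    typedef struct {
        int valid, ready;   /* occupied / all sources data-valid  */
        int age;            /* allocation order: smaller == older */
        int port;           /* bound execution port, 0..4         */
    } rs_entry_t;

    /* Pick the oldest ready uop bound to `port`; returns -1 if none. */
    static int rs_select(const rs_entry_t rs[RS_ENTRIES], int port)
    {
        int best = -1;
        for (int i = 0; i < RS_ENTRIES; i++) {
            if (rs[i].valid && rs[i].ready && rs[i].port == port &&
                (best < 0 || rs[i].age < rs[best].age))
                best = i;
        }
        return best;
    }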
ECE4100/6100
ReOrder Buffer
 A 40-entry circular buffer
Similar to that described in [SmithPleszkun85]
157-bit wide
Provide 40 alias physical registers
 Out-of-order completion
 Deposit exception in each entry
 Retirement (or de-allocation)
After resolving prior speculation
Handles exceptions through the MS
Clears OOO state when a mispredicted branch or exception is detected
3 µops per cycle, in program order (sketched below)
For multi-µop x86 instructions: none or all (atomic)
[Diagram: the allocator, RAT, and RS feed the ROB; exceptions invoke the MS for (exception) code assist, and retiring entries write the RRF.]
18
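A sketch of the in-order retirement rule (up to 3 µops per cycle, and a multi-µop x86 instruction retires all of its µops or none); the entry fields are assumptions:

    typedef struct {
        int done;            /* executed, no pending exception            */
        int last_of_inst;    /* marks the final uop of an x86 instruction */
    } rob_entry_t;

    /* Returns how many uops retire this cycle from a 40-entry circular ROB. */
    static int retire_this_cycle(const rob_entry_t *rob, int head, int count)
    {
        int n = 0, committed = 0;
        while (n < 3 && n < count && rob[(head + n) % 40].done)
            n++;
        for (int i = 0; i < n; i++)               /* stop at the last whole  */
            if (rob[(head + i) % 40].last_of_inst)
                committed = i + 1;                /* x86 instruction boundary */
        return committed;
    }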
ECE4100/6100
Memory Execution Cluster
[Diagram: Memory Cluster Blocks. LD, STA, and STD µops from the RS/ROB enter the load buffer and the store buffer; the DTLB translates their addresses, and accesses go to the DCU (with its fill buffers, FB) and out through the EBL.]
 Manage data memory accesses
 Address translation
 Detect violations of access ordering
 Fill buffers in the DCU (similar to MSHRs [Kroft’81]) handle cache misses (non-blocking)
19
ECE4100/6100
Memory Order Buffer (MOB)
 Allocated by the ALLOC
 A second-order RS for memory operations
 1 µop for a load; 2 µops for a store: Store Address (STA) and Store Data (STD)
MOB
 16-entry load buffer (LB)
 12-entry store address buffer (SAB)
 SAB works in unison with
 Store data buffer (SDB) in MIU
 Physical Address Buffer (PAB) in DCU
 Store Buffer (SB): SAB + SDB + PAB
 Senior Stores
 Upon STD/STA retiring from the ROB, the SB marks the store “senior”
 Senior stores are committed to memory in program order when the bus is idle or the SB is full
 Prefetch instructions in P-III have senior-load behavior, since they have no explicit architectural destination
20
ECE4100/6100
Store Coloring
x86 Instruction        µops                       Store color
mov (0x1220), ebx      std (ebx); sta 0x1220      2
mov (0x1110), eax      std (eax); sta 0x1100      3
mov ecx, (0x1220)      ld                         3
mov edx, (0x1280)      ld                         3
mov (0x1400), edx      std (edx); sta 0x1400      4
mov edx, (0x1380)      ld                         4
 ALLOC assigns Store Buffer IDs (SBIDs) in program order
 ALLOC tags each load with the most recent SBID (its store color)
 Loads are checked against stores with an equal or older SBID for potential address conflicts (sketched below)
 The SDB forwards the store data if a conflict is detected
21
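A sketch of the check a load performs against older stores using its color (the store-buffer layout is an assumption):

    #include <stdint.h>

    typedef struct {
        int      valid, addr_known;
        uint32_t addr;
        int      sbid;
    } sb_entry_t;

    /* Returns the youngest older store matching the load address (so the SDB
     * can forward its data), or -1 if the load can go to the DCU. */
    static int check_older_stores(const sb_entry_t *sb, int n,
                                  uint32_t ld_addr, int ld_color)
    {
        int hit = -1;
        for (int i = 0; i < n; i++) {
            if (!sb[i].valid || sb[i].sbid > ld_color)
                continue;                            /* younger than the load */
            if (sb[i].addr_known && sb[i].addr == ld_addr &&
                (hit < 0 || sb[i].sbid > sb[hit].sbid))
                hit = i;                             /* youngest older match  */
            /* (a real MOB would also block the load behind an older store
             *  whose address is still unknown) */
        }
        return hit;
    }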
ECE4100/6100
Memory Type Range Registers (MTRR)
Control registers written by the system (OS)
Supporting Memory Types
UnCacheable (UC)
Uncacheable Speculative Write-combining (USWC or WC)
Use a fill buffer entry as WC buffer
WriteBack (WB)
Write-Through (WT)
Write-Protected (WP)
E.g. to support copy-on-write in UNIX: save memory by letting child processes share pages with their parents, and create new pages only when a child attempts to write
Page Miss Handler (PMH)
Looks up the MTRRs while supplying physical addresses
Returns the memory type and physical address to the DTLB (sketched below)
22
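A rough sketch of the PMH-style lookup (real MTRRs are base/mask MSR pairs; the flat range table here is a simplification and all names are assumptions):

    #include <stdint.h>

    typedef enum { MT_UC, MT_USWC, MT_WB, MT_WT, MT_WP } mem_type_t;

    typedef struct {
        uint32_t   base, size;     /* physical range [base, base+size) */
        mem_type_t type;
    } mtrr_range_t;

    /* Returns the memory type for a physical address (handed back to the
     * DTLB together with the physical address). */
    static mem_type_t memory_type(const mtrr_range_t *r, int n,
                                  uint32_t phys_addr, mem_type_t dflt)
    {
        for (int i = 0; i < n; i++)
            if (phys_addr - r[i].base < r[i].size)
                return r[i].type;
        return dflt;               /* default type when no range matches */
    }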
ECE4100/6100
Intel NetBurst Microarchitecture
Pentium 4’s microarchitecture, a post-P6 new generation
Original target market: graphics workstations, but … the major competitor screwed themselves up …
Design Goals:
Performance, performance, performance, …
Unprecedented multimedia/floating-point performance
Streaming SIMD Extensions 2 (SSE2)
Reduced CPI
Low latency instructions
High bandwidth instruction fetching
Rapid Execution of Arithmetic & Logic operations
Reduced clock period
New pipeline designed for scalability
23
ECE4100/6100
Innovations Beyond P6
Hyperpipelined technology
Streaming SIMD Extension 2
Enhanced branch predictor
Execution trace cache
Rapid execution engine
Advanced Transfer Cache
Hyper-threading Technology (in Xeon and Xeon MP)
24
ECE4100/6100
Pentium 4 Fact Sheet
 IA-32 fully backward compatible
 Available at speeds ranging from 1.3 to ~3 GHz
 Hyperpipelined (20+ stages)
 42+ million transistors
 0.18 μm for 1.7 to 1.9 GHz; 0.13 μm for 1.8 to 2.8 GHz
 Die size of 217 mm2
 Consumes 55 watts of power at 1.5 GHz
 400MHz (850) and 533MHz (850E) system bus
 512KB or 256KB 8-way full-speed on-die L2 Advanced Transfer Cache
(up to 89.6 GB/s @2.8GHz to L1)
 1MB or 512KB L3 cache (in Xeon MP)
 144 new 128 bit SIMD instructions (SSE2)
 HyperThreading Technology (only enabled in Xeon and Xeon MP)
25
ECE4100/6100
Recent Intel IA-32 Processors
26
ECE4100/6100
Building Blocks of Netburst
[Block diagram: building blocks of NetBurst. The system bus feeds the bus unit and the level 2 cache. Front end: fetch/decode, the execution trace cache (ETC) with its µROM, and the BTB/branch predictor (receiving branch history updates from retirement). Out-of-order engine: OOO logic and retire. Execution: the L1 data cache and memory subsystem plus the INT and FP execution units.]
27
ECE4100/6100
Pentium 4 Microarchitecture
[Block diagram: Pentium 4 microarchitecture. Front end: I-TLB/prefetcher and a 4K-entry BTB feed the IA32 decoder (64-bit fetch path) and code ROM, which fill the execution trace cache (with its own 512-entry trace-cache BTB) and the µop queue. Out-of-order engine: allocator/register renamer; a memory µop queue with a memory scheduler and an INT/FP µop queue with fast, slow/general FP, and simple FP schedulers; INT register file/bypass network and FP register file/bypass network. Execution: two 2x-pumped fast ALUs (simple instructions), a slow ALU (complex instructions), load-address and store-address AGUs, FP move, and FP/MMX/SSE/SSE2 units. Memory: L1 data cache (8KB, 4-way, 64-byte lines, write-through, 1 read + 1 write port), unified L2 cache (256KB, 8-way, 128-byte lines, write-back, 48 GB/s to the core at 1.5 GHz over a 256-bit path), and a quad-pumped 64-bit system bus (400/533 MHz, 3.2/4.3 GB/s) through the BIU.]
28
ECE4100/6100
Pipeline Depth Evolution
P5 microarchitecture: PREF, DEC, DEC, EXEC, WB
P6 microarchitecture: IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2
NetBurst microarchitecture: TC NextIP, TC Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Ck, Drive
29
ECE4100/6100
Execution Trace Cache
Primary first level I-cache to replace conventional L1
Decoding several x86 instructions at high frequency is difficult and takes several pipeline stages
Branch misprediction penalty is horrible: 20 pipeline stages lost vs. 10 stages in P6
Advantages
Cache post-decode ops
High bandwidth instruction fetching
Eliminate x86 decoding overheads
Reduce branch recovery time if TC hits
Hold up to 12,000 ops
6 ops per trace line
Many (?) trace lines in a single trace
30
ECE4100/6100
Execution Trace Cache
Deliver 3 op’s per cycle to OOO engine
X86 instructions read from L2 when TC misses (7+ cycle latency)
TC Hit rate ~ 8K to 16KB conventional I-cache
Simplified x86 decoder
Only one complex instruction per cycle
Instruction > 4 op will be executed by micro-code ROM (P6’s MS)
Perform branch prediction in TC
512-entry BTB + 16-entry RAS
With BP in x86 IFU, reduce 1/3 misprediction compared to P6
Intel did not disclose the details of BP algorithms used in TC and x86
IFU (Dynamic + Static)
31
ECE4100/6100
Out-Of-Order Engine
Similar design philosophy to P6; it uses
Allocator
Register Alias Table
128 physical registers
126-entry ReOrder Buffer
48-entry load buffer
24-entry store buffer
32
ECE4100/6100
Register Renaming Schemes
[Diagram: register renaming schemes. P6 register renaming: a single RAT (EAX-EBP) maps architectural registers into the 40-entry ROB, which is allocated sequentially and holds both data and status; retired values go to the RRF. NetBurst register renaming: a front-end RAT and a retirement RAT (EAX-EBP) map into a separate 128-entry register file that holds the data, while the 126-entry ROB, allocated sequentially, holds only status.]
33
ECE4100/6100
Micro-op Scheduling
 µop FIFO queues
Memory queue for loads and stores
Non-memory queue
 µop schedulers
Several schedulers fire instructions to execution (the counterpart of P6’s RS)
4 distinct dispatch ports
Maximum dispatch: 6 µops per cycle (2 per cycle from each fast ALU on ports 0 and 1; 1 each from the load/store ports)
[Diagram: the four dispatch ports. Exec port 0: a 2x-pumped fast ALU (add/sub, logic, store data, branches) and the FP move unit (FP/SSE moves, FP/SSE stores, FXCH). Exec port 1: a 2x-pumped fast ALU (add/sub), the INT exec unit (shift, rotate), and the FP exec unit (FP/SSE add, FP/SSE mul, FP/SSE div, MMX). Load port: memory loads (loads, LEA, prefetch). Store port: memory stores.]
34
ECE4100/6100
Data Memory Accesses
8KB 4-way L1 + 256KB 8-way L2 (with a HW prefetcher)
Load-to-use speculation
Dependent instruction dispatched before load finishes
Due to the high frequency and deep pipeline depth
Scheduler assumes loads always hit L1
If the L1 misses, dependent instructions that have already left the scheduler temporarily receive incorrect data (mis-speculation)
Replay logic: re-execute the mis-speculated µops
Independent instructions are allowed to proceed
Up to 4 outstanding load misses (= 4 fill buffers in original P6)
Store-to-load forwarding buffer
 24 entries
 Forwarding requires the same starting physical address
 and load data size <= store data size (condition sketched below)
35
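A sketch of the forwarding condition stated above (same starting physical address, load size no larger than store size); the buffer layout is an assumption:

    #include <stdint.h>

    typedef struct {
        int      valid;
        uint64_t start_pa;   /* starting physical address of the store */
        int      size;       /* store data size in bytes               */
    } fwd_entry_t;

    static int can_forward(const fwd_entry_t *e, uint64_t ld_pa, int ld_size)
    {
        return e->valid && e->start_pa == ld_pa && ld_size <= e->size;
    }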
ECE4100/6100
Streaming SIMD Extension 2
P-III SSE (Katmai New Instructions: KNI)
Eight 128-bit wide xmm registers (new architecture state)
Single-precision 128-bit SIMD FP
Four 32-bit FP operations in one instruction
Broken down into 2 ops for execution (only 80-bit data in ROB)
64-bit SIMD MMX (use 8 mm registers — map to FP stack)
Prefetch (nta, t0, t1, t2) and sfence
P4 SSE2 (Willamette New Instructions: WNI)
Support Double-precision 128-bit SIMD FP
Two 64-bit FP operations in one instruction
Throughput: 2 cycles for most of SSE2 operations (exceptional examples:
DIVPD and SQRTPD: 69 cycles, non-pipelined.)
Enhanced 128-bit SIMD MMX using xmm registers
36
ECE4100/6100
Examples of Using SSE
[Diagram: SSE examples. Packed SP FP operation (e.g. ADDPS xmm1, xmm2): the op is applied to all four 32-bit lanes, giving {X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0}. Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): the op is applied only to the low lane, giving {X3, X2, X1, X0 op Y0}. Shuffle FP operation (e.g. SHUFPS xmm1, xmm2, imm8): lanes are selected from both sources by the 8-bit immediate; with 0xF1 the result is {Y3, Y3, X0, X1}.]
37
ECE4100/6100
Examples of Using SSE and SSE2
[Diagram: the SSE operations of the previous slide — packed SP FP (e.g. ADDPS xmm1, xmm2), scalar SP FP (e.g. ADDSS xmm1, xmm2), shuffle FP (e.g. SHUFPS xmm1, xmm2, imm8, illustrated with 0xF1) — together with their SSE2 counterparts: packed DP FP (e.g. ADDPD xmm1, xmm2) applies the op to both 64-bit lanes, giving {X1 op Y1, X0 op Y0}; scalar DP FP (e.g. ADDSD xmm1, xmm2) applies it only to the low lane; shuffle DP (e.g. SHUFPD xmm1, xmm2, imm2) selects one lane from each source (Y1 or Y0, X1 or X0) via a 2-bit immediate. Equivalent compiler intrinsics are sketched below.]
38
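For reference (not from the slides), the same operations expressed with the SSE/SSE2 compiler intrinsics; compile with e.g. -msse2:

    #include <xmmintrin.h>   /* SSE  */
    #include <emmintrin.h>   /* SSE2 */

    void sse_examples(void)
    {
        __m128  a = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);   /* X3..X0 */
        __m128  b = _mm_set_ps(7.0f, 6.0f, 5.0f, 4.0f);   /* Y3..Y0 */
        __m128d c = _mm_set_pd(1.0, 0.0);                  /* X1, X0 */
        __m128d d = _mm_set_pd(3.0, 2.0);                  /* Y1, Y0 */

        __m128  packed_sp = _mm_add_ps(a, b);           /* ADDPS            */
        __m128  scalar_sp = _mm_add_ss(a, b);           /* ADDSS (low lane) */
        __m128  shuffled  = _mm_shuffle_ps(a, b, 0xF1); /* SHUFPS, imm 0xF1 */
        __m128d packed_dp = _mm_add_pd(c, d);           /* ADDPD            */
        __m128d scalar_dp = _mm_add_sd(c, d);           /* ADDSD (low lane) */

        (void)packed_sp; (void)scalar_sp; (void)shuffled;
        (void)packed_dp; (void)scalar_dp;
    }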
ECE4100/6100
HyperThreading
In Intel Xeon Processor and Intel Xeon MP
Processor
Enable Simultaneous Multi-Threading (SMT)
Exploit ILP resources through TLP (Thread-Level Parallelism)
Issuing and executing multiple threads at the same time
Single P4 Xeon appears to be 2 logical processors
Share the same execution resources
Architectural states are duplicated in hardware
39
ECE4100/6100
Multithreading (MT) Paradigms
[Diagram: functional-unit (FU1-FU4) utilization over execution time for threads 1-5 (plus unused slots), comparing a conventional single-threaded superscalar, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), a chip multiprocessor (CMP), and simultaneous multithreading.]
40
ECE4100/6100
More SMT commercial processors
Intel Xeon Hyperthreading
Supports 2 replicated hardware contexts: PC (or IP) and architectural registers
New directions of usage
Helper (or assisted) threads (e.g. speculative precomputation)
Speculative multithreading
Clearwater (once called Xtream Logic): an 8-context SMT “network processor” designed by the DISC architects (the company no longer exists)
SUN 4-SMT-processor CMP?
41
ECE4100/6100
Speculative Multithreading
 SMT can justify a datapath wider than single-thread ILP can fill
 But the datapath is only fully utilized by multiple threads
 How can a single-thread program be sped up by utilizing multiple threads?
 What to do with spare resources?
Execute both sides of hard-to-predict branches
Eager execution or Polypath execution
Dynamic predication
Send another thread to scout ahead to warm up caches & BTB
Speculative precomputation
Early branch resolution
Speculatively execute future work
Multiscalar or dynamic multithreading
e.g. start several loop iterations concurrently as different threads; if a data dependence is detected, redo the work
Run a dynamic compiler/optimizer on the side
Dynamic verification
DIVA or Slipstream Processor
42