Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Next Generation Instruction Set Architecture John Crawford, Intel Fellow Jerry Huck Director, Microprocessor Architecture Intel Corporation Manager and Lead Architect Hewlett-Packard Company ® 1 Objectives s Unveil the technology behind the next generation ISA l Today’s focus on architecture, not implementation s Context l History l Motivation s ISA Preview l A few key features l Benefits ® 2 Intel and HP Technology Alliance s Intel l Microprocessor / platform technology l 64-bit architecture definition s HP l E nterprise systems technology expertise l Architecture research advancements s Jointly defined next generation 64-bit instruction set l Instruction set specification l Compiler optimization l Performance simulation and projection ® 3 Instruction Set Architecture (ISA) Objectives s E nable industry leading system performance l Breakthrough performance l Headroom s E nable compatibility with today’s IA-32 software & PA-RISC software s Allow scalability over a wide range of implementations s Full 64-bit computing ® 4 Current State of The Art Arch. Research ??? Performance OOO SuperScalar CISC RISC ŸH/W detects implicit parallelism ŸH/W O-O-O scheduling & speculation ŸH/W renames 8-32 registers to 64+ ŸSimple, fixed length instructions ŸSequencing done by compiler ŸComplex, variable length instructions ŸSequencing done in hardware Time What’ ssnext, What’ next,beyond beyondtraditional traditionalarchitectures? architectures? ® 5 Current Performance Limiters: Branches Mispredicts limit performance Small blocks restrict code scheduling freedom s s Fragmentation l Poor utilization of wide machines l ld R1 IF use R1 unused execution slots st R2 THEN E LSE ld R4 use R4 ® 6 Current Performance Limiters: Latency to Memory Memory latency increasing relative to processor speed Load delay compounded by machine width s s l Latency hiding requires more parallelism in a wide machine branch branch is a barrier branch load load use use Scalar ® 4-way Superscalar 7 Current Performance Limiters: Extracting Parallelism Sequential execution model s original source code compiler sequential machine hardware code parallelized code parallelized parallelized code code multiple functional units . . . Compiler has limited, indirect view of hardware s Implicit Implicitparallelism parallelismlimits limitsperformance performance ® 8 Better Strategy: Explicit Parallelism Compiler exposes, enhances, and exploits parallelism in the source program and makes it explicit in the machine code. original source code compiler parallel machine code “Expose” “E xploit” “E nhance” ® 9 Next Generation Architecture Technology Arch. Research EPIC Performance OOO SuperScalar CISC RISC Time ® ŸH/W detects implicit parallelism ŸH/W O-O-O scheduling & speculation ŸH/W renames 8-32 registers to 64+ ŸSimple, fixed length instructions ŸSequencing done by compiler ŸComplex, variable length instructions. ŸSequencing done in hardware 10 E xplicitly P arallel Instruction Computing Next Generation Terminology s EPIC is the next generation technology l e.g., s IA-64 is the architecture that incorporates EPIC Technology l e.g., s RISC, CISC IA-32, PA-RISC Merced™ processor is the first IA-64-based implementation l e.g., Pentium® ® II processor, PA-8500 11 Key 64-bit ISA Features within IA-64 s Architecture Resources s Instruction Format s Predication s Speculation s (Branch Architecture) s (Floating-Point Architecture) s (Multimedia Architecture) s (Memory Management & Protection) s (Compatibility) ® 12 Architecture Resources Provide for Parallel E xecution & Scalability M E M O R Y 128 128 GRs GRs Execution Units 128 128 FRs FRs s Massively resourced - large register files l Traditional architectures are forced to rename registers s Inherently scalable - replicated function units s Explictly parallel - transistors used more effectively ® 13 Instruction Format: Explicit Parallelism s Breaking the sequential execution paradigm l Explicit instruction dependency: template l Flexibly groups any number of independent instructions s Explicitly scheduled parallelism l E nables compiler to create greater parallelism l Simplifies hardware by removing dynamic mechanisms l Fully interlocked- hardware provides compatibility 128-bit bundle 127 0 Instruction 2 s Instruction 1 Instruction 0 Template Modest code size expansion The Thenew newinstruction instructionformat formatenables enablesscalability scalabilityw/ w/compatibility compatibility ® 14 Branches Limit Performance Traditional Architectures: 4 basic blocks instr 1 instr 2 if then : p1, p2 <-cmp(a==b) jump p2 instr 3 instr 4 : jump instr 5 else instr 6 : instr 7 instr 8 : Control Controlflow flowintroduces introducesbranches branches ® 15 Predication if then instr 1 instr 2 : p1, p2 <-cmp(a==b) jump p2 (p1) instr 3 (p1) instr 4 : jump (p2) instr 5 else (p2) instr 6 : instr 7 instr 8 : The Thepredicate predicatecan canremove removebranches branches ® 16 Predication E nhances Parallelism Traditional Architectures : 4 basic blocks EPIC Architectures: Architectures: 1 basic block instr 2 if if : p1, p2 <-cmp <-cmp(a==b) (a==b) jump p2 instr 3 then instr 4 then instr 1 instr 2 : p1, p2 <- cmp(a==b) cmp(a==b) (p1) instr 3 (p1) instr 4 : : jump else (p2) instr 5 (p2) instr 6 : instr 5 else instr 6 instr 7 instr 8 : : instr 7 instr 8 : new Basic Block Predication Predicationenables enablesmore moreeffective effectiveuse useof ofparallel parallelhardware hardware ® 17 Predication: Features and Benefits s Compiler given larger scheduling scope l Nearly all instructions can be predicated l State updated if an instruction’s predicate is true, otherwise acts as a NOP l Compiler assigns predicates, compare instructions set them l Architecture provides 64 1-bit predicate registers (PR) s Predicated execution removes branches l Convert a control dependence to a data dependence l Reduce mispredict penalties s Parallel execution through larger basic blocks l Effective ® use of parallel hardware 18 Predication Increases Performance 100% Branches Removed 90% 80% Mispredicts Removed 70% 60% 50% 40% 30% 20% AVERAGE yacc wc qsort lex grep eqn cmp cccp sc ear alvinn compress eqntott li 0% espresso 10% On Onaverage, average,over overhalf halfof ofall allbranches branchesare areremoved removed Source: ISCA ‘95 S.Mahlke S.Mahlke,, et.al. ® 19 Memory Latency Causes Delays s Loads significantly affect performance l Often first instruction in dependency chain of instructions l Can incur high latencies Traditional Architectures instr 1 instr 2 ... jump_equ jump_equ Barrier Load use s ® Loads can cause exceptions 20 Speculation EPIC Architectures ;Exception Detection ld.s ld.s instr 1 instr 2 jump_equ jump_equ Propagate Exception ;Exception Delivery chk.s chk.s use Home Block s Separate load behavior from exception behavior l Speculative load instruction (ld .s)) initiates a load operation (ld.s and detects exceptions l Propagate an exception “token” (stored with destination register) from ld.s ld.s to chk.s chk.s l Speculative check instruction (chk .s)) delivers any (chk.s exceptions detected by ld.s ld.s ® 21 Speculation Minimizes the E ffect of Memory Latency Traditional Architectures instr 1 instr 2 ... jump_equ jump_equ EPIC Architectures ld.s ld.s instr 1 instr 2 jump_equ jump_equ Barrier chk.s chk.s use Load use s ;Exception Detection Propagate Exception ;Exception Delivery Home Block Give scheduling freedom to the compiler l Allows ld.s ld.s to be scheduled above branches l chk.s chk.s remains in home block, branches to exception is propagated ® 22 fixup code if an E xample: 8 Queens Loop if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true)) Original Code 2 4 5 R1=&b[j] R3=&a[i+j] R5=&c[i-j+7] ld R2=[R1] P1,P2 <-cmp (R2==true) <-cmp(R2==true) <P2> br exit True 38% Mispred 43% 6 8 9 ld R4=[R3] P3,P4 <-cmp (R4==true) <-cmp(R4==true) <P4> br exit 72% 33% 10 12 ld R6=[R5] P5,P6 <-cmp (R5==true) <-cmp(R5==true) <P5> br then else 47% 39% 1 13 13 cycles 3 potential mispredicts ® 23 E xample: 8 Queens Loop if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true)) Original Code 2 4 5 R1=&b[j] R3=&a[i+j] R5=&c[i-j+7] ld R2=[R1] P1,P2 <-cmp (R2==true) <-cmp(R2==true) <P2> br exit 6 8 9 ld R4=[R3] P3,P4 <-cmp (R4==true) <-cmp(R4==true) <P4> br exit 1 Speculation 1 2 4 5 6 7 10 12 13 ld R6=[R5] P5,P6 <-cmp (R5==true) <-cmp(R5==true) <P5> br then else 13 cycles 3 potential mispredicts ® 8 9 R1=&b[j] R3=&a[i+j] R5=&c[i-j+7] ld R2=[R1] ld.s ld.s R4=[R3] ld.s ld.s R6=[R5] P1,P2 <-cmp (R2==true) <-cmp(R2==true) <P2> br exit chk.s chk.s R4 P3,P4 <-cmp (R4==true) <-cmp(R4==true) <P4> br exit chk.s chk.s R6 P5,P6 <-cmp (R5==true) <-cmp(R5==true) <P5> br then else 9 cycles 3 potential mispredicts 24 E xample: 8 Queens Loop if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true)) Speculation 1 2 4 5 6 7 R1=&b[j] R3=&a[i+j] R5=&c[i-j+7] ld R2=[R1] ld.s ld.s R4=[R3] ld.s ld.s R6=[R5] P1,P2 <-cmp (R2==true) <-cmp(R2==true) <P2> br exit chk.s chk.s R4 P3,P4 <-cmp (R4==true) <-cmp(R4==true) <P4> br exit Predication 1 2 4 5 6 8 9 chk.s chk.s R6 P5,P6 <-cmp (R5==true) <-cmp(R5==true) <P5> br then else 7 9 cycles 3 potential mispredicts ® R1=&b[j] R3=&a[i+j] R5=&c[i-j+7] ld R2=[R1] ld.s ld.s R4=[R3] ld.s ld.s R6=[R5] P1,P2 <-cmp (R2==true) <-cmp(R2==true) <P2> br exit <p1> chk.s chk.s R4 <p1> P3,P4 <-cmp (R4==true) <-cmp(R4==true) <P4> br exit <p3> chk.s chk.s R6 <p3> P5,P6 <-cmp (R5==true) True Mispred <-cmp(R5==true) <P5> br then 12% 16% else 7 cycles 1 potential mispredict 25 E xample: 8 Queens Loop if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true)) Original Code Predication R1=&b[j] R1=&b[j] 1 R3=&a[i+j] R3=&a[i+j] 1 R5=&c[i-j+7] R5=&c[i-j+7] ld R2=[R1] ld R2=[R1] 2 ld.s ld.s R4=[R3] 2 4 P1,P2 <-cmp (R2==true) <-cmp(R2==true) ld.s ld.s R6=[R5] RESULT: Almost half cycles are RESULT: Almost halfthe therequired required cycles arereduced reduced <P2> br exit 5 P1,P2 <-cmp (R2==true) 4 <-cmp(R2==true) <P2> br exit are eliminated. and and2/3 2/3of ofthe thepotential potential mispredicts mispredicts are eliminated. <p1> chk.s chk.s R4 6 ld R4=[R3] 5 <p1> P3,P4 <-cmp (R4==true) <-cmp(R4==true) 8 P3,P4 <-cmp (R4==true) <-cmp(R4==true) <P4> br exit 9 <P4> br exit <p3> chk.s chk.s R6 6 <p3> P5,P6 <-cmp (R5==true) <-cmp(R5==true) <P5> br then ld R6=[R5] 10 7 else P5,P6 <-cmp (R5==true) <-cmp(R5==true) 12 <P5> br then 13 else 13 cycles 3 potential mispredicts ® 7 cycles 1 potential mispredict 26 EPIC is the Next Generation Technology Explicitly Parallel Instruction Computing s Features that enhance ILP s Explicit parallelism l Predication l ILP is explicit in machine code l Compiler schedules across a wide scope l Speculation l Others... l Binary compatibility across all family members EPIC s Resources for parallel execution l Many registers l Many functional units Performance ŸNo Binary Compatibility ŸNo Performance Scaling ŸCode Size Explosion CISC l Inherently scalable OOO SuperScalar ŸH/W detects Independent Instructions ŸH/W O-O-O Scheduling & Speculation ŸH/W Renames 8-32 Registers to 64+ RISC ŸSimple, fixed length instructions ŸSequencing done by Compiler ŸComplex, variable length instructions. ŸSequencing done in hardware Time ® 27 IA-64: EPIC Technology Applied s E nables industry leading performance and capability l Explicitly parallel: Beyond the limitations of current architectures l Inherently scalable, massively resourced: resourced: Provides headroom for future market requirements l Fully compatible: For existing applications and the future s Addresses server and workstation market requirements l Enterprise transaction processing l Decision support l Graphical imaging l Volume rendering l Many others The TheNext NextGeneration Generationin inComputer ComputerArchitecture Architecture ® 28