Effective Compilation Support for Variable Instruction Set Architecture
Jack Liu, Timothy Kong, Fred Chow
Cognigine Corp.
www.cognigine.com

Outline
1. VISC Architecture
2. Compile-time Configurable Code Generation
3. Managing the Dictionary
4. Concluding Remarks

Configurable Computing Motivation
• Higher performance
  – processor and instruction set customized to type of application
• Lower hardware cost
  – non-essential features excluded
• Shorter time-to-market

Variable Instruction Set Architecture (VISC Architecture™)
A new approach to configurable computing:
• Fixed processor hardware
• Many types of operations provided
• Numerous instruction variants (CISC-style)
• Per-program instruction set tailoring during compile time

Background of this work
Cognigine CGN16100 Network Processor
• Single-chip, fully programmable network processor
• Processing cores: 16 Re-configurable Communications Unit (RCU) processor cores
• VISC architecture
• 4 64-bit parallel execution units
• Multi-threaded
• 512 KB on-chip memory (text and data)

VISC Architecture™
• Dictionary: 256 entries (instruction set for current program)
• Dictionary entry:
  – 32-bit: 2 operations
  – 64-bit: 4 operations
  – 128-bit: 8 operations
• Instruction: opcode opnd0 opnd1 opnd2 opnd3
• Opcode: 8-bit

Motivation for VISC Architecture
1. Efficient way to encode/decode the many operation variants with different addressing modes
   • Not all used in each program
2. High instruction encoding density
   • Small opcode bit count
   • Operands shared among multiple operations
3. Simplified control logic for VLIW-style ILP
   • Up to 8 operations per cycle

Operation Specification
In Dictionary Entry (only specified once):
1. Operation name
2. Operation variants:
   • Signed and unsigned
   • Operand and result sizes: 8-bit, 16-bit, 32-bit, 64-bit
   • Support different sizes among operand(s) or result
   • Vector: 64v8, 64v16, 64v32, 32v8, 32v16
3. Data path to each operand/result
In Instruction:
1. Operands' encoding formats
2. Actual operands

RCU Architecture
• 5-stage pipeline
• 4-way multi-threaded
• Hardware RSF synchronization
• 128-bit reconfigurable address path
• 256-bit reconfigurable data path
[Block diagram: RSF "back-side" ports, pointer file, packet buffers, dictionary, data memory (registers, scratch memory), dictionary decode, four source routes feeding four execution units, pipeline and thread control, address calculation, instruction cache, data flow synchronization, RSF connector]

Roles of Compiler for VISC Architecture
1. Determine best instruction set stored in dictionary for best execution time performance
2. Generate optimized code sequence based on best instruction set
3. Cater to various hardware limitations:
   • Dictionary limit
   • Data path constraints
   • Dictionary and instruction encoding constraints

New Compilation Approach: Configurable Code Generation
• Exact form of generated instructions decided in the last instruction scheduling phase
• Direct result of instruction compaction based on what is allowed by the hardware

Compiler Implementation Method
• Retarget SGI Pro64 (Open64) compiler to an Abstract Machine
• Code generator operates on an Abstract Operation Representation
  – Code generation optimizations left intact
• Add new Instruction and Dictionary Finalization (IDF) phase as post-pass
  IDF Phase 1:
  – Instruction scheduling and folding
  – Abstract operations converted to target code sequence
  IDF Phase 2:
  – Output VISC instructions and dictionary entries

Compiler Phase Structure
[Figure: C → GNU / Pro64™ Front-end → WHIRL → Pro64™ Back-end: Optimizer → Code Generator → IDF → Assembly Program: Instructions, Dictionary]

Abstract Operation Representation (AOR)
Each operation corresponds to a micro-operation in the core execution units
• RISC-like formats
  – r1 = op r2, r3
  – r2 = load <offset>(<base>)
  – store r2 <offset>(<base>)
  – r1 = loadimm <imm>
• Optimizations in AOR reflected in final code
• No predisposition of compiler to any specific instruction format

Multiple AOR ops can be combined to single target operation
• Operations taking immediate operand
    r2 = move <imm>
    r3 = add r1, r2        =>   r3 = addi r1 <imm>
• Operations supporting memory operands
    r2 = load 4(sp)
    r3 = add r1, r2        =>   r3 = add r1 4(sp)
• Post increment/decrement memory operations
    r2 = load 0(r1)
    r1 = addi r1, 4        =>   r2 = load 0(r1++)
• Branches on condition codes
    r1 = add r2, r3
    ...
    compare (r1 != 0)
    br.z label             =>   r1 = add r2, r3
                                br.z label   (only if immediately after)
• Others

IDF Approach
Instruction scheduling + following tasks:
– Instruction folding
– Opcode selection
– Modelling of irregular hardware constraints
– Modelling of encoding constraints
– Monitoring of states of condition codes and transient registers
– Keeping track of dictionary contents
Use enumeration (branch and bound) approach

Example of IDF Processing
Input:
    $w80 = move 0x55
    $w91 = move 0xf8
    $w70 = add $w70, $w80
    $w71 = xor $w92, $w80
    $w90 = sub $w92, $w91
    store 8($p1) = $w90
Dictionary (entry 3): add xor sub nop
Instruction: op3 8($p1) $w70 0x55 0xf8
• move and store instructions subsumed
• $w71, $w92 mapped to transient registers

IDF Scheduling Algorithm
Input: Sequence of operations in BB
[Flowchart: start → estimate initial bound_sch → search for schedule with length <= bound_sch → succeed? If no: bound_sch = bound_sch + 1 and search again. If yes: end]
To speed up the search, shrink solution space by:
– Coming up with high initial bound_sch
– Prune useless search paths continuously
  • Tight hardware constraints help

Managing the Dictionary
• Dictionary usage increases due to:
  – Program size: more variety of operations
  – High ILP: more combinations of operations
  – Library code linked in
• Currently, dictionary contents fixed for each executable
• Role of linker:
  – Merge dictionary entries with identical contents across files/libraries
  – Error message on dictionary overflow
• Role of compiler:
  – Maximize dictionary entry re-use

Dictionary Compilation Strategy:
• Keep track of existing dictionary entries during compilation
  – Extract dictionary entries from:
    • Libraries and .s files being linked
    • .o files compiled before current file
    Example: cc a.c b.o c.s
  – Maintain table of existing dictionary entries
  – Add to table as new entries are generated
• Re-use existing dictionary entries
• Bias scheduling towards dictionary conservation as dictionary fills up

User Control of Dictionary Compilation
Best program performance demands a near-full dictionary. When the dictionary overflows, the program needs to be re-compiled.
Provide user control mechanisms:
– Trade-off between dictionary consumption and program performance
– Command line option: -CG:dict_usage=n (n = 0…10)
– Embedded in code: #pragma dict_usage n
dict_usage is dictionary budget guideline for IDF
– Low dict_usage:
  • Fewer new dictionary entries created
  • Low ILP
– High dict_usage:
  • Tighter instruction schedule
  • More dictionary entries created

IDF Support of dict_usage
Additional search goal bound_dict
– Number of new dictionary entries allowed for current BB
– Automatically adjusted lower with more pre-existing entries
When bound_dict reached during enumeration, disallow creating new dictionary entry (unless single operation)
[Chart: instructions and dict entries (vertical axis 0 to 800) for dict_usage = 10, 8, 3, 2, 0]

Experimental Results
Summary (with dict_usage=10):
• ILP from IDF scheduling: 1.38 ops per instruction
• ILP from relaxed scheduling: 1.51 ops per instruction
• 23% of all subsumable operations subsumed
• Each dictionary entry referred to by 2.63 instructions (statically)
• Scheduling via enumeration: 100 times slower than one-pass schedulers
• Compilation time: 1 to 2 minutes per program

Concluding Remarks
• VISC approach most suitable for embedded processors
  – Limited program size
  – Dictionary space less of an issue
  – Slow compilation tolerable
  – CISC-style instructions enable small code size
• Compilation support key to deploying applications on VISC
  – Very hard to write in assembly language
  – Advanced optimizations performed by compiler
  – Dictionary managed by compiler with user hints
• Compile-time configurable code generation enables RISC compilation techniques to generate CISC output
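
Illustration: IDF scheduling loop (sketch)
The loop on the "IDF Scheduling Algorithm" slide (estimate an initial bound, search for a schedule within it, raise the bound by one on failure) can be sketched roughly as below. This is a toy model, not the Cognigine implementation: the names `search`, `critical_path`, `schedule`, and the `width` parameter are hypothetical, the only constraints modelled are data dependences and a fixed number of operations per instruction, and folding, encoding constraints, and dictionary tracking are left out.

```python
from math import ceil


def critical_path(ops, deps):
    """Length of the longest dependence chain (ops are in program order)."""
    depth = {}
    for op in ops:
        depth[op] = 1 + max((depth[d] for d in deps.get(op, ())), default=0)
    return max(depth.values(), default=0)


def search(ops, deps, width, bound):
    """Enumerate placements of ops into `bound` instructions.

    Returns {op: instruction index} or None if no schedule fits.
    Assumption: an op must be placed strictly after all of its predecessors.
    """
    slot = {}
    load = [0] * bound  # operations already packed into each instruction

    def place(i):
        if i == len(ops):
            return True
        op = ops[i]
        earliest = max((slot[d] + 1 for d in deps.get(op, ())), default=0)
        for s in range(earliest, bound):
            if load[s] < width:  # prune: instruction s is already full
                slot[op] = s
                load[s] += 1
                if place(i + 1):
                    return True
                load[s] -= 1  # backtrack and try the next instruction
                del slot[op]
        return False

    return dict(slot) if place(0) else None


def schedule(ops, deps, width=4):
    """Start from a lower bound on schedule length and relax it until a fit is found."""
    bound = max(ceil(len(ops) / width), critical_path(ops, deps))
    while True:
        result = search(ops, deps, width, bound)
        if result is not None:
            return bound, result
        bound += 1


# Four operations, two per instruction; d consumes the results of a, b and c.
# The initial bound of 2 fails, so the loop retries with 3.
length, slots = schedule(["a", "b", "c", "d"], {"d": ["a", "b", "c"]}, width=2)
```

Starting from the highest provable lower bound matters because every failed bound costs a full enumeration. In this sketch the only pruning is the capacity check, whereas the slides note that tight hardware constraints prune the real search further.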