A Framework for Studying Effects of VLIW Instruction Encoding and Decoding Schemes
Anup Gangwar
November 28, 2001
Embedded Systems Group, IIT Delhi

Overview
• The VLIW code size expansion problem
• What such a framework needs to support
• The Trimaran compiler infrastructure
• The HPL-PD architecture
• Extensions to the various modules of Trimaran
• Results
• Future work
• Acknowledgements

Choices for exploiting ILP
• The architectural choices for utilizing ILP:
  – Superscalar processors
    • Try to extract ILP at run time
    • Complex hardware
    • Limited clock speeds and high power dissipation
    • Not suited for embedded applications
  – VLIW processors
    • The compiler has a lot of knowledge about the hardware
    • The compiler extracts ILP statically
    • Simplified hardware
    • Possible to attain higher clock speeds

Problems with VLIW processors
• A complex compiler is required to extract ILP from the application program
• Adequate hardware support is required for compiler-controlled execution
• Code size expands due to explicit NOPs if
  – the application does not contain enough parallelism, or
  – the compiler is not able to extract the parallelism from the application
• Hence the need for good instruction encoding and NOP compression schemes

What such a framework should support
• Quick retargetability
• Studying the effect of a particular instruction encoding and decoding scheme on processor performance
• Studying the code size minimization due to a particular instruction encoding scheme
• Studying the memory bandwidth requirements imposed by a particular instruction decoding scheme
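The NOP-driven code size expansion above can be made concrete with a toy calculation. The slot count, operation width and schedule below are made-up numbers for illustration, not actual HPL-PD parameters:

```python
# Toy illustration of VLIW code size expansion: with a fixed-width
# instruction word, every empty issue slot is stored as an explicit NOP.
SLOTS = 5     # issue slots per VLIW instruction word (assumed)
OP_BITS = 32  # bits per operation, and per explicit NOP (assumed)

def code_size_bits(schedule):
    """schedule: one list of issued ops per cycle."""
    # Fixed-width encoding: every empty slot costs a full NOP.
    uncompressed = len(schedule) * SLOTS * OP_BITS
    # Idealized compressed encoding: NOPs elided entirely
    # (a real scheme also pays some template/header bits).
    compressed = sum(len(ops) for ops in schedule) * OP_BITS
    return uncompressed, compressed

# A low-ILP schedule: at most two of the five slots are ever used.
full, packed = code_size_bits([["ADD_W"], ["MUL_W", "L_W"], ["BR"]])
```

With this schedule the fixed-width form costs 480 bits against 128 bits for the idealized NOP-free form, which is the gap that encoding and NOP compression schemes target.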
Trimaran Compiler Infrastructure
C program → IMPACT → bridge code → ELCOR → ELCOR IR → simulator → statistics, with all phases parameterized by the HMDES machine description
• IMPACT: ANSI C parsing, code profiling, classical machine-independent optimizations, basic block formation
• ELCOR: machine-dependent code optimizations, code scheduling, register allocation, ELCOR IR to low-level C files
• Simulator: HPL-PD virtual machine, cache simulation, performance statistics
• Statistics: compute and stall cycles, cache stats, spill code info

Various modules of Trimaran - 1
• IMPACT
  – Developed by UIUC's IMPACT group
  – Trimaran uses only the IMPACT front-end
  – Classical machine-independent optimizations
  – Outputs a low-level IR, the Trimaran bridge code
• ELCOR
  – Developed by HPL's CAR group
  – It is the compiler backend
  – Performs register allocation and code scheduling
  – Parameterized by the HMDES machine description
  – Outputs ELCOR IR annotated with HPL-PD assembly

Various modules of Trimaran - 2
• HMDES
  – Developed by UIUC's IMPACT group
  – Specifies resource usage and latency information for an architecture
  – The input is translated to a low-level representation
  – Has efficient mechanisms for querying the database
  – Does not specify instruction format information
• HPL-PD Simulator
  – Developed by NYU's REACT-ILP group
  – Converts ELCOR's annotated IR to a low-level C representation
  – Processor performance and cache simulation
  – Generates statistics and an execution trace

Various modules of Trimaran - 3
Example ELCOR operation in the IR:
Op 7 ( ADD_W [ br<11 :I gpr 14> ] [ br<27 :I gpr 14> I<1> ] p<t>
      s_time( 3 ) s_opcode( ADD_W.0 ) attr( lc ^52 ) flags( sched ) )

Various modules of Trimaran - 4
• HMDES sections
  – Field_Type, e.g. REG, Lit etc.
  – Resource, e.g. Slot0, Slot1 etc.
  – Resource_Usage, e.g. RU_slot0 time( 0 )
  – Reservation_Table, e.g. RT_slot0 use( Slot0 )
  – Operation_Latency, e.g.
    lat1 ( time( 1 ) )
  – Scheduling_Alternative, e.g. ( format( std1 ) resv( RT1 ) latency( lat1 ) )
  – Operation, e.g. ADD_W.0 ( Alt_1 Alt_2 )
  – Elcor_Operation, e.g. ADD_W( op( "ADD_W.0" "ADD_W.1" ) )

Various modules of Trimaran - 5
The HPL-PD simulator in detail: REBEL input is converted to low-level C files plus emulation library code, which the native compiler (parameterized by HMDES and linked against the C libraries) turns into an executable for the host platform

Various modules of Trimaran - 6
The HPL-PD simulator in detail: the HPL-PD virtual machine fetches the next instruction, fetches data and executes instructions; the resulting instruction and data accesses drive the Dinero IV cache simulator, which models a level-I instruction cache, a level-I data cache and a level-II unified cache

The HPL-PD architecture
• A parameterized ILP architecture from HP Labs
• Possible to vary:
  – the number and types of FUs
  – the number and types of registers
  – the width of instruction words
  – instruction latencies
• Predicated instruction execution
• Compiler-visible cache hierarchy
• Result multicast is supported for predicate registers
• Run-time memory disambiguation instructions

The HPL-PD memory hierarchy
Registers ↔ L1 cache and data prefetch cache ↔ L2 cache ↔ main memory
• The data prefetch cache is independent of the L1 cache
• It is used to store large amounts of cache-polluting data
• It doesn't require a sophisticated cache replacement mechanism

The Framework
The decoder model is reflected in the HMDES description consumed by Trimaran, which produces performance and cache statistics; an assembler built with NJMC produces the object file and reports the code size, and a disassembler built with NJMC serves instruction-address / next-instruction requests and reports the bytes fetched

Studying impact on performance
• The HMDES modeling of the decompressor:
  – add a new resource with the latency of the decoder
  – add a new resource usage section for this decoder
  – add this resource usage to all the HPL-PD operations
• In the results there are two decompressor units with latency = 1
• The latency of the decompressor should be estimated, or generated using actual simulation
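The first two modeling steps above might look roughly like the following HMDES-style fragment. The names decomp_0, decomp_1 and RU_decomp are hypothetical, and the syntax is only an indicative sketch of the section format, not a verified specification:

```
SECTION Resource {
   decomp_0();
   decomp_1();
}
SECTION Resource_Usage {
   RU_decomp(use(decomp_0 decomp_1); time(0));
}
```

The third step would then amount to adding RU_decomp to the reservation tables referenced by every HPL-PD operation's scheduling alternatives.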
Studying code size minimization - 1
A simple template-based instruction encoding scheme
• Issue slots: IALU.0, IALU.1, FALU.0, MU.0, BU.0
• Format for ADD_W and L_W_C1_C1:
  MUL_OP | OPCODE & OPERANDS | OPCODE & OPERANDS
  00010  | IOP ; Sgpr1, Slit1, Dgpr2 | MemOP ; Sgpr1, Dgpr1 | …
• Multi-ops are decided after profiling the generated assembly code
• The multi-op field encodes:
  – the size and position of each uni-op
  – the number, size and position of the operands of each uni-op

Studying code size minimization - 2
• Instrumenting ELCOR to generate assembly code:
  1. Arrange all the ops in the IR in forward control order
  2. Choose the next basic block and initialize cycle to 0
  3. Walk the ops of this BB and dump those with s_time = cycle
  4. If BBs are left, go to step 2
  5. Dump the global data
• The actual instruction encoding is done using procedures created by NJMC

Studying code size minimization - 3
The New Jersey Machine Code Toolkit
• Deals with bits at a symbolic level
• Can be used to write assemblers, disassemblers etc.
• Supports concatenation to emit large binary data
• The representation is specified in SLED
• Has been used to write assemblers for Sparc, i486 etc.
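The five ELCOR instrumentation steps above can be sketched as follows. The data structures are hypothetical stand-ins for the ELCOR IR, not its real API: each basic block is a list of (s_time, op) pairs, and the block list is assumed to be already arranged in forward control order (step 1):

```python
# Sketch of the basic-block walk that dumps ops cycle by cycle.
def dump_assembly(basic_blocks, global_data):
    lines = []
    for bb in basic_blocks:          # step 2: next basic block, cycle = 0
        cycle = 0
        remaining = list(bb)
        while remaining:             # step 3: dump ops with s_time == cycle
            lines += [op for t, op in remaining if t == cycle]
            remaining = [(t, op) for t, op in remaining if t != cycle]
            cycle += 1               # advance to the next schedule cycle
    lines += global_data             # step 5: dump the global data
    return lines
```

Step 4 is the outer loop over basic blocks; ops scheduled in the same cycle come out together, which is what the template-based encoder needs to pack them into one multi-op.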
• VLIW instructions need to be broken up into 32-bit (max) tokens
• Emitted binary data must end on an 8-bit boundary

Studying code size minimization - 4
Machine specifications in SLED:
bit 0 is least significant
fields of TOK32 (32) Dgpr_1 0:3 Slit_1_part1 4:31
fields of TOK8 (16) Slit_1_part2 0:3 Sgpr_1 4:7 IOP 8:11 tmpl 12:14
patterns IOP_pats is any of [ ADD MUL SUB ], which is tmpl = 1 & IOP = { 0 to 2 }
constructors
  IOP_pats Sgpr_1, Slit_1, Dgpr_1 is
    IOP_pats & Sgpr_1 & Slit_1_part2 = Slit_1@[28:31]; Slit_1_part1 = Slit_1@[0:27] & Dgpr_1

Studying code size minimization - 5
Toolkit encoder output:
ADD( unsigned Sgpr_1, unsigned Slit_1, unsigned Dgpr_1 );
MUL( unsigned Sgpr_1, unsigned Slit_1, unsigned Dgpr_1 );
SUB( unsigned Sgpr_1, unsigned Slit_1, unsigned Dgpr_1 );

Specifying a matcher for the disassembler:
match
| ADD( Sgpr_1, Slit_1, Dgpr_1 ) => // Do something
| MUL( Sgpr_1, Slit_1, Dgpr_1 ) => // Do something
| SUB( Sgpr_1, Slit_1, Dgpr_1 ) => // Do something
endmatch

Studying code size minimization - 6
• The matcher application needs functions for fetching data
• Bit ordering is different on little- and big-endian machines
• The matcher fails when a large number of complex templates are given
• Breaking large multi-ops across 32-bit tokens makes the representation messy and error-prone
• Specifying addresses for forward branches requires two passes

Studying impact on memory bandwidth - 1
The typical VLIW pipeline: Instruction Fetch → Align → Decompress → Decode → DF/AG → Execute → Store Results

Studying impact on memory bandwidth - 2
• The cache simulation requires the generation of:
  – the instruction address
  – the number of bytes to fetch
• The instruction address can be generated by disassembling the instructions at run time and keeping track of jumps
• The matcher application returns the number of bytes required to disassemble an instruction
• The disassembled instruction can be compared with the instruction issued to check correctness

Studying impact on memory bandwidth - 3
• Run-time verification of disassembled instructions can be turned off for faster simulation
• Due to the restricted size of the matcher, results could not be obtained for larger programs
• Memory access addresses and bytes to fetch have been generated by hand for the SumToN application

Results - Impact on code size (Strcpy)
[Bar chart comparing code size for X86, Sparc and HPL-PD; values 207, 280 and 370 bytes]

Results - Impact on code size (SumToN)
[Bar chart comparing code size for X86, Sparc and HPL-PD; values 59, 97 and 159 bytes]

Results - Size of SLED specification for various archs.
[Bar chart comparing SLED specification size for X86, Sparc and HPL-PD; values 11500, 13199 and 15553]

Results - Cache performance comparison (SumToN)
[Bar chart comparing canonical vs. encoded instructions for two configurations; values 320, 256, 196 and 160]

Future work
• Need for automation in most parts of the framework
• A better representation for VLIW instructions than SLED:
  – unlimited token size
  – a facility to bind one field to multiple patterns
• A methodology for predicting the latency of the decompressor
• A framework for finding the optimal instruction formats

Acknowledgements
• Prof. M. Balakrishnan and Prof. Anshul Kumar
• Rodric M. Rabbah, Georgia Institute of Technology
• Shail Aditya, HP Labs
• All the friends at Philips Lab. for stimulating discussions