Effective Compilation Support for Variable Instruction Set Architecture
Jack Liu
Timothy Kong
Fred Chow
Cognigine Corp.
www.cognigine.com
Outline
1. VISC Architecture
2. Compile-time Configurable Code Generation
3. Managing the Dictionary
4. Concluding Remarks
Configurable Computing
Motivation
• Higher performance
• processor and instruction set customized to the type of application
• Lower hardware cost
• non-essential features excluded
• Shorter time-to-market
Variable Instruction Set Architecture
(VISC Architecture™)
A new approach to configurable computing:
• Fixed processor hardware
• Many types of operations provided
• Numerous instruction variants (CISC-style)
• Per-program instruction set tailoring during
compile time
Background of this work
Cognigine CGN16100 Network Processor
• Single-chip, fully programmable network processor
• Processing cores: 16 Re-configurable Communications Unit (RCU) processor cores
• VISC architecture
• 4 64-bit parallel execution units
• Multi-threaded
• 512 KB on-chip memory (text and data)
VISC Architecture™
• Dictionary: 256 entries, holding the instruction set for the current program
• Dictionary entry sizes: 32-bit (2 operations), 64-bit (4 operations), 128-bit (8 operations)
• Instruction format: opcode opnd0 opnd1 opnd2 opnd3
• Opcode: 8 bits, selecting one of the 256 dictionary entries
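To make the format above concrete, here is a rough C++ sketch of how an instruction relates to its dictionary entry; the field names, operand width, and container types are assumptions for illustration, not the actual hardware encoding:

    #include <array>
    #include <cstdint>
    #include <vector>

    // One micro-operation named by a dictionary entry (e.g. "add", "xor").
    struct Operation {
        const char* name;
    };

    // A 32-, 64-, or 128-bit dictionary entry holds 2, 4, or 8 operations.
    struct DictionaryEntry {
        std::vector<Operation> ops;      // at most 8 operations
    };

    // The per-program instruction set: at most 256 entries.
    using Dictionary = std::array<DictionaryEntry, 256>;

    // An instruction: 8-bit opcode selecting a dictionary entry, plus 4 operands
    // shared among that entry's operations (operand width here is an assumption).
    struct Instruction {
        uint8_t                 opcode;
        std::array<uint16_t, 4> opnd;    // opnd0 .. opnd3
    };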
Motivation for VISC Architecture
1. Efficient way to encode/decode the many operation
variants with different addressing modes
• Not all used in each program
2. High instruction encoding density
• Small opcode bit count
• Operands shared among multiple operations
3. Simplified control logic for VLIW-style ILP
• Up to 8 operations per cycle
Operation Specification
In Dictionary Entry (only specified once):
1. Operation name
2. Operation variants:
• Signed and unsigned
• Operand and result sizes — 8-bit, 16-bit, 32-bit, 64-bit
• Support different sizes among operand(s) or result
• Vector — 64v8, 64v16, 64v32, 32v8, 32v16
3. Data path to each operand/result
In Instruction:
1. Operands’ encoding formats
2. Actual operands
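A hypothetical rendering of this split between what the dictionary entry specifies once and what each instruction carries; every name, enum value, and field width below is an assumption added for illustration:

    #include <cstdint>

    // Specified once, in the dictionary entry, for each operation.
    enum class Vec : uint8_t { None, V64x8, V64x16, V64x32, V32x8, V32x16 };

    struct OperationSpec {
        const char* name;          // operation name, e.g. "add"
        bool        is_signed;     // signed vs. unsigned variant
        uint8_t     operand_bits;  // 8, 16, 32, or 64
        uint8_t     result_bits;   // may differ from the operand size
        Vec         vector_mode;   // vector variants such as 64v8, 64v16, ...
        uint8_t     data_path[3];  // data path chosen for each operand/result
    };

    // Supplied by each instruction that uses the entry.
    struct OperandField {
        uint8_t  format;           // the operand's encoding format
        uint32_t value;            // the actual operand
    };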
RCU Architecture
• 5 Stage Pipeline
• 4-way multi-threaded
• Hardware RSF synchronization
• 128-bit reconfigurable address path
• 256-bit reconfigurable data path
[RCU block diagram: dictionary and dictionary decode; four source-route units feeding four 64-bit execution units; pointer file; packet buffers; registers and scratch memory; data memory; instruction cache; address calculation; pipeline & thread control; data-flow synchronization over the RSF connector and "back-side" ports, with 64/128/256-bit paths]
Roles of Compiler for VISC Architecture
1. Determine the best instruction set to store in the dictionary for the best execution-time performance
2. Generate an optimized code sequence based on that instruction set
3. Cater to various hardware limitations:
• Dictionary limit
• Data path constraints
• Dictionary and Instruction encoding constraints
New Compilation Approach:
Configurable Code Generation
• Exact form of the generated instructions is decided in the final instruction scheduling phase
• Instructions are the direct result of instruction compaction, based on what the hardware allows
Compiler Implementation Method
• Retarget SGI Pro64 (Open64) compiler to an
Abstract Machine
• Code generator operates on an Abstract Operation
Representation
– Code generation optimizations left intact
• Add new Instruction and Dictionary Finalization (IDF)
phase as post-pass
IDF Phase 1:
– Instruction scheduling and folding
– Abstract operations converted to target code sequence
IDF Phase 2:
– Output VISC instructions and dictionary entries
Compiler Phase Structure
C source → GNU / Pro64™ front-end → Pro64™ back-end (WHIRL Optimizer → Code Generator → IDF) → Assembly program: instructions + dictionary
Abstract Operation Representation
(AOR)
Each operation corresponds to a micro-operation in the
core execution units
• RISC-like formats
– r1 = op r2, r3
– r2 = load <offset>(<base>)
– store r2 <offset>(<base>)
– r1 = loadimm <imm>
• Optimizations in AOR reflected in final code
• No predisposition of the compiler toward any specific instruction format
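A minimal sketch of what an abstract operation might look like inside the code generator; the real Pro64/Open64 data structures differ, so these names and fields are assumptions:

    #include <cstdint>
    #include <vector>

    // RISC-like abstract operation (AOR), e.g.  r1 = op r2, r3
    struct AbstractOp {
        enum Kind { Compute, Load, Store, LoadImm } kind;
        const char*      opname;     // "add", "xor", ... (for Compute)
        int              result;     // result register, or -1 (e.g. for Store)
        std::vector<int> sources;    // source registers
        int64_t          immediate;  // immediate value or memory offset, if any
    };

    // A basic block is a sequence of abstract operations that IDF later
    // schedules and compacts into VISC instructions and dictionary entries.
    using BasicBlock = std::vector<AbstractOp>;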
Multiple AOR ops can be combined into a single target operation
Operations taking an immediate operand:
    r2 = move <imm>
    r3 = add r1, r2
  =>
    r3 = addi r1, <imm>

Operations supporting memory operands:
    r2 = load 4(sp)
    r3 = add r1, r2
  =>
    r3 = add r1, 4(sp)

Post-increment/decrement memory operations:
    r2 = load 0(r1)
    r1 = addi r1, 4
  =>
    r2 = load 0(r1++)

Branches on condition codes:
    r1 = add r2, r3
    ...
    compare (r1 != 0)
    br.z label
  =>
    r1 = add r2, r3
    ...
    br.z label        (only if immediately after)

Others
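As an illustration of the first kind of folding above (a move of an immediate feeding an add becomes an addi), here is a hedged sketch; it is not the actual IDF code, the struct and helper names are invented, and the legality check on other uses of the moved register is omitted:

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Simplified abstract op for this sketch.
    struct AOp {
        const char* op;     // "move", "add", "addi", ...
        int         dst;    // destination register
        int         src0;   // first source register  (-1 if unused)
        int         src1;   // second source register (-1 if unused)
        int64_t     imm;    // immediate value (for "move" / "addi")
    };

    // Fold  rX = move <imm> ; rY = add rZ, rX  into  rY = addi rZ, <imm>.
    static bool fold_move_into_add(std::vector<AOp>& ops, std::size_t i) {
        if (i + 1 >= ops.size()) return false;
        AOp& mv  = ops[i];
        AOp& add = ops[i + 1];
        if (std::strcmp(mv.op, "move") != 0 || std::strcmp(add.op, "add") != 0)
            return false;
        if (add.src1 != mv.dst)          // immediate must feed the add's 2nd source
            return false;
        add.op   = "addi";               // rY = addi rZ, <imm>
        add.src1 = -1;
        add.imm  = mv.imm;
        ops.erase(ops.begin() + i);      // the move is subsumed
        return true;
    }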
IDF Approach
Instruction scheduling + following tasks:
– Instruction folding
– Opcode selection
– Modelling of irregular hardware constraints
– Modelling of encoding constraints
– Monitoring of states of condition codes and
transient registers
– Keeping track of dictionary contents
Use an enumeration (branch-and-bound) approach
Example of IDF Processing
Input:
    $w80 = move 0x55
    $w91 = move 0xf8
    $w70 = add $w70, $w80
    $w71 = xor $w92, $w80
    $w90 = sub $w92, $w91
    store 8($p1) = $w90

Dictionary entry 3:     add  xor  sub  nop

Resulting instruction:  op3  8($p1)  $w70  0x55  0xf8

• move and store instructions subsumed
• $w71, $w92 mapped to transient registers
IDF Scheduling Algorithm
Input: sequence of operations in the basic block (BB)

1. Estimate an initial bound_sch
2. Search for a schedule with length <= bound_sch
3. If the search succeeds, done; otherwise set bound_sch = bound_sch + 1 and repeat step 2

To speed up the search, shrink the solution space by:
– Coming up with a high initial bound_sch
– Pruning useless search paths continuously
  • Tight hardware constraints help
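A heavily simplified, hypothetical sketch of this branch-and-bound search: find a schedule no longer than bound_sch, and grow the bound only when the search fails. Real IDF also models data paths, encodings, condition codes, ordering constraints, and the dictionary, and prunes far more aggressively; every name and the legality test below are assumptions:

    #include <cstddef>
    #include <vector>

    struct Op { /* abstract operation (placeholder) */ };
    using Slot     = std::vector<Op>;       // operations packed into one instruction
    using Schedule = std::vector<Slot>;

    // Placeholder legality test: a real one models data paths, operand sharing,
    // encoding limits, dependences, and the dictionary budget.
    static bool fits_in_slot(const Slot& slot, const Op&) { return slot.size() < 8; }

    // Placeholder lower bound: ceil(#ops / 8), since an instruction holds <= 8 ops.
    static std::size_t initial_bound(const std::vector<Op>& bb) {
        return (bb.size() + 7) / 8;
    }

    // Depth-first enumeration: try to place ops[idx..] into at most `bound` slots.
    static bool place(const std::vector<Op>& ops, std::size_t idx,
                      Schedule& sched, std::size_t bound) {
        if (idx == ops.size()) return true;              // everything scheduled
        for (std::size_t s = 0; s < bound; ++s) {
            bool created = false;
            if (s == sched.size()) { sched.emplace_back(); created = true; }
            if (fits_in_slot(sched[s], ops[idx])) {      // prune illegal placements
                sched[s].push_back(ops[idx]);
                if (place(ops, idx + 1, sched, bound)) return true;
                sched[s].pop_back();                     // backtrack
            }
            if (created) sched.pop_back();
        }
        return false;
    }

    // Outer loop from the flowchart: increase bound_sch until a schedule is found.
    Schedule idf_schedule(const std::vector<Op>& bb) {
        for (std::size_t bound = initial_bound(bb); ; ++bound) {
            Schedule sched;
            if (place(bb, 0, sched, bound)) return sched;   // succeed? -> done
        }
    }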
Managing the Dictionary
• Dictionary usage increases due to:
– Program size: more variety of operations
– High ILP: more combinations of operations
– Library code linked in
• Currently, dictionary contents fixed for each executable
• Role of linker:
– Merge dictionary entries with identical contents across
files/libraries
– Error message on dictionary overflow
• Role of compiler:
– Maximize dictionary entry re-use
Dictionary Compilation
Strategy:
• Keep track of existing dictionary entries during compilation
– Extract dictionary entries from:
• Libraries and .s files being linked
• .o files compiled before current file
Example: cc a.c b.o c.s
– Maintain table of existing dictionary entries
– Add to table as new entries are generated
• Re-use existing dictionary entries
• Bias scheduling towards dictionary conservation as
dictionary fills up
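As a hedged sketch of this strategy, the table of existing dictionary entries might be kept as a simple map from entry contents to entry number; the slides do not describe the compiler's actual data structures, so everything here is illustrative:

    #include <map>
    #include <string>
    #include <vector>

    // A dictionary entry is identified by the operations it contains, in order.
    using EntryContents = std::vector<std::string>;   // e.g. {"add", "xor", "sub", "nop"}

    class DictionaryTable {
        std::map<EntryContents, int> index_;          // contents -> entry number
        int next_entry_ = 0;
    public:
        // Pre-load entries extracted from libraries, .s files, and earlier .o files.
        void preload(const EntryContents& e) { lookup_or_add(e); }

        // Re-use an identical existing entry; otherwise allocate a new one.
        // Returns -1 when the 256-entry dictionary is full (linker reports overflow).
        int lookup_or_add(const EntryContents& e) {
            auto it = index_.find(e);
            if (it != index_.end()) return it->second;   // re-use existing entry
            if (next_entry_ >= 256) return -1;           // dictionary overflow
            index_[e] = next_entry_;
            return next_entry_++;
        }

        int entries_in_use() const { return next_entry_; }   // used to bias scheduling
    };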
User Control of Dictionary Compilation
Best program performance demands a near-full dictionary.
When the dictionary overflows, re-compilation is needed.
Provide user control mechanisms:
– Trade-off between dictionary consumption and program
performance
– Command-line option: -CG:dict_usage=n  (n = 0…10)
– Embedded in code: #pragma dict_usage n
dict_usage is a dictionary budget guideline for IDF
– Low dict_usage:
• Fewer new dictionary entries created
• Lower ILP
– High dict_usage:
• Tighter instruction schedule
• More dictionary entries created
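The two control mechanisms might be used as follows; the option and pragma spellings are taken from the slide, while the file name, the surrounding functions, and the assumption that the pragma applies to the code that follows it are illustrative:

    // On the command line (hypothetical file name):
    //   cc -CG:dict_usage=8 fastpath.c

    #pragma dict_usage 10          // hot code: allow more entries, tighter schedules
    void process_packet(void) {
        /* ... performance-critical inner loop ... */
    }

    #pragma dict_usage 2           // cold code: conserve dictionary entries
    void report_error(void) {
        /* ... rarely executed ... */
    }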
IDF Support of dict_usage
Additional search goal: bound_dict
– The number of new dictionary entries allowed for the current BB
– Automatically adjusted lower when more pre-existing entries are available
When bound_dict is reached during enumeration, creating a new dictionary entry is disallowed (unless it contains a single operation)
[Chart: static instruction count and dictionary entry count at dict_usage = 10, 8, 3, 2, 0]
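A small, hypothetical sketch of the bound_dict check described above; the slide does not give the adjustment policy, so the function name and condition are assumptions:

    // May the enumeration create another dictionary entry for this basic block?
    // bound_dict is derived from dict_usage and lowered as pre-existing entries
    // become available for re-use.
    static bool may_create_entry(int new_entries_in_bb, int bound_dict,
                                 int ops_in_entry) {
        if (ops_in_entry == 1) return true;        // single-operation entries allowed
        return new_entries_in_bb < bound_dict;     // otherwise respect the budget
    }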
Experimental Results
Summary (with dict_usage=10):
• ILP from IDF scheduling: 1.38 ops per instruction
• ILP from relaxed scheduling: 1.51 ops per instruction
• 23% of all subsumable operations subsumed
• Each dictionary entry referred to by 2.63 instructions
(statically)
• Scheduling via enumeration: 100 times slower than
one-pass schedulers
• Compilation time: 1 to 2 minutes per program
Concluding Remarks
• VISC approach most suitable for embedded processors
– Limited program size
– Dictionary space less of an issue
– Slow compilation tolerable
– CISC-style instructions enable small code size
• Compilation support key to deploying applications on VISC
– Very hard to write in assembly language
– Advanced optimizations performed by compiler
– Dictionary managed by compiler with user hints
• Compile-time configurable code generation enables RISC
compilation techniques to generate CISC output