Download Compiler Optimizations for Modern Hardware Architectures - Part II

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Transcript
Compiler Optimizations for Modern
Hardware Architectures - Part II
Compiling for the Intel® Itanium® –
A Processor of EPIC Proportions
Bob Wall
CS 550 (Fall 2003) Class Presentation
Acknowledgements
• As part of the presentation, I refer to slides from a
presentation by Dr. Yong-fong Lee of Intel, "An Overview
of IA-64 Architectural Features and Compiler
Optimization“, presented at the 14th International
Conference on Parallel and Distributed Computing
Systems (PDCS-2001), Aug. 8-10, 2001, Richardson, TX.
• A link to the PDF file is on my presentation site – and here
too:
http://www.cs.montana.edu/~bwall/cs550/ovrvw_ia64_arch_features_cmpl_opt.pdf
What is EPIC?
• Explicitly Parallel Instruction-set Computing – a new
hardware architecture paradigm that is the successor to
VLIW (Very Long Instruction Word).
• EPIC features:
–
–
–
–
–
–
Parallel instruction encoding
Flexible instruction grouping
Large register file
Predicated instruction set
Control and data speculation
Compiler control of memory hierarchy
History
• A timeline of architectures:
Hard-wired
discrete logic
Microprocessor
(advent of VLSI)
CISC
(microcode)
RISC
(microcode bad!)
EPIC
Superpipelined,
Superscalar
VLIW
• Hennessy and Patterson were early RISC proponents.
• VLIW (very long instruction word) architectures, 1981.
– Alan Charlesworth (Floating Point Systems)
– Josh Fisher (Yale / Multiflow – Multiflow Trace machines)
– Bob Rau (TRW / Cydrome – Cydra-5 machine)
History (cont.)
• Superscalar processors
– Tilak Agerwala and John Cocke of IBM coined term
– First microprocessor (Intel i960CA) and workstation (IBM
RS/6000) introduced in 1989.
– Intel Pentium in 1993.
• Foundations of EPIC - HP
– HP FAST (Fine-grained Architecture and Software Technologies)
research project in 1989 included Bob Rau; this work later
developed into the PlayDoh architecture.
– In 1990 Bill Worley started the PA-Wide Word project (PA-WW).
Josh Fisher, also hired by HP, made contributions to these projects.
– HP teamed with Intel in 1994 – announced EPIC, IA-64
architecture in 1997.
– First implementation was Merced / Itanium, announced in 1999.
IA-64 / Itanium
• IA-64 is not a complete architecture – it is more a set of high-level
requirements for an architecture and instruction set, co-developed by
HP and Intel.
• Itanium is Intel’s implementation of IA-64. HP announced an IA-64
successor to their PA-RISC line, but apparently abandoned it in favor
of joint development of the Itanium 2.
• Architectural features:
– No complex out-of-order logic in processor – leave it for compiler
– Provide large register file, multiple execution units (again, compiler has
visibility to explicitly manage)
– Add hardware support for advanced ILP optimization: predication,
control and data speculation, register rotation, multi-way branches,
parallel compares, visible memory hierarchy
– Load / store architecture – no direct-to-memory instructions
Hardware Resources
• General (user-accessible) registers
– 128 64-bit general registers; each has 1-bit NaT flag attached
(GR0 hardwired to 0)
– 128 82-bit floating-point registers
– 64 1-bit predicate registers (p0 hardwired to 0)
– 8 64-bit branch registers
• Special application registers
– Current Frame Marker (CFM), Loop Counter (LC), Epilog
Counter (EC)
• Advanced Load Address Table (ALAT)
• Register Stack Engine (RSE)
IA-64 Instruction Formats
• Instruction Bundle – three instructions per 128-bit “word”
Instr 2 – 41 bits
Instr 1 – 41 bits
Instr 0 – 41 bits
template – 5 bits
• Template indicates how each instruction is used
– Instructions are divided into types:
•M (memory)
•I (integer)
•B (branch)
•F (float)
•L+X (long immed.)
– Predefined templates
•Regular – MII, MLX, MMI,
MFI, MMF
•Stop – MI_I, M_MI (treated
as separate groups)
•Branch – MIB, MMB, MFB, MBB, BBB
•All come with or without stop at end
Instruction Groups
• Sequence of instructions with no register dependencies
(can all be executed in parallel)
– Write after read dependencies are allowed
– Memory operations still require sequential execution
• Stops delimit groups, so a group of parallel instructions
can span bundles (allows new hardware implementation to
add functional units – it would still be able to execute code
compiled for a smaller number of units).
– Itanium 2 can issue up to six instructions simultaneously
Predication
• The IA-64 includes 64 special-purpose registers that can be
used to predicate most of the normal instructions. For
example, this allows the following conversion:
if (a > b) then
c=c+a
else
c=c+b
cmp.lt p1, p2 = a, b
(p1) c = c + a
(p2) c = c + b
• Most special compare / test instructions write the result and
its complement into two different predicate registers
• Predication converts control dependencies into data
dependencies, allowing more scheduling options
Predication (cont.)
• The removal of branches has the following effects:
– Reduces branch mis-predicts and pipeline bubbles
– Improves instruction fetch efficiency
– Better utilizes wide-issue instruction efficiency
• The down-side
– Instruction cache “polluted” with non-executed code
– Execution units used to execute instructions with false predicates
– Need to determine when it will be beneficial – determine hard-topredict branches
Parallel Compares
• Used in conjunction with predication – allows multiple
checks to be executed in a single instruction group
• Add .and and .or suffixes to comparison instructions
• Example:
– If ( A & B & C )
{ S };
– cmp.eq p1 = r0,r0 ;; // initialize p1=1
cmp.ne.and p1 = rA,0
cmp.ne.and p1 = rB,0
cmp.ne.and p1 = rC,0
(p1) S
Multi-way Branches
• Also used in conjunction with predication – allows
multiple branch targets within a single instruction group
• Example:
– (p1) br.cond target_1
(p2) br.cond target_2
(p3) br.call b1
• Takes the branch associated with the first true condition, or
falls through if no predicate set
Control Speculation
• The IA-64 instruction set allows speculative loads – load a
register with a value from memory, but do not generate any
exceptions. Exceptions can be deferred; when the program
is ready to use the register, it executes an instruction that
checks - if an exception was generated, it branches to
compiler-generated patch-up code to re-execute the load
and jump back to the point where execution should
continue.
• Allows “hoisting” of loads above branches – provides
more options for concurrent instruction scheduling. In
conjunction with parallel instruction execution, masks
memory latency.
Control Speculation (cont.)
• Hardware implementation
– Add extra NaT bit to each general register, special NaTVal value
for floating point register
– Add speculative load (ld.s) instruction (sets NaT if exception
occurs), and speculation check (chk.s) instruction to determine if
exception occurred earlier (conditional branch if NaT).
– NaT propagates through operations
• The down-side
–
–
–
–
Increased code size
Extended live range for register
Might unnecessarily execute ld.s
Don’t speculatively move load out of unlikely block
Data Speculation
• The IA-64 instruction set also allows advanced loads –
load a register from memory, but remember that it might
contain a value that is subsequently overwritten by a store.
The program can later check to see if the overwrite took
place – if so, it can reload the value and continue.
• Allows “hoisting” of loads above stores – also provides
more opportunities for parallelism and masking of memory
latency.
Data Speculation (cont.)
• Hardware implementation
– Add advance load (ld.a) instruction – perform the load, but store
the physical address and target register in the ALAT. (Also
combined ld.sa.)
– Add check load (ld.c) instruction – check whether the specified
address / register are in the ALAT. If not, branch to compilergenerated patch-up code to re-execute the load and jump back.
– ALAT is an associative lookup table. Note that it works on
physical addresses, rather than virtual ones.
• The down-side
– Increased code size
– Extended live range for register
– Need to avoid for likely conflicts
Software Pipelining
• Similar to hardware pipelining – overlap executions of
different pieces of a loop. Requires multiple execution
units.
• Problem with loop unrolling / pipelining – requires a lot of
registers.
• Hardware support:
– Rotating predicate registers to embed the prolog and epilog of the
software pipeline within the main loop – minimizes code
expansion.
– Rotating registers can be used to avoid register renaming within
unrolled loops, further minimizing code expansion
– Loop Count (LC) and Epilog Count (EC) registers, special loop
branch instruction to rotate registers
Software Pipelining (cont.)
• Modulo scheduling algorithm
– Divide loop into prolog (filling pipeline), kernel (steady state), and
epilog (flush pipeline)
• Issues
– Loops with multiple exits (either need to convert to single exit or
generate additional epilogs)
– Complicates control and data speculation
The Register Stack
• Portion of process address space set aside as backing store
for registers – spill to this area.
• Processor supports a register stack frame similar to normal
memory stack frame.
– First 32 registers are static. Remaining 96 form the register stack.
– Current Frame Marker (CFM) and Top of Stack (TOS) registers
used to manipulate stack frames
– Called routine calls alloc instruction to indicate how many
registers it requires. This allocates the next frame and updates the
registers.
– Code uses virtual register names (i.e. always uses register 32 as the
first register in its frame) – processor maps to physical register
The Register Stack (cont.)
– Register frame divided into fixed region, rotating region (if
needed), and output region (for passing parameters to other
functions)
– The incoming / outgoing registers allow up to 8 parameters – can
eliminate any memory access for procedure calls.
• When register stack overflows (wraps around all 96
registers), the Register Stack Engine (RSE) automatically
stalls the processor and spills registers to the backing store.
• The RSE also automatically fills registers from backing
store when room is available.
• Flexibility in IA-64 spec to make RSE more aggressive –
can asynchronously spill / fill when memory bandwidth
available.
Memory Hierarchy Control
• Compiler can insert prefetch hints (lfetch) in instruction
stream.
• Load, store, and prefetch instructions can include postincrement – will also cause cache load of line containing
post-incremented address
What’s a Compiler to Do?
• Many new architectural features are based on improving
opportunities for ILP. Obviously, there are high
expectations for the compiler to perform good instruction
scheduling.
• Scheduler should probably drive many other optimizations.
• Tradeoff between ILP and code expansion.
• The choice of making the compiler do a lot of the work is
probably a good one – it has much more context available
to make decisions. However, it lacks the knowledge of the
dynamic run-time environment.
Conclusions
•
•
•
•
x86 architecture – NOT COOL!!!
IA-64 architecture – WAY COOL!!!
Could fuel a new round of “compiler wars”?
Future directions (some speculation on my part):
– Out of order execution in hardware
– Possibly introduce some dynamic optimization capability – base
speculation on runtime profile (i.e. remove speculation if inside a
rarely executed block)
And Finally …
•
Q: How many Intel architects does it take to change a lightbulb ?
A: None, they have a predicating compiler that eliminates lightbulb
dependencies. If the dependencies are not entirely eliminated, they have four
levels of prediction to determine if you need to replace the lightbulb.
•
Q: How many AMD architects does it take to change a lightbulb?
A: I have no idea, but they'll do anything to stick twice as many lightbulbs into
the same socket and ensure that it glows brighter than the bulb Intel replaced.
•
Q: How many Apple PowerPC architects does it take to change a
lightbulb?
A: Our Power G4 processor at 733MHz, with the phenomenally powerful
velocity engine(TM) technology, has the power to burn. It'll burn CDs, DVDs,
and probably light up your room as well. You don't need a lightbulb. Oh, by
the way, it'll burn Pentiums too, so you don't need Intel to replace your
lightbulb.
(Borrowed from http://www.ece.virginia.edu/~gjp5j/lightnote.htm)