Intel Itanium Architecture
Alex Crawford
Matt Ofalt
Brief History
● Merced - 2001
○ Slower than competing RISC and CISC
● McKinley (Itanium 2) - 2002
○ Fixed many of the performance problems on Merced
● Montecito (Itanium 2 9000) - 2006
○ Dual-core, roughly doubled performance
● Tukwila (Itanium 9300) - 2010
○ Quad-core, memory error correction
○ Shares its chipset with Nehalem
Itanium Overview
● 64-bit (path, data, address space)
● Explicit instruction-level parallelism (EPIC, a VLIW-style design)
○ Static "superscaling"
● Compiler
○ Predication
○ Speculation
○ Branch Prediction
● 128 integer registers, 128 FP registers
● 30 functional execution units
Compilers
● Very difficult to write
○ Predication
○ Speculation
○ Branch Prediction
● This is the reason the architecture is failing, but...
● Allows for huge improvements
● We like assembly better anyway, right?
IA-64 Instructions
● Issued in 128-bit "bundles"
● Three 41-bit instructions per bundle
● Template tells CPU which instructions execute in parallel
○ Not constrained to just one bundle (8 inst. in parallel)
● Six instruction types
○ A - Integer ALU - I/M unit
○ I - Non-ALU integer - I unit
○ M - Memory - M unit
○ F - Floating-point - F unit
○ B - Branch - B unit
○ X - Extended - I/B unit
Execution Units
● I-Unit
○ Integer arithmetic
○ Shift and add
○ Logical
● M-Unit
○ Load and Store
○ Basic integer ALU operations
● B-Unit
○ Branches
● F-Unit
○ Floating point
IA-64 Assembly
[(qp)] mnemonic[.comp] dest = src [;;] [// comment]
(p0) cmp.eq p1,p2=5,r7 // conditional 5 == r7
qp - qualifying predicate (1-bit predicate register)
mnemonic - name of instruction
comp - instruction completer
dest - one or more destination operands
src - one or more source operands
;; - instruction group stops
// - comment
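The syntax above can be sketched as a small parser. This is a minimal illustration, assuming only the simplified line shape shown on this slide (optional predicate in parentheses, mnemonic, completers, `dest = src`, optional stop, optional comment). `LINE_RE` and `parse_line` are hypothetical names, and instructions without an `=` (such as branches) are not handled.

```python
import re

# Hypothetical pattern for the simplified syntax on this slide only.
LINE_RE = re.compile(
    r"""^\s*
    (?:\((?P<pred>p\d+)\)\s*)?      # optional predicate register, e.g. (p1)
    (?P<mnemonic>[a-z][a-z0-9]*)    # instruction name, e.g. cmp
    (?P<comp>(?:\.[a-z0-9]+)*)      # zero or more completers, e.g. .eq
    \s+
    (?P<dest>[^=]+?)\s*=\s*         # destination operand(s)
    (?P<src>[^;/]+?)\s*             # source operand(s)
    (?P<stop>;;)?\s*                # optional instruction group stop
    (?://\s*(?P<comment>.*))?$      # optional comment
    """,
    re.VERBOSE,
)


def parse_line(line):
    """Split one assembly line into the components named on the slide."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")

    def operands(text):
        return [t.strip() for t in text.split(",")]

    return {
        "predicate": m["pred"],
        "mnemonic": m["mnemonic"],
        "completers": m["comp"].split(".")[1:],
        "dest": operands(m["dest"]),
        "src": operands(m["src"]),
        "stop": m["stop"] is not None,
        "comment": m["comment"],
    }
```

For example, `parse_line("sub r4 = r10, r11 ;;")` reports no predicate, the mnemonic `sub`, two source operands, and `stop` set to `True`.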
Assembly Example
ld8 r2 = [r3]
sub r4 = r10, r11 ;;
add r5 = r2, r6
st8 [r4] = r7 ;;
add r2 = r2, 1 ;;
st8 [r2] = r5
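The `;;` stops split the example into instruction groups, and instructions inside one group must not depend on each other. A rough sketch of that rule follows, assuming only unpredicated `dest = src` lines like the ones above; `split_groups`, `reads_writes` and `group_is_independent` are hypothetical helpers, and the check is a simplification of the real dependency rules.

```python
import re

EXAMPLE = """\
ld8 r2 = [r3]
sub r4 = r10, r11 ;;
add r5 = r2, r6
st8 [r4] = r7 ;;
add r2 = r2, 1 ;;
st8 [r2] = r5
"""


def split_groups(asm):
    """Return a list of instruction groups, splitting after each ';;' stop."""
    groups, current = [], []
    for line in asm.splitlines():
        line = line.split("//")[0].strip()   # drop comments
        if not line:
            continue
        stop = line.endswith(";;")
        current.append(line[:-2].strip() if stop else line)
        if stop:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def reads_writes(instr):
    """Return (registers read, registers written) for one 'op dest = src' line."""
    dest, src = instr.split("=", 1)
    dest = dest.split(None, 1)[1]            # drop the mnemonic
    regs = lambda text: set(re.findall(r"r\d+", text))
    if "[" in dest:                          # store: the address register is read
        return regs(dest) | regs(src), set()
    return regs(src), regs(dest)


def group_is_independent(group):
    """True if no instruction touches a register written earlier in the group."""
    written = set()
    for instr in group:
        reads, writes = reads_writes(instr)
        if written & (reads | writes):
            return False
        written |= writes
    return True
```

Under these assumptions the example splits into four groups, each of which passes the check. Moving `add r5 = r2, r6` into the first group would fail it, because `r2` is written by the load in that same group.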
IA-64 Instruction Format
128-Bit Bundle
| Instruction 1 (41 bits) | Instruction 2 (41 bits) | Instruction 3 (41 bits) | Template (5 bits) |
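A bundle can be taken apart with shifts and masks. The sketch below treats a bundle as a 128-bit Python integer and assumes the architected layout, in which the template occupies the lowest 5 bits and the three slots follow at bits 5-45, 46-86 and 87-127. `decode_bundle` and `encode_bundle` are hypothetical helper names.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1


def decode_bundle(bundle):
    """Split a 128-bit bundle into (template, [slot0, slot1, slot2])."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + SLOT_BITS * i)) & SLOT_MASK
        for i in range(3)
    ]
    return template, slots


def encode_bundle(template, slots):
    """Inverse of decode_bundle: pack a template and three slots into one int."""
    bundle = template
    for i, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + SLOT_BITS * i)
    return bundle
```

Bundles are stored little-endian in memory, so 16 raw bytes could be converted first with `int.from_bytes(raw, "little")`.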
41-Bit Instruction
| Major Opcode (4 bits) | Modifying Bits (10 bits) | GR3 (7 bits) | GR2 (7 bits) | GR1 (7 bits) | PR (6 bits) |
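The 41-bit layout can be unpacked field by field. This sketch assumes the slide lists the fields from most significant to least significant, so the predicate register sits in the lowest 6 bits and the major opcode in the highest 4. That matches the register-to-register ALU form; other instruction formats reuse some of these bits differently. The field names and both helpers are hypothetical.

```python
# (name, width) pairs, least significant field first.
FIELDS = [
    ("pr", 6),
    ("gr1", 7),
    ("gr2", 7),
    ("gr3", 7),
    ("modifying", 10),
    ("opcode", 4),
]


def decode_instruction(word):
    """Split a 41-bit instruction word into its named fields."""
    fields = {}
    for name, width in FIELDS:
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


def encode_instruction(**values):
    """Pack named fields into a 41-bit word; missing fields default to 0."""
    word, shift = 0, 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word
```

The 7-bit register fields are what allow 128 general registers to be addressed, and the 6-bit predicate field selects one of 64 predicate registers.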
Template Field
Template  Slot 1  Slot 2  Slot 3    Template  Slot 1  Slot 2  Slot 3
00000     M       I       I         01110     M       M       F
00001     M       I       I         01111     M       M       F
00010     M       I       I         10000     M       I       B
00011     M       I       I         10001     M       I       B
00100     M       L       X         10010     M       B       B
00101     M       L       X         10011     M       B       B
01000     M       M       I         10110     B       B       B
01001     M       M       I         10111     B       B       B
01010     M       M       I         11000     M       M       B
01011     M       M       I         11001     M       M       B
01100     M       F       I         11100     M       F       B
01101     M       F       I         11101     M       F       B
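The template table amounts to a lookup from the 5-bit template value to the unit type of each slot. A minimal sketch, with `TEMPLATES` and `slot_units` as hypothetical names; encodings missing from the table are treated as reserved. Rows that list the same slot types differ in where instruction group stops fall, which this table does not show.

```python
# Each listed encoding and its successor (base, base + 1) share the same slot types.
_ROWS = [
    (0b00000, "MII"), (0b00010, "MII"), (0b00100, "MLX"),
    (0b01000, "MMI"), (0b01010, "MMI"), (0b01100, "MFI"),
    (0b01110, "MMF"), (0b10000, "MIB"), (0b10010, "MBB"),
    (0b10110, "BBB"), (0b11000, "MMB"), (0b11100, "MFB"),
]

TEMPLATES = {}
for base, units in _ROWS:
    TEMPLATES[base] = TEMPLATES[base + 1] = tuple(units)


def slot_units(template):
    """Return the unit type of each slot, e.g. ('M', 'I', 'B')."""
    try:
        return TEMPLATES[template]
    except KeyError:
        raise ValueError(f"template {template:05b} is not in the table") from None
```

For instance, `slot_units(0b10000)` gives `('M', 'I', 'B')`: a memory operation, an integer operation and a branch in one bundle.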
Branching on x86
if (G_LIKELY(random() != 1))
    printf("not one");

    call   8048440 <random@plt>
    cmp    $0x1,%eax
    je     8048524 <main+0x20>
    mov    $0x80485f0,%eax
    mov    %eax,(%esp)
    call   8048410 <printf@plt>

if (G_UNLIKELY(random() != 1))
    printf("not one");

    call   8048440 <random@plt>
    cmp    $0x1,%eax
    jne    8048524 <main+0x1B>
    mov    $0x0,%eax
    leave
    ret
Branching on IA-64
// random() -> r14
// not_ones -> r31
// ones -> r32
if (random() != 1)
    not_ones++;
else
    ones++;

cmp.ne p1,p2=1,r14 ;; // p1 = (r14 != 1), p2 = !p1
(p1) adds r31=1,r31 // not_ones++
(p2) adds r32=1,r32 // ones++
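The idea behind the predicated version can be modeled without any branch: one compare produces two complementary predicates, both increments are issued, and each one commits only if its predicate is true. This is a toy model of the if/else above, not of the hardware; `predicated` and `count` are hypothetical names.

```python
def predicated(pred, new_value, old_value):
    """Commit new_value only when pred is true, like a (pN)-guarded instruction."""
    return new_value if pred else old_value


def count(values):
    """Count values that are not 1 and values that are 1, without an if/else."""
    not_ones = ones = 0
    for r14 in values:
        p1 = r14 != 1          # true when random() != 1
        p2 = not p1            # complementary predicate
        # Both updates are always issued; the predicates decide which commits.
        not_ones = predicated(p1, not_ones + 1, not_ones)
        ones = predicated(p2, ones + 1, ones)
    return not_ones, ones
```

Because exactly one of the two predicates is true on each pass, exactly one counter changes, which is what the if/else does with a branch.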
Data Speculation on IA-64
ld8.a r6 = [r8] ;;
// other stuff
ld8.c r6 = [r8]
add r5 = r6, r7 ;;
st8 [r18] = r5
Data Speculation on IA-64 (cont.)
ld8.a r6 = [r8] ;;
// other stuff
add r5 = r6, r7 ;;
// more stuff
chk.a r6, dirty
origin:
st8 [r18] = r5

dirty:
ld8.a r6 = [r8] ;;
add r5 = r6, r7
br origin
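Both data speculation slides rely on the hardware remembering advanced loads in the Advanced Load Address Table (ALAT) and dropping an entry when a later store hits the same address. The sketch below is a toy model of that bookkeeping, assuming exact address matching and unlimited capacity, which real hardware does not guarantee. `Alat` and `run` are hypothetical names.

```python
class Alat:
    """Toy model of the Advanced Load Address Table."""

    def __init__(self, memory):
        self.memory = memory   # address -> value
        self.entries = {}      # register name -> address of its advanced load

    def advanced_load(self, reg, addr):
        """ld.a: load early and remember which address the register came from."""
        self.entries[reg] = addr
        return self.memory[addr]

    def store(self, addr, value):
        """st: write memory and drop any entry that matches the address."""
        self.memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check(self, reg):
        """chk.a: True if the advanced load is still valid, else recovery must run."""
        return reg in self.entries

    def check_load(self, reg, addr, value):
        """ld.c: keep the speculative value if still valid, otherwise reload."""
        return value if self.entries.get(reg) == addr else self.memory[addr]


def run(conflicting_store):
    """Replay the ld8.a / ld8.c sequence with or without a conflicting store."""
    alat = Alat({0x8: 10, 0x10: 0})
    r7 = 5
    r6 = alat.advanced_load("r6", 0x8)                  # ld8.a r6 = [r8]
    alat.store(0x8 if conflicting_store else 0x10, 99)  # the "other stuff"
    r6 = alat.check_load("r6", 0x8, r6)                 # ld8.c r6 = [r8]
    return r6 + r7                                      # add r5 = r6, r7
```

Without a conflicting store the early value is kept and the load latency is hidden. With one, the check reloads the value, so the result is still correct, only slower.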
Data Speculation on x86
???
Rotating Register Stack
● r32-r127 can rotate ("register renaming")
● software-pipelined loops (less need for loop unrolling)
● parameter passing
● overflows to memory
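Register rotation can be pictured as an offset added to the register number before the physical register is selected. A minimal sketch, assuming the whole r32-r127 range rotates as the slide states (the size of the rotating region is configurable on real hardware). `RotatingRegisters` and its methods are hypothetical names.

```python
class RotatingRegisters:
    """Toy model of register rotation over r32-r127."""

    FIRST = 32
    SIZE = 96   # r32 through r127

    def __init__(self):
        self.physical = [0] * self.SIZE
        self.rrb = 0   # rotating register base

    def _index(self, reg):
        if not self.FIRST <= reg < self.FIRST + self.SIZE:
            raise ValueError("only r32-r127 rotate in this model")
        return (reg - self.FIRST + self.rrb) % self.SIZE

    def write(self, reg, value):
        self.physical[self._index(reg)] = value

    def read(self, reg):
        return self.physical[self._index(reg)]

    def rotate(self):
        """Decrement the base, so the value in rN is now visible as rN+1."""
        self.rrb = (self.rrb - 1) % self.SIZE
```

After `write(32, x)` and one `rotate()`, `read(33)` returns `x`. Each loop iteration can therefore write "r32" again without overwriting the value from the previous iteration, which is the renaming effect the slide refers to.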
Performance
● Two bundles per cycle
○ Up to six instructions per cycle
○ Multiply-accumulate allows for 4 FLOPs per cycle
● Quad core
○ QPI (96 GiB/s)
○ Four memory controllers (34 GiB/s)
● Split L1 cache (16 KiB instruction, 16 KiB data)
● Unified L2 cache (256 KiB)
● Unified L3 cache (24 MiB)
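The per-cycle figures above follow from simple multiplication. In this sketch the number of multiply-accumulate units (two, each counted as two floating-point operations) is an assumption chosen to match the 4 FLOPs per cycle on the slide, and the clock frequency is left as a parameter because the slides do not give one. Both function names are hypothetical.

```python
def peak_per_cycle(bundles_per_cycle=2, slots_per_bundle=3, fma_units=2):
    """Peak instructions and FLOPs per cycle for one core."""
    return {
        "instructions": bundles_per_cycle * slots_per_bundle,
        "flops": fma_units * 2,   # one multiply plus one add per unit
    }


def peak_gflops(flops_per_cycle, clock_ghz, cores=1):
    """Theoretical peak GFLOP/s; clock_ghz must be supplied by the caller."""
    return flops_per_cycle * clock_ghz * cores
```

These are upper bounds: they are reached only when the compiler fills every slot and every multiply-accumulate unit on every cycle.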
Where do I buy one?
● $3,838 for the Tukwila 9350
● Servers in excess of $200,000
● newegg doesn't have them
Emulation
● ski
○ ski - ncurses-based IA-64 simulator
○ xski - ski with a GUI
○ http://ski.sourceforge.net/
● cross compile
○ ia64-gcc
○ ia64-as (live on the edge)
Questions?