Intel Itanium Architecture
Alex Crawford
Matt Ofalt
Brief History
● Merced - 2001
○ Slower than competing RISC and CISC
● McKinley (Itanium 2) - 2002
○ Fixed many of the performance problems on Merced
● Montecito (Itanium 2 9000) - 2006
○ Dual-core, roughly doubled performance
● Tukwila (Itanium 9300) - 2010
○ Quad-core, memory error correction
○ Shares its chipset with Nehalem
Itanium Overview
● 64-bit (path, data, address space)
● Explicit instruction-level parallelism (VLIW)
○ Static "superscaling"
● Compiler
○ Predication
○ Speculation
○ Branch Prediction
● 128 integer registers, 128 FP registers
● 30 functional execution units
Compilers
● Very difficult to write
○ Predication
○ Speculation
○ Branch Prediction
● This is the reason the architecture is failing, but...
● Allows for huge improvements
● We like assembly better anyway, right?
IA-64 Instructions
● Issued in 128-bit "bundles"
● Three 41-bit instructions per bundle
● Template tells CPU which instructions execute in parallel
○ Not constrained to just one bundle (8 inst. in parallel)
● Six instruction types (type - description - execution unit)
○ A - Integer ALU - I/M unit
○ I - Non-ALU integer - I unit
○ M - Memory - M unit
○ F - Floating-point - F unit
○ B - Branch - B unit
○ X - Extended - I/B unit
Execution Units
● I-Unit
○ Integer arithmetic
○ Shift and add
○ Logical
● M-Unit
○ Load and Store
○ Basic integer ALU operations
● B-Unit
○ Branches
● F-Unit
○ Floating point
IA-64 Assembly
[(pq)] mnemonic[.comp] dest = src [;;] [// comment]
(p0) cmp.eq p1,p2=5,r7 // conditional 5 == r7
pq - 1-bit predicate register
mnemonic - name of instruction
comp - instruction completer
dest - one or more destination operands
src - one or more source operands
;; - instruction group stops
// - comment
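The field breakdown above can be sketched as a small parser. This is only an illustration in Python, not an assembler: `LINE_RE` and `parse_line` are hypothetical names, and the pattern handles just the simple `dest = src` shape shown on the slide.

```python
import re

# Hypothetical helper: splits one IA-64 assembly line into the fields named
# on the slide. Only the simple "dest = src" shape is supported.
LINE_RE = re.compile(
    r"""^\s*
        (?:\((?P<pq>p\d+)\)\s*)?      # optional qualifying predicate, e.g. (p0)
        (?P<mnemonic>[a-z][a-z0-9]*)  # instruction name, e.g. cmp
        (?P<comp>(?:\.[a-z0-9]+)*)    # zero or more completers, e.g. .eq
        \s+(?P<dest>[^=;/]+?)\s*      # destination operand(s)
        =\s*(?P<src>[^;/]+?)\s*       # source operand(s)
        (?P<stop>;;)?\s*              # optional stop ending the group
        (?://\s*(?P<comment>.*))?$    # optional trailing comment
    """,
    re.VERBOSE,
)


def parse_line(line):
    """Return the slide's fields for one line, or raise ValueError."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"unsupported line: {line!r}")
    return {
        "pq": m.group("pq"),
        "mnemonic": m.group("mnemonic"),
        "completers": [c for c in m.group("comp").split(".") if c],
        "dest": [d.strip() for d in m.group("dest").split(",")],
        "src": [s.strip() for s in m.group("src").split(",")],
        "stop": m.group("stop") is not None,
        "comment": m.group("comment"),
    }


fields = parse_line("(p0) cmp.eq p1,p2=5,r7 // conditional 5 == r7")
```

Here `fields["dest"]` holds the two predicate registers written by the compare, and `fields["stop"]` is false because the line has no `;;`.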
Assembly Example
ld8  r2   = [r3]
sub  r4   = r10, r11 ;;
add  r5   = r2, r6
st8  [r4] = r7 ;;
add  r2   = r2, 1 ;;
st8  [r2] = r5
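To make the role of the `;;` stops concrete, here is a minimal Python sketch that splits a listing into instruction groups. It reads the slide's example as six instructions with three stops; `split_groups` and `EXAMPLE` are hypothetical names, and no dependency checking is attempted.

```python
def split_groups(lines):
    """Split assembly lines into instruction groups at each ';;' stop."""
    groups, current = [], []
    for line in lines:
        text = line.split("//")[0].strip()  # drop any trailing comment
        if not text:
            continue
        stop = text.endswith(";;")
        if stop:
            text = text[:-2].strip()
        if text:
            current.append(text)
        if stop and current:
            groups.append(current)  # a stop closes the current group
            current = []
    if current:
        groups.append(current)      # instructions after the last stop
    return groups


# The example from the slide, one instruction per line.
EXAMPLE = [
    "ld8 r2 = [r3]",
    "sub r4 = r10, r11 ;;",
    "add r5 = r2, r6",
    "st8 [r4] = r7 ;;",
    "add r2 = r2, 1 ;;",
    "st8 [r2] = r5",
]

groups = split_groups(EXAMPLE)
```

Instructions inside one group are the ones the hardware may issue in parallel. In this listing every stop separates a register write from a later instruction that reads the same register.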
IA-64 Instruction Format
128-Bit Bundle
Instruction 1 (41 bits) | Instruction 2 (41 bits) | Instruction 3 (41 bits) | Template (5 bits)
41-Bit Instruction
Major Opcode (4 bits) | Modifying Bits (10 bits) | GR3 (7 bits) | GR2 (7 bits) | GR1 (7 bits) | PR (6 bits)
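The two layouts can be sketched as bit-field extraction in Python. The assumptions are that the template occupies the five least significant bits of the bundle, that the slot first in program order sits just above it, and that the instruction fields are listed from most to least significant bit. All function names are hypothetical, and real instruction formats vary by type, so this covers only the generic three-register layout on the slide.

```python
MASK41 = (1 << 41) - 1  # 41 one-bits, the width of one instruction slot


def split_bundle(bundle):
    """Split a 128-bit bundle (an int) into (template, [slot0, slot1, slot2])."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & 0x1F                      # low 5 bits
    slots = [(bundle >> (5 + 41 * i)) & MASK41 for i in range(3)]
    return template, slots


def make_bundle(template, slots):
    """Inverse of split_bundle, handy for building test values."""
    bundle = template & 0x1F
    for i, slot in enumerate(slots):
        bundle |= (slot & MASK41) << (5 + 41 * i)
    return bundle


def split_instruction(insn):
    """Split a 41-bit instruction into the fields listed on the slide."""
    return {
        "opcode": (insn >> 37) & 0xF,      # major opcode, 4 bits
        "modifiers": (insn >> 27) & 0x3FF, # modifying bits, 10 bits
        "gr3": (insn >> 20) & 0x7F,        # 7 bits select one of 128 registers
        "gr2": (insn >> 13) & 0x7F,
        "gr1": (insn >> 6) & 0x7F,
        "pr": insn & 0x3F,                 # 6 bits select one of 64 predicates
    }
```

The field widths add up as expected: 4 + 10 + 7 + 7 + 7 + 6 = 41 bits per instruction, and 3 × 41 + 5 = 128 bits per bundle.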
Template Field
Template  Slot 1  Slot 2  Slot 3    Template  Slot 1  Slot 2  Slot 3
00000     M       I       I         01110     M       M       F
00001     M       I       I         01111     M       M       F
00010     M       I       I         10000     M       I       B
00011     M       I       I         10001     M       I       B
00100     M       L       X         10010     M       B       B
00101     M       L       X         10011     M       B       B
01000     M       M       I         10110     B       B       B
01001     M       M       I         10111     B       B       B
01010     M       M       I         11000     M       M       B
01011     M       M       I         11001     M       M       B
01100     M       F       I         11100     M       F       B
01101     M       F       I         11101     M       F       B
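The template table lends itself to a lookup. The Python sketch below stores one entry per even/odd pair, since both values of a pair list the same slot types. `SLOT_TYPES`, `slot_units` and `stop_at_end` are hypothetical names. As background that is not on the slide: in the architecture manual, the odd value of each pair additionally marks a stop at the end of the bundle.

```python
# Slot types for each even/odd template pair, taken from the slide's table.
SLOT_TYPES = {
    0b00000: "MII", 0b00010: "MII", 0b00100: "MLX",
    0b01000: "MMI", 0b01010: "MMI", 0b01100: "MFI",
    0b01110: "MMF", 0b10000: "MIB", 0b10010: "MBB",
    0b10110: "BBB", 0b11000: "MMB", 0b11100: "MFB",
}


def slot_units(template):
    """Return the three slot types for a 5-bit template value."""
    if not 0 <= template < 32:
        raise ValueError("template is a 5-bit field")
    units = SLOT_TYPES.get(template & ~1)  # clear the low bit to find the pair
    if units is None:
        raise ValueError(f"template {template:05b} is not in the table")
    return tuple(units)


def stop_at_end(template):
    """Background assumption, not on the slide: odd templates end with a stop."""
    return bool(template & 1)
```

For example, `slot_units(0b10000)` gives `('M', 'I', 'B')`, while a value missing from the table such as `0b00110` raises `ValueError`.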
Branching on x86
if (G_LIKELY(random() != 1))
    printf("not one");

call   8048440 <random@plt>
cmp    $0x1,%eax
je     8048524 <main+0x20>
mov    $0x80485f0,%eax
mov    %eax,(%esp)
call   8048410 <printf@plt>

if (G_UNLIKELY(random() != 1))
    printf("not one");

call   8048440 <random@plt>
cmp    $0x1,%eax
jne    8048524 <main+0x1B>
mov    $0x0,%eax
leave
ret
Branching on IA-64
// random() -> r14
// not_ones -> r31
// ones     -> r32
if (random() != 1)
    not_ones++;
else
    ones++;

cmp.ne    p1,p2=1,r14 ;;  // p1: r14 != 1, p2: r14 == 1
(p1) adds r31=1,r31       // not_ones++
(p2) adds r32=1,r32       // ones++
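Predication replaces the branch with two guarded instructions. A rough Python model of that idea follows. It assumes the compare sets `p1` when the value differs from 1 and `p2` otherwise, matching the C source on the slide. `predicated` and `count_predicated` are hypothetical names.

```python
def predicated(pred, old, new):
    """Model a predicated instruction: the result commits only if pred is set."""
    return new if pred else old


def count_predicated(values):
    """Count values that are not 1 and values that are 1, without branching
    on the comparison inside the loop body."""
    not_ones = ones = 0
    for r14 in values:
        # One compare writes both predicates, as complements of each other.
        p1 = r14 != 1
        p2 = not p1
        # Both adds are issued every iteration; the predicate decides
        # which one actually updates its register.
        not_ones = predicated(p1, not_ones, not_ones + 1)
        ones = predicated(p2, ones, ones + 1)
    return not_ones, ones


result = count_predicated([1, 2, 3, 1, 1])
```

The point of the model is that there is nothing to mispredict: both instructions flow through the pipeline, and exactly one of them takes effect.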
Data Speculation on IA-64
ld8.a  r6    = [r8] ;;
// other stuff
ld8.c  r6    = [r8]
add    r5    = r6, r7 ;;
st8    [r18] = r5
Data Speculation on IA-64 (cont.)
ld8.a  r6    = [r8] ;;
// other stuff
add    r5    = r6, r7 ;;
// more stuff
chk.a  r6, dirty
origin:
st8    [r18] = r5

dirty:
ld8.a  r6    = [r8] ;;
add    r5    = r6, r7
br     origin
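The advanced-load and check pattern can be modeled with a toy version of the table the hardware uses to track speculative loads, which the architecture manual calls the ALAT (Advanced Load Address Table). The Python sketch below is a simplification under stated assumptions: the class and method names are hypothetical, entries are keyed by register name, and only exact address matches invalidate an entry.

```python
class Alat:
    """Toy model of advanced loads (ld.a) and check loads (ld.c)."""

    def __init__(self, memory):
        self.memory = memory  # dict: address -> value
        self.entries = {}     # register name -> address it was loaded from

    def ld_a(self, reg, addr):
        """Advanced load: load early and remember which address was used."""
        self.entries[reg] = addr
        return self.memory[addr]

    def st(self, addr, value):
        """Store: write memory and invalidate entries for the same address."""
        self.memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def ld_c(self, reg, addr, current):
        """Check load: keep the early value if the entry survived, else reload."""
        if self.entries.get(reg) == addr:
            return current
        return self.memory[addr]


alat = Alat({0x100: 10, 0x200: 0})
r6 = alat.ld_a("r6", 0x100)       # ld8.a r6 = [r8]
alat.st(0x100, 42)                # an intervening store to the same address
r6 = alat.ld_c("r6", 0x100, r6)   # ld8.c r6 = [r8] reloads the new value
```

With `chk.a` the idea is the same, except that a failed check jumps to recovery code that redoes the load and every instruction that consumed it, instead of reloading in place.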
Data Speculation on x86
???
Rotating Register Stack
● r32-r127 can rotate ("register renaming")
● software-pipelined loops without unrolling
● parameter passing
● overflows to memory
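Register rotation can be pictured as an offset applied when a register name is mapped to physical storage. The Python sketch below models that with a single rotating base. `RotatingRegs` is a hypothetical name, and the model rotates the whole r32-r127 range on demand, whereas real hardware rotates a programmable-size region as part of special loop branches.

```python
class RotatingRegs:
    """Toy model of rotating registers r32-r127."""

    def __init__(self, first=32, count=96):
        self.first = first
        self.count = count
        self.phys = [0] * count  # physical storage
        self.rrb = 0             # rotating register base (the rename offset)

    def _index(self, reg):
        if not self.first <= reg < self.first + self.count:
            raise IndexError(f"r{reg} is outside the rotating range")
        return (reg - self.first + self.rrb) % self.count

    def read(self, reg):
        return self.phys[self._index(reg)]

    def write(self, reg, value):
        self.phys[self._index(reg)] = value

    def rotate(self):
        """One rotation: the value visible as rN becomes visible as rN+1."""
        self.rrb = (self.rrb - 1) % self.count


regs = RotatingRegs()
regs.write(32, 7)   # one loop stage produces a value in r32
regs.rotate()       # next iteration begins
value = regs.read(33)  # the following stage finds that value in r33
```

No data moves during a rotation; only the mapping changes. This is why each stage of a pipelined loop can use fixed register names while iterations overlap.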
Performance
● Two bundles per cycle
○ Up to six instructions per cycle
○ Multiply-accumulate allows for 4 FLOPs per cycle
● Quad core
○ QPI (96 GiB/s)
○ Four memory controllers (34 GiB/s)
● Split L1 cache (16 KiB instruction, 16 KiB data)
● Unified L2 cache (256 KiB)
● Unified L3 cache (24 MiB)
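The peak figures follow from simple multiplication, sketched here in Python. The function names are hypothetical, and the clock rate is a free parameter because the slides do not give one.

```python
def peak_instructions_per_cycle(bundles_per_cycle=2, slots_per_bundle=3):
    """Two bundles of three instructions each give the slide's figure of six."""
    return bundles_per_cycle * slots_per_bundle


def peak_gflops(clock_ghz, cores=4, flops_per_cycle=4):
    """Upper bound on floating-point throughput for the whole chip.

    clock_ghz is an assumption supplied by the caller; the core count and
    FLOPs per cycle default to the numbers on the slide.
    """
    return clock_ghz * cores * flops_per_cycle
```

These are ceilings rather than measurements: they assume the compiler fills every slot and every floating-point instruction is a multiply-accumulate.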
Where do I buy one?
● $3,838 for the Tukwila 9350
● Servers in excess of $200,000
● newegg doesn't have them
Emulation
● ski
○ ski - ncurses-based IA-64 simulator
○ xski - ski with a GUI
○ http://ski.sourceforge.net/
● cross compile
○ ia64-gcc
○ ia64-as (live on the edge)
Questions?