Intel Itanium Architecture
Alex Crawford, Matt Ofalt

Brief History
● Merced - 2001
  ○ Slower than competing RISC and CISC
● McKinley (Itanium 2) - 2002
  ○ Fixed many of the performance problems on Merced
● Montecito (Itanium 2 9000) - 2006
  ○ Dual-core, roughly doubled performance
● Tukwila (Itanium 9300) - 2010
  ○ Quad-core, memory error correction
  ○ Shares its chipset with Nehalem

Itanium Overview
● 64-bit (path, data, address space)
● Explicit instruction-level parallelism (VLIW)
  ○ Static "superscaling"
● Compiler is responsible for
  ○ Predication
  ○ Speculation
  ○ Branch Prediction
● 128 integer registers, 128 FP registers
● 30 functional execution units

Compilers
● Very difficult to write
  ○ Predication
  ○ Speculation
  ○ Branch Prediction
● This is the reason the architecture is failing, but...
● Allows for huge improvements
● We like assembly better anyway, right?

IA-64 Instructions
● Issued in 128-bit "bundles"
● Three 41-bit instructions per bundle
● Template tells CPU which instructions execute in parallel
  ○ Not constrained to just one bundle (8 inst. in parallel)
● Six instruction types

    Type  Description       Execution unit
    A     Integer ALU       I/M unit
    I     Non-ALU integer   I unit
    M     Memory            M unit
    F     Floating-point    F unit
    B     Branch            B unit
    X     Extended          I/B unit

Execution Units
● I-Unit
  ○ Integer arithmetic
  ○ Shift and add
  ○ Logical
● M-Unit
  ○ Load and Store
  ○ Basic integer ALU operations
● B-Unit
  ○ Branches
● F-Unit
  ○ Floating point

IA-64 Assembly

    [pq] mnemonic [.comp] dest = src [;;] [//]
    (p0) cmp.eq p1,p2=5,r7 // conditional 5 == r7

● pq - 1-bit predicate register
● mnemonic - name of instruction
● comp - instruction completer
● dest - one or more destination operands
● src - one or more source operands
● ;; - instruction group stops
● // - comment

Assembly Example

    ld8  r2   = [r3]
    sub  r4   = r10, r11 ;;
    add  r5   = r2, r6
    st8  [r4] = r7 ;;
    add  r2   = r2, 1 ;;
    st8  [r2] = r5

IA-64 Instruction Format
● 128-Bit Bundle

    Instruction 1 (41 bits) | Instruction 2 (41 bits) | Instruction 3 (41 bits) | Template (5 bits)

● 41-Bit Instruction

    Major Opcode (4 bits) | Modifying Bits (10 bits) | GR3 (7 bits) | GR2 (7 bits) | GR1 (7 bits) | PR (6 bits)

Template Field

    Template  Slot 1  Slot 2  Slot 3    Template  Slot 1  Slot 2  Slot 3
    00000     M       I       I         01110     M       M       F
    00001     M       I       I         01111     M       M       F
    00010     M       I       I         10000     M       I       B
    00011     M       I       I         10001     M       I       B
    00100     M       L       X         10010     M       B       B
    00101     M       L       X         10011     M       B       B
    01000     M       M       I         10110     B       B       B
    01001     M       M       I         10111     B       B       B
    01010     M       M       I         11000     M       M       B
    01011     M       M       I         11001     M       M       B
    01100     M       F       I         11100     M       F       B
    01101     M       F       I         11101     M       F       B

Branching on x86

    if (G_LIKELY(random() != 1))
        printf("not one");

    call 8048440 <random@plt>
    cmp  $0x1,%eax
    je   8048524 <main+0x20>
    mov  $0x80485f0,%eax
    mov  %eax,(%esp)
    call 8048410 <printf@plt>

    if (G_UNLIKELY(random() != 1))
        printf("not one");

    call 8048440 <random@plt>
    cmp  $0x1,%eax
    jne  8048524 <main+0x1B>
    mov  $0x0,%eax
    leave
    ret

Branching on IA-64

    // random() -> r14
    // not_ones -> r31
    // ones     -> r32
    if(random() != 1)
        not_ones++;
    else
        ones++;

    cmp.eq    p1,p2=1,r14   // p1 = (r14 == 1), p2 = !p1
    (p2) adds r31=1,r31     // not_ones++
    (p1) adds r32=1,r32     // ones++

Data Speculation on IA-64

    ld8.a r6 = [r8] ;;
    // other stuff
    ld8.c r6 = [r8]
    add   r5 = r6, r7 ;;
    st8   [r18] = r5

Data Speculation on IA-64 (cont.)

    ld8.a r6 = [r8] ;;
    // other stuff
    add   r5 = r6, r7 ;;
    // more stuff
    chk.a r6, dirty
    origin:
    st8   [r18] = r5

    dirty:
    ld8.a r6 = [r8] ;;
    add   r5 = r6, r7
    br    origin

Data Speculation on x86
● ???

Rotating Register Stack
● r32-r127 can rotate ("register renaming")
● loop unrolling
● parameter passing
● overflows to memory

Performance
● Two bundles per cycle
  ○ Up to six instructions per cycle
  ○ Multiply-accumulate allows for 4 FLOPs per cycle
● Quad core
  ○ QPI (96 GiB/s)
  ○ Four memory controllers (34 GiB/s)
● Split L1 cache (16 KiB instruction, 16 KiB data)
● Unified L2 cache (256 KiB)
● Unified L3 cache (24 MiB)

Where do I buy one?
● $3,838 for the Tukwila 9350
● Servers in excess of $200,000
● Newegg doesn't have them

Emulation
● ski
  ○ ski - ncurses-based IA-64 simulator
  ○ xski - ski with a GUI
  ○ http://ski.sourceforge.net/
● cross compile
  ○ ia64-gcc
  ○ ia64-as (live on the edge)

Questions?
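
Appendix sketch: unpacking a bundle

The bundle and instruction layouts from the IA-64 Instruction Format and Template Field slides can be sketched in a few lines of Python. This is a minimal illustration, not a real disassembler. It assumes the template sits in the lowest 5 bits of the 128-bit bundle with the three 41-bit slots above it, and that the instruction fields run PR, GR1, GR2, GR3, modifying bits, major opcode from the low bits upward. The function names (`pack_instruction`, `pack_bundle`, `unpack_bundle`, `decode_fields`) are hypothetical helpers made up for this sketch.

```python
# Template encodings copied from the Template Field table in the slides.
TEMPLATES = {
    0b00000: "MII", 0b00001: "MII", 0b00010: "MII", 0b00011: "MII",
    0b00100: "MLX", 0b00101: "MLX",
    0b01000: "MMI", 0b01001: "MMI", 0b01010: "MMI", 0b01011: "MMI",
    0b01100: "MFI", 0b01101: "MFI", 0b01110: "MMF", 0b01111: "MMF",
    0b10000: "MIB", 0b10001: "MIB", 0b10010: "MBB", 0b10011: "MBB",
    0b10110: "BBB", 0b10111: "BBB", 0b11000: "MMB", 0b11001: "MMB",
    0b11100: "MFB", 0b11101: "MFB",
}

TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_instruction(opcode, modifiers, gr3, gr2, gr1, pr):
    """Assemble one 41-bit instruction word from its six fields.

    Assumed layout (low to high): PR(6) GR1(7) GR2(7) GR3(7) modifiers(10) opcode(4).
    """
    return (
        (pr & 0x3F)
        | (gr1 & 0x7F) << 6
        | (gr2 & 0x7F) << 13
        | (gr3 & 0x7F) << 20
        | (modifiers & 0x3FF) << 27
        | (opcode & 0xF) << 37
    )


def decode_fields(insn):
    """Split a 41-bit instruction word back into its six fields."""
    return {
        "pr": insn & 0x3F,
        "gr1": (insn >> 6) & 0x7F,
        "gr2": (insn >> 13) & 0x7F,
        "gr3": (insn >> 20) & 0x7F,
        "modifiers": (insn >> 27) & 0x3FF,
        "opcode": (insn >> 37) & 0xF,
    }


def pack_bundle(template, slots):
    """Combine a 5-bit template and three 41-bit slots into a 128-bit integer."""
    bundle = template & 0x1F
    for i, insn in enumerate(slots):
        bundle |= (insn & SLOT_MASK) << (TEMPLATE_BITS + SLOT_BITS * i)
    return bundle


def unpack_bundle(bundle):
    """Return (template, (slot0, slot1, slot2)) for a 128-bit bundle."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & 0x1F
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + SLOT_BITS * i)) & SLOT_MASK for i in range(3)
    )
    return template, slots
```

Looking up `TEMPLATES[template]` after `unpack_bundle` gives the unit type for each slot, for example "MII". The field values passed to `pack_instruction` in any experiment are arbitrary test numbers, not real opcodes.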
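
Appendix sketch: advanced loads and check loads

The `ld8.a` / `ld8.c` pair from the Data Speculation slides can be illustrated with a toy model. The sketch below is an assumption-heavy simplification: it tracks advanced loads in a dictionary keyed by register name, drops an entry when a store hits the same address, and has the check load reload the value only when the entry is gone. The class `ToyALAT` and its methods are invented for this illustration and ignore entry capacity, partial address overlap, and access sizes.

```python
class ToyALAT:
    """Toy model of the table that tracks advanced loads (hypothetical)."""

    def __init__(self):
        self.entries = {}  # register name -> address it was loaded from

    def advanced_load(self, reg, addr, memory, regs):
        """Model ld8.a: load early and remember where the value came from."""
        regs[reg] = memory[addr]
        self.entries[reg] = addr

    def store(self, addr, value, memory):
        """Model st8: write memory and invalidate matching advanced loads."""
        memory[addr] = value
        stale = [reg for reg, a in self.entries.items() if a == addr]
        for reg in stale:
            del self.entries[reg]

    def check_load(self, reg, addr, memory, regs):
        """Model ld8.c: reload only if the earlier advanced load was invalidated.

        Returns True when a reload happened, False when the early value was kept.
        """
        if reg in self.entries:
            return False
        regs[reg] = memory[addr]
        return True


def run_example(conflicting_store):
    """Replay the slide's sequence: ld8.a, other stuff, ld8.c, add, st8."""
    memory = {0x100: 10, 0x200: 0, 0x300: 0}
    regs = {"r7": 5}
    alat = ToyALAT()

    alat.advanced_load("r6", 0x100, memory, regs)      # ld8.a r6 = [r8]
    # "other stuff": a store that may or may not alias the load address
    alat.store(0x100 if conflicting_store else 0x300, 42, memory)
    reloaded = alat.check_load("r6", 0x100, memory, regs)  # ld8.c r6 = [r8]
    regs["r5"] = regs["r6"] + regs["r7"]               # add r5 = r6, r7
    alat.store(0x200, regs["r5"], memory)              # st8 [r18] = r5
    return reloaded, memory[0x200]
```

Calling `run_example(False)` keeps the early value, while `run_example(True)` forces the check load to fetch the stored value again, which is the case the compiler cannot rule out statically.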