Download VIIIfx - Fujitsu

SPARC64™ VIIIfx: Fujitsu’s New Generation Octo Core Processor for PETA Scale computing August 25, 2009 Takumi Maruyama LSI Development Division Next Generation Technical Computing Unit Fujitsu Limited All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Fujitsu Processor Development Tr=760M CMOS Cu / 45nm Features HPC-ACE System On Chip SPARC64 ™ SPARC64™ Processor Processor Hardware Barrier High Performance Technology Multi-core / Multi-thread L2$ on Die Non-Blocking $ O-O-O Execution Super-Scalar Tr=190M CMOS Cu 130nm Tr=30M CMOS Cu 180nm / 150nm SPARC64 High Reliability Technology SPARC64™ VIIIfx Tr=400M CMOS Cu 90nm GS21 SPARC64 V GS21 900 Tr=46M CMOS Cu 180nm GS8800B GS8800 SPARC64 GS8600 1996 ～1997 Mainframe Mainframe SPARC64 V+ GS8900 SPARC64 II UNIX UNIX SPARC64 VII SPARC64 VI Tr=500M CMOS Cu 90nm GS21 600 Store Ahead Branch History Prefetch ～1995 Tr=540M CMOS Cu 90nm GP Single-chip CPU $ ECC Register/ALU Parity Instruction Retry $ Dynamic Degradation RC/RT/History Tr=600M CMOS Cu 65nm HPC HPC SPARC64 VIIIfx Tr=190M CMOS Cu 130nm Mainframe Mainframe Tr=30M CMOS Al 250nm / 220nm Tr=10M CMOS Al 350nm 1998 ～1999 2 2000 ～2003 2004 ～2007 2008～ All Rights Reserved,Copyright© FUJITSU LIMITED 2009 SPARC64™ VIIIfx Design Target  Processor for Fujitsu’s supercomputer for the peta-scale computing age, which realizes both high performance and low power  Unachievable goal with conventional processor design • More GF (Giga Flops) • More Efficiency Focus of this presentation • No more high frequency Large ISA (Instruction Set Architecture) extension is required  High Integration: SoC (System On Chip)  High Reliability  Reuse SPARC64TM VII design when applicable SPARC64™ VIIIfx 3 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 HPC-ACE (High Performance Computing - Arithmetic Computational Extensions)  ISA (Instruction Set Architecture) of SPARC64VIIIfx is:  Complies with • The SPARC-V9 standard • JPS (Joint Programmer’s Specification): Extension to SPARC-V9  HPC-ACE: Fujitsu’s unique ISA extension for HPC • Large register sets • SIMD (single instruction multiple data) instructions • etc  You can download SPARC64TM VIIIfx ISA documents from http://jp.fujitsu.com/solutions/hpc/brochures/  The SPARC® Architecture Manual Version 9  SPARC® Joint Programming Specification (JPS1): Commonality  SPARC64 VIIIfx Extensions SPARC64™ VIIIfx 4 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Large register sets 1/2  Register enhancements from V9  160192 INT registers  32256 FP registers (DP) • Lower 32 registers are the same with SPARC-V9’s. • FP registers are “flat”. Extended FP registers can be accessed with non-SIMD instruction as well. INT Reg Register 160 Window FP Reg V9 32 Ext. 32 V9 SIMD (basic) V9 Ext. 32 Ext. 32 224 SIMD (extended)  Why needed  To extract more parallelism currently limited by the number of Arch registers.  Reduce spill/fill overhead Loop0 Large parallelism w/ loop unrolling Loop0 Loop1 LoopN Loop1 … Limited parallelism w/ O-O-O SPARC64™ VIIIfx 5 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Large register sets 2/2  Instruction format for 256 FP registers  8 bit x 4 (3 read +1 write) register number fields are necessary for FMA (Floating-point Multiply and Add) instruction.  But SPARC-V9 instruction length is limited (32bits – fixed)  Defined a new prefix instruction (SXAR) to specify upper-3bit of register numbers of the following two instructions. SXAR (Set XAR) instruction  XAR: Extended Arithmetic Register • Set by the SXAR instruction • Valid bit is cleared once the corresponding subsequent instruction gets executed. fv fsimd furd 31 furs1 furs2 furs3 sv ssimd surd 1615 surs1 surs2 First Upper Register Source-1 bits surs3 0 SXAR inst1 inst2  SXAR1: set XAR for subsequent one instruction.  SXAR2: set XAR for subsequent two instructions. SPARC64™ VIIIfx Upper-3bit x4 6 Lower-5bit x4 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 SIMD 16byte Load x2  SIMD FP operation  1 SIMD FP instruction executes two SP or DP floating-point operations extended #0-31 #128-255 #32-127  SIMD Load/Store  1 SIMD load/store instruction accesses two contiguous SP or DP data in memory  Data alignment requirement for SIMD-load (DP) is 8 byte rather than usual 16byte.  Combine two independent FP operation into one with the Flat FP registers  SXAR instruction  Specifies SIMD operation of the subsequent two instructions.  No new opcode is required for SIMD operation  More SW optimization possibility with SIMD SPARC64™ VIIIfx basic 7 A-pipe SIMD B-pipe SIMD C-pipe D-pipe Conventional SIMD usage SIMD-Load (Src1) SIMD-Load (Src2) SIMD-FP SIMD-Store (Dst) Combine two FP ops with SIMD Load (Src1-A) Load (Src1-B) Load (Src2-A) Load (Src2-B) SIMD-FP Store (Dst-A) Store (Dst-B) All Rights Reserved,Copyright© FUJITSU LIMITED 2009 FP Trigonometric Functions  Instructions for fast Trigonometric Function calculation Taylor series approximation of sin (x) ~= x – 1/3!x3 + 1/5!x5 – 1/7!x7 + 1/9!x9 – 1/11x11 + 1/13x13 - 1/15x15 = x ((((((((0 – 1/15!)x2 + 1/13!)x2 – 1/11!)x2 + 1/9!)x2 – 1/7!)x2 + 1/5!)x2 – 1/3!)x2 + 1) New instruction - ftrimaddd  Function: rs1 × abs(rs2) + T[index] → rd  Usage: #S=0 ftrimaddd S, X2, 7, S ftrimaddd S, X2, 6, S ftrimaddd S, X2, 5, S ftrimaddd S, X2 , 4, S ftrimaddd S, X2 , 3, S ftrimaddd S, X2 , 2, S ftrimaddd S, X2 , 1, S ftrimaddd S, X2 , 0, S Fmuld S, X, S SPARC64™ VIIIfx # S * X2 – 1/15!  S # S * X2 + 1/13!  S # S * X2 – 1/11!  S # S * X2 + 1/9!  S # S * X2 – 1/7!  S # S * X2 + 1/5!  S # S * X2 – 1/3!  S # S * X2 + 1/1!  S #S*XS 8 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Software controlled Cache  Give SW ability to control cache to optimize performance while keeping cache coherency  Divide Cache into 2 groups (sectors)  SXAR instruction specifies the sector • sector 0: Instruction fetch / normal operand access (default) • sector 1: operand access explicitly specified by SXAR  Sector cache configuration register Specifies the ratio of sector 0 and 1 at the same index L2 Cache structure Ex) Sector0:Sector1=30%:70%  HW Implementation  Keep sector info of each cache line.  Select the way to be replaced to meet the sector ratio on cache misses SW specifies a sector depending on temporal and spatial locality of data SPARC64™ VIIIfx 9 Way0 Way9 0 0 0 1 1 1 1 1 1 1 index 0 1 1 0 1 1 1 1 0 1 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Other HPC-ACE features  Conditional operation SIMD  For efficient execution of Loop with ‘if’  Conditional branch can be removed with the following conditional instructions. for (i=0; i<00; i++) { if (A(i)>0) FPop X(i) else FPop Y(i) }  FP Conditional Compare to Register  Move Selected FP-register on FP-register’s condition  Store FP-register on FP-register’s condition basic extended i=0 i=1 i=2 i=3 i=4 i=5 i=6 i=7 time Compiler can optimize the loop with SW pipeline  FP Reciprocal Approximation of Divide/Square-root  Calculate reciprocal approximation with rounding error < 1/256  To achieve higher divide/square-root performance with pipelined operation  FP Minimum and Maximum SPARC64™ VIIIfx 10 Usage: # %f2 / %f10  %f0 frcpad %f10, %f6 fmuld %f2, %f6, %f2 fnmsubd %f6, %f10, 1.0, %f6 fmuld %f6, %f6, %f0 fmaddd %f6, %f6, %f6, %f4 fmaddd %f0, %f0, %f6, %f0 fmaddd %f4, %f2, %f2, %f4 fmaddd %f0, %f4, %f2, %f0 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Integrated Multicore Parallel Architecture  SPARC64TM VIIIfx HW  HPC-ACE  Shared L2 cache to avoid false sharing  More possibility of optimization: Parallelize the innermost loop  Hardware barrier for fast inter-core synchronization  Fujitsu’s compiler technology SPARC64TM VIIIfx  Automatic parallelization Vector Conventional Scalar DO J=1,N V DO I=1,M VA(J)=A(J)+A(I,J+1)*B(I,J) V END END 2 3 4 5 SPARC64™ VIIIfx P DO J=1,N DO I=1,M A(J)=A(J)+A(I,J+1)*B(I,J) END END Parallel I 1 2 3 4 ・・・ Time 1 Time I= P J= 1 2 3 4 5 CORE0 CORE2 CORE1 CORE3 11 Parallel J 1 2 3 4 ・・・ Time Vector J 1 2 3 4 ・・・ DO J=1,N P DO I=1,M P A(J)=A(J)+A(I,J+1)*B(I,J) P END END I= CORE0 CORE1 CORE2 SIMD basic extended basic extended All Rights Reserved,Copyright© FUJITSU LIMITED 2009 SPARC64™ VIIIfx Chip Overview • Architecture Features • • • • HSIO L2$ Data Core4 MAC MAC Core6 L2$ Control Core1 Core0 Core7 MAC MAC Core3 L2$ Data DDR3 interface DDR3 interface Core5 8 cores Shared 5 MB L2$ Embedded Memory Controller 2 GHz • Fujitsu 45nm CMOS • 22.7mm x 22.6mm • 760M transistors • 1271 signal pins • Performance (peak) Core2 • 128GFlops • 64GB/s memory throughput • Power • 58W (TYP, 30℃) • Water Cooling – Low leakage power and High reliability SPARC64™ VIIIfx 12 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 SPARC64™ VIIIfx Core spec L1D$ Execution Unit Instruction Set Architecture SPARC-V9/JPS + HPC-ACE #FP-operations 8 (= 4 Floating Multiply and Add) per clock FPR + FUB Instruction control L1$ control <integer> <floating-point> ALU x2 SHIFT x2 MULT x1 DIVIDE x1 AGEN x2 FMA x4 (2SIMD) COMPARE x2 #Registers 188 (GPR) 32 (GUB) 256 (FPR) 48 x2 (FUB) L1$ L1I$ 32KB/2way L1D$ 32KB/2way Execution units L1I$ SPARC64™ VIIIfx 13 DIVIDE x2 VIS x1 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 SPARC64TM VII Pipeline Issue Dispatch Reg.-Read Execute Fetch (4stages) (4stages) (2stages) Memory Commit (L1$: 3(int)/4(fp)stages) (2stages) CSE 64Entry L1 I$ 64KB 2Way Decode & Issue RSA 10Entry GPR EAGA 16Entry EAGB Store Port 156Registers x2 EXA Branch Target Address 8Kentry 4Way RSE GUB 8x2Entry 48Registers RSF 8x2Entry FPR Fetch Port EXB PC x2 L1 D$ 64KB 2Way 8x2Entry Control Registers x2 Write Buffer 5x2Entry FLA 64Registers x2 FLB L2$ RSBR FUB 10Entry 48Registers 6MB 12Way System Bus Interface SPARC64™ VIIIfx 14 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 SPARC64TM VIIIfx Pipeline Fetch Issue Dispatch Reg.-Read Execute (4stages) (3stages) (4(int)/5(fp) stages) Commit Memory (L1$: 3(int)/4(fp)stages) (2stages) CSE 48Entry L1 I$ 32KB 2Way Decode & Issue RSA GPR 10Entry 188Registers EAGA 20Entry EAGB Store Port EXA Branch Target Address 1Kentry 2Way RSE GUB 10Entry 32Registers Fetch Port EXB PC L1 D$ 32KB 2Way 8Entry Control Registers Write Buffer 5Entry FLA RSF FPR 8x2Entry 256Registers FLB L2$ FLC RSBR FUB 6Entry 48x2Registers 5MB 10Way FLD Memory Controller SPARC64™ VIIIfx DIMM 15 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Changes in the pipeline  Issue stage     E Added “PD” stage to handle SXAR 6-instructions packed into 4 at “PD” An CSE entry is assigned at “D” The packed instructions have  Extended register fields  SIMD operation attributes  Execution Stage 16 B SXAR C A’ B’ C’ D D’ Decode basic extended FPR FUB FPR FUB B2 X SPARC64™ VIIIfx A P B  “B” stage has been split into two for flat Floating-point Register access SXAR PD D  4 packed (=6 unpacked) instructions can be committed at the same time. Read IB Bypass FLA/B FLC/D All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Performance Results – Core Execution Time Relative to SPARC64TM [email protected] SPARC64TM [email protected] Hardware measured results 1.4 Performance = x 4.3 1.2 Benchmark: Euroben + in-house Performance = x 6.8 Performance = x 1.4 1 0.8 0 commit due to - $/system - Execution unit 1 commit 2-3 commit 4 commit 0.6 0.4 0.2 0 VII VIIIfx (noSIMD) VIIIfx Loop6: y(i) = x1(i) / x2(i) VII VIIIfx (noSIMD) VIIIfx VII y(i) = sin(x1(i)) VIIIfx (noSIMD) VIIIfx Loop14: 9th Degr Polyn  SPARC64TM VIIIfx realizes much higher core performance than SPARC64TM VII thanks to HPC-ACE despite its lower frequency SPARC64™ VIIIfx 17 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Performance Results – Chip Execution Time Relative to SPARC64TM [email protected] SPARC64TM [email protected] Simulated results 1.4 1.2 Performance = x 2.4 Benchmark: Scientific Application - Molecular Dynamics - Fluid Dynamics Performance = x 2.6 1 0.8 0 commit due to - $/system - Execution unit 1 commit 2-3 commit 4 commit 0.6 0.4 0.2 0 SPARC64VII SPARC64VIIIfx Molecular Dynamics SPARC64VII SPARC64VIIIfx Fluid Dynamics  SPARC64TM VIIIfx shows about x 2.5 performance of SPARC64TM VII  Stall time due to the execution unit busy has been reduced dramatically.  Expect x 3 performance with further compiler optimization SPARC64™ VIIIfx 18 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 SPARC64™ History and Future Evolution rather than revolution Peak Performance & Power  SPARC64TM V (1 core) • RAS • Single thread performance  SPARC64TM VI (2 core x 2 VMT) • Throughput  SPARC64TM VII (4 core x 2 SMT) • More Throughput • High Performance Computing SPARC64 V = 1.0 25 20 Performance (GF) =x3 15 10 Power (W) = x 0.5 5 0 V V+ VI SPARC64 SPARC64™ VIIIfx VII VIIIfx SPARC64TM VIIIfx (8 core) • High Performance Computing • Low Power • SoC 19 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 SPARC64TM VIIIfx Summary  SPARC64TM VIIIfx has been designed to be used for Fujitsu’s supercomputer for the PETA-scale computing age.  HPC-ACE instruction sets overcomes the limitation of conventional SPARC-V9 architecture.  SPARC64TM VIIIfx has combined high performance and low power.  SPARC64TM VIIIfx chip is up and running in the lab.  Fujitsu will continue to develop SPARC64TM series to meet the needs of a new era. SPARC64™ VIIIfx 20 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Abbreviations • SPARC64TM VIIIfx – – – – – – – – – – IB: Instruction Buffer RSA: Reservation Station for Address generation RSE: Reservation Station for Execution RSF: Reservation Station for Floating-point RSBR: Reservation Station for Branch GUB: General Update Buffer FUB: Floating point Update Buffer GPR: General Purpose Register FPR: Floating Point Register CSE: Commit Stack Entry SPARC64™ VIIIfx 21 All Rights Reserved,Copyright© FUJITSU LIMITED 2009

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download VIIIfx - Fujitsu