Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
SPARC64™ VIIIfx: Fujitsu’s New Generation Octo Core Processor for PETA Scale computing August 25, 2009 Takumi Maruyama LSI Development Division Next Generation Technical Computing Unit Fujitsu Limited All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Fujitsu Processor Development Tr=760M CMOS Cu / 45nm Features HPC-ACE System On Chip SPARC64 ™ SPARC64™ Processor Processor Hardware Barrier High Performance Technology Multi-core / Multi-thread L2$ on Die Non-Blocking $ O-O-O Execution Super-Scalar Tr=190M CMOS Cu 130nm Tr=30M CMOS Cu 180nm / 150nm SPARC64 High Reliability Technology SPARC64™ VIIIfx Tr=400M CMOS Cu 90nm GS21 SPARC64 V GS21 900 Tr=46M CMOS Cu 180nm GS8800B GS8800 SPARC64 GS8600 1996 ~1997 Mainframe Mainframe SPARC64 V+ GS8900 SPARC64 II UNIX UNIX SPARC64 VII SPARC64 VI Tr=500M CMOS Cu 90nm GS21 600 Store Ahead Branch History Prefetch ~1995 Tr=540M CMOS Cu 90nm GP Single-chip CPU $ ECC Register/ALU Parity Instruction Retry $ Dynamic Degradation RC/RT/History Tr=600M CMOS Cu 65nm HPC HPC SPARC64 VIIIfx Tr=190M CMOS Cu 130nm Mainframe Mainframe Tr=30M CMOS Al 250nm / 220nm Tr=10M CMOS Al 350nm 1998 ~1999 2 2000 ~2003 2004 ~2007 2008~ All Rights Reserved,Copyright© FUJITSU LIMITED 2009 SPARC64™ VIIIfx Design Target Processor for Fujitsu’s supercomputer for the peta-scale computing age, which realizes both high performance and low power Unachievable goal with conventional processor design • More GF (Giga Flops) • More Efficiency Focus of this presentation • No more high frequency Large ISA (Instruction Set Architecture) extension is required High Integration: SoC (System On Chip) High Reliability Reuse SPARC64TM VII design when applicable SPARC64™ VIIIfx 3 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 HPC-ACE (High Performance Computing - Arithmetic Computational Extensions) ISA (Instruction Set Architecture) of SPARC64VIIIfx is: Complies with • The SPARC-V9 standard • JPS (Joint Programmer’s Specification): Extension to SPARC-V9 HPC-ACE: Fujitsu’s unique ISA extension for HPC • Large register sets • SIMD (single instruction multiple data) instructions • etc You can download SPARC64TM VIIIfx ISA documents from http://jp.fujitsu.com/solutions/hpc/brochures/ The SPARC® Architecture Manual Version 9 SPARC® Joint Programming Specification (JPS1): Commonality SPARC64 VIIIfx Extensions SPARC64™ VIIIfx 4 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Large register sets 1/2 Register enhancements from V9 160192 INT registers 32256 FP registers (DP) • Lower 32 registers are the same with SPARC-V9’s. • FP registers are “flat”. Extended FP registers can be accessed with non-SIMD instruction as well. INT Reg Register 160 Window FP Reg V9 32 Ext. 32 V9 SIMD (basic) V9 Ext. 32 Ext. 32 224 SIMD (extended) Why needed To extract more parallelism currently limited by the number of Arch registers. Reduce spill/fill overhead Loop0 Large parallelism w/ loop unrolling Loop0 Loop1 LoopN Loop1 … Limited parallelism w/ O-O-O SPARC64™ VIIIfx 5 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Large register sets 2/2 Instruction format for 256 FP registers 8 bit x 4 (3 read +1 write) register number fields are necessary for FMA (Floating-point Multiply and Add) instruction. But SPARC-V9 instruction length is limited (32bits – fixed) Defined a new prefix instruction (SXAR) to specify upper-3bit of register numbers of the following two instructions. SXAR (Set XAR) instruction XAR: Extended Arithmetic Register • Set by the SXAR instruction • Valid bit is cleared once the corresponding subsequent instruction gets executed. fv fsimd furd 31 furs1 furs2 furs3 sv ssimd surd 1615 surs1 surs2 First Upper Register Source-1 bits surs3 0 SXAR inst1 inst2 SXAR1: set XAR for subsequent one instruction. SXAR2: set XAR for subsequent two instructions. SPARC64™ VIIIfx Upper-3bit x4 6 Lower-5bit x4 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 SIMD 16byte Load x2 SIMD FP operation 1 SIMD FP instruction executes two SP or DP floating-point operations extended #0-31 #128-255 #32-127 SIMD Load/Store 1 SIMD load/store instruction accesses two contiguous SP or DP data in memory Data alignment requirement for SIMD-load (DP) is 8 byte rather than usual 16byte. Combine two independent FP operation into one with the Flat FP registers SXAR instruction Specifies SIMD operation of the subsequent two instructions. No new opcode is required for SIMD operation More SW optimization possibility with SIMD SPARC64™ VIIIfx basic 7 A-pipe SIMD B-pipe SIMD C-pipe D-pipe Conventional SIMD usage SIMD-Load (Src1) SIMD-Load (Src2) SIMD-FP SIMD-Store (Dst) Combine two FP ops with SIMD Load (Src1-A) Load (Src1-B) Load (Src2-A) Load (Src2-B) SIMD-FP Store (Dst-A) Store (Dst-B) All Rights Reserved,Copyright© FUJITSU LIMITED 2009 FP Trigonometric Functions Instructions for fast Trigonometric Function calculation Taylor series approximation of sin (x) ~= x – 1/3!x3 + 1/5!x5 – 1/7!x7 + 1/9!x9 – 1/11x11 + 1/13x13 - 1/15x15 = x ((((((((0 – 1/15!)x2 + 1/13!)x2 – 1/11!)x2 + 1/9!)x2 – 1/7!)x2 + 1/5!)x2 – 1/3!)x2 + 1) New instruction - ftrimaddd Function: rs1 × abs(rs2) + T[index] → rd Usage: #S=0 ftrimaddd S, X2, 7, S ftrimaddd S, X2, 6, S ftrimaddd S, X2, 5, S ftrimaddd S, X2 , 4, S ftrimaddd S, X2 , 3, S ftrimaddd S, X2 , 2, S ftrimaddd S, X2 , 1, S ftrimaddd S, X2 , 0, S Fmuld S, X, S SPARC64™ VIIIfx # S * X2 – 1/15! S # S * X2 + 1/13! S # S * X2 – 1/11! S # S * X2 + 1/9! S # S * X2 – 1/7! S # S * X2 + 1/5! S # S * X2 – 1/3! S # S * X2 + 1/1! S #S*XS 8 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Software controlled Cache Give SW ability to control cache to optimize performance while keeping cache coherency Divide Cache into 2 groups (sectors) SXAR instruction specifies the sector • sector 0: Instruction fetch / normal operand access (default) • sector 1: operand access explicitly specified by SXAR Sector cache configuration register Specifies the ratio of sector 0 and 1 at the same index L2 Cache structure Ex) Sector0:Sector1=30%:70% HW Implementation Keep sector info of each cache line. Select the way to be replaced to meet the sector ratio on cache misses SW specifies a sector depending on temporal and spatial locality of data SPARC64™ VIIIfx 9 Way0 Way9 0 0 0 1 1 1 1 1 1 1 index 0 1 1 0 1 1 1 1 0 1 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Other HPC-ACE features Conditional operation SIMD For efficient execution of Loop with ‘if’ Conditional branch can be removed with the following conditional instructions. for (i=0; i<00; i++) { if (A(i)>0) FPop X(i) else FPop Y(i) } FP Conditional Compare to Register Move Selected FP-register on FP-register’s condition Store FP-register on FP-register’s condition basic extended i=0 i=1 i=2 i=3 i=4 i=5 i=6 i=7 time Compiler can optimize the loop with SW pipeline FP Reciprocal Approximation of Divide/Square-root Calculate reciprocal approximation with rounding error < 1/256 To achieve higher divide/square-root performance with pipelined operation FP Minimum and Maximum SPARC64™ VIIIfx 10 Usage: # %f2 / %f10 %f0 frcpad %f10, %f6 fmuld %f2, %f6, %f2 fnmsubd %f6, %f10, 1.0, %f6 fmuld %f6, %f6, %f0 fmaddd %f6, %f6, %f6, %f4 fmaddd %f0, %f0, %f6, %f0 fmaddd %f4, %f2, %f2, %f4 fmaddd %f0, %f4, %f2, %f0 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Integrated Multicore Parallel Architecture SPARC64TM VIIIfx HW HPC-ACE Shared L2 cache to avoid false sharing More possibility of optimization: Parallelize the innermost loop Hardware barrier for fast inter-core synchronization Fujitsu’s compiler technology SPARC64TM VIIIfx Automatic parallelization Vector Conventional Scalar DO J=1,N V DO I=1,M VA(J)=A(J)+A(I,J+1)*B(I,J) V END END 2 3 4 5 SPARC64™ VIIIfx P DO J=1,N DO I=1,M A(J)=A(J)+A(I,J+1)*B(I,J) END END Parallel I 1 2 3 4 ・・・ Time 1 Time I= P J= 1 2 3 4 5 CORE0 CORE2 CORE1 CORE3 11 Parallel J 1 2 3 4 ・・・ Time Vector J 1 2 3 4 ・・・ DO J=1,N P DO I=1,M P A(J)=A(J)+A(I,J+1)*B(I,J) P END END I= CORE0 CORE1 CORE2 SIMD basic extended basic extended All Rights Reserved,Copyright© FUJITSU LIMITED 2009 SPARC64™ VIIIfx Chip Overview • Architecture Features • • • • HSIO L2$ Data Core4 MAC MAC Core6 L2$ Control Core1 Core0 Core7 MAC MAC Core3 L2$ Data DDR3 interface DDR3 interface Core5 8 cores Shared 5 MB L2$ Embedded Memory Controller 2 GHz • Fujitsu 45nm CMOS • 22.7mm x 22.6mm • 760M transistors • 1271 signal pins • Performance (peak) Core2 • 128GFlops • 64GB/s memory throughput • Power • 58W (TYP, 30℃) • Water Cooling – Low leakage power and High reliability SPARC64™ VIIIfx 12 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 SPARC64™ VIIIfx Core spec L1D$ Execution Unit Instruction Set Architecture SPARC-V9/JPS + HPC-ACE #FP-operations 8 (= 4 Floating Multiply and Add) per clock FPR + FUB Instruction control L1$ control <integer> <floating-point> ALU x2 SHIFT x2 MULT x1 DIVIDE x1 AGEN x2 FMA x4 (2SIMD) COMPARE x2 #Registers 188 (GPR) 32 (GUB) 256 (FPR) 48 x2 (FUB) L1$ L1I$ 32KB/2way L1D$ 32KB/2way Execution units L1I$ SPARC64™ VIIIfx 13 DIVIDE x2 VIS x1 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 SPARC64TM VII Pipeline Issue Dispatch Reg.-Read Execute Fetch (4stages) (4stages) (2stages) Memory Commit (L1$: 3(int)/4(fp)stages) (2stages) CSE 64Entry L1 I$ 64KB 2Way Decode & Issue RSA 10Entry GPR EAGA 16Entry EAGB Store Port 156Registers x2 EXA Branch Target Address 8Kentry 4Way RSE GUB 8x2Entry 48Registers RSF 8x2Entry FPR Fetch Port EXB PC x2 L1 D$ 64KB 2Way 8x2Entry Control Registers x2 Write Buffer 5x2Entry FLA 64Registers x2 FLB L2$ RSBR FUB 10Entry 48Registers 6MB 12Way System Bus Interface SPARC64™ VIIIfx 14 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 SPARC64TM VIIIfx Pipeline Fetch Issue Dispatch Reg.-Read Execute (4stages) (3stages) (4(int)/5(fp) stages) Commit Memory (L1$: 3(int)/4(fp)stages) (2stages) CSE 48Entry L1 I$ 32KB 2Way Decode & Issue RSA GPR 10Entry 188Registers EAGA 20Entry EAGB Store Port EXA Branch Target Address 1Kentry 2Way RSE GUB 10Entry 32Registers Fetch Port EXB PC L1 D$ 32KB 2Way 8Entry Control Registers Write Buffer 5Entry FLA RSF FPR 8x2Entry 256Registers FLB L2$ FLC RSBR FUB 6Entry 48x2Registers 5MB 10Way FLD Memory Controller SPARC64™ VIIIfx DIMM 15 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Changes in the pipeline Issue stage E Added “PD” stage to handle SXAR 6-instructions packed into 4 at “PD” An CSE entry is assigned at “D” The packed instructions have Extended register fields SIMD operation attributes Execution Stage 16 B SXAR C A’ B’ C’ D D’ Decode basic extended FPR FUB FPR FUB B2 X SPARC64™ VIIIfx A P B “B” stage has been split into two for flat Floating-point Register access SXAR PD D 4 packed (=6 unpacked) instructions can be committed at the same time. Read IB Bypass FLA/B FLC/D All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Performance Results – Core Execution Time Relative to SPARC64TM [email protected] SPARC64TM [email protected] Hardware measured results 1.4 Performance = x 4.3 1.2 Benchmark: Euroben + in-house Performance = x 6.8 Performance = x 1.4 1 0.8 0 commit due to - $/system - Execution unit 1 commit 2-3 commit 4 commit 0.6 0.4 0.2 0 VII VIIIfx (noSIMD) VIIIfx Loop6: y(i) = x1(i) / x2(i) VII VIIIfx (noSIMD) VIIIfx VII y(i) = sin(x1(i)) VIIIfx (noSIMD) VIIIfx Loop14: 9th Degr Polyn SPARC64TM VIIIfx realizes much higher core performance than SPARC64TM VII thanks to HPC-ACE despite its lower frequency SPARC64™ VIIIfx 17 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Performance Results – Chip Execution Time Relative to SPARC64TM [email protected] SPARC64TM [email protected] Simulated results 1.4 1.2 Performance = x 2.4 Benchmark: Scientific Application - Molecular Dynamics - Fluid Dynamics Performance = x 2.6 1 0.8 0 commit due to - $/system - Execution unit 1 commit 2-3 commit 4 commit 0.6 0.4 0.2 0 SPARC64VII SPARC64VIIIfx Molecular Dynamics SPARC64VII SPARC64VIIIfx Fluid Dynamics SPARC64TM VIIIfx shows about x 2.5 performance of SPARC64TM VII Stall time due to the execution unit busy has been reduced dramatically. Expect x 3 performance with further compiler optimization SPARC64™ VIIIfx 18 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 SPARC64™ History and Future Evolution rather than revolution Peak Performance & Power SPARC64TM V (1 core) • RAS • Single thread performance SPARC64TM VI (2 core x 2 VMT) • Throughput SPARC64TM VII (4 core x 2 SMT) • More Throughput • High Performance Computing SPARC64 V = 1.0 25 20 Performance (GF) =x3 15 10 Power (W) = x 0.5 5 0 V V+ VI SPARC64 SPARC64™ VIIIfx VII VIIIfx SPARC64TM VIIIfx (8 core) • High Performance Computing • Low Power • SoC 19 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 SPARC64TM VIIIfx Summary SPARC64TM VIIIfx has been designed to be used for Fujitsu’s supercomputer for the PETA-scale computing age. HPC-ACE instruction sets overcomes the limitation of conventional SPARC-V9 architecture. SPARC64TM VIIIfx has combined high performance and low power. SPARC64TM VIIIfx chip is up and running in the lab. Fujitsu will continue to develop SPARC64TM series to meet the needs of a new era. SPARC64™ VIIIfx 20 All Rights Reserved,Copyright© FUJITSU LIMITED 2009 Abbreviations • SPARC64TM VIIIfx – – – – – – – – – – IB: Instruction Buffer RSA: Reservation Station for Address generation RSE: Reservation Station for Execution RSF: Reservation Station for Floating-point RSBR: Reservation Station for Branch GUB: General Update Buffer FUB: Floating point Update Buffer GPR: General Purpose Register FPR: Floating Point Register CSE: Commit Stack Entry SPARC64™ VIIIfx 21 All Rights Reserved,Copyright© FUJITSU LIMITED 2009