Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
SPARC64™ VIIIfx: Fujitsu’s New
Generation Octo Core Processor
for PETA Scale computing
August 25, 2009
Takumi Maruyama
LSI Development Division
Next Generation Technical Computing Unit
Fujitsu Limited
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Fujitsu Processor Development
Tr=760M
CMOS Cu / 45nm
Features
HPC-ACE
System On Chip
SPARC64
™
SPARC64™
Processor
Processor
Hardware Barrier
High Performance
Technology
Multi-core / Multi-thread
L2$ on Die
Non-Blocking $
O-O-O Execution
Super-Scalar
Tr=190M
CMOS Cu
130nm
Tr=30M
CMOS Cu
180nm / 150nm SPARC64
High Reliability
Technology
SPARC64™ VIIIfx
Tr=400M
CMOS Cu
90nm
GS21
SPARC64
V
GS21
900
Tr=46M
CMOS Cu
180nm
GS8800B
GS8800
SPARC64
GS8600
1996
~1997
Mainframe
Mainframe
SPARC64
V+
GS8900
SPARC64
II
UNIX
UNIX
SPARC64
VII
SPARC64
VI
Tr=500M
CMOS Cu
90nm
GS21
600
Store Ahead
Branch History
Prefetch
~1995
Tr=540M
CMOS Cu
90nm
GP
Single-chip CPU
$ ECC
Register/ALU Parity
Instruction Retry
$ Dynamic Degradation
RC/RT/History
Tr=600M
CMOS Cu
65nm
HPC
HPC
SPARC64
VIIIfx
Tr=190M
CMOS Cu
130nm
Mainframe
Mainframe
Tr=30M
CMOS Al
250nm / 220nm
Tr=10M
CMOS Al
350nm
1998
~1999
2
2000
~2003
2004
~2007
2008~
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
SPARC64™ VIIIfx Design Target
Processor for Fujitsu’s supercomputer for the peta-scale
computing age, which realizes both high performance and
low power
Unachievable goal with conventional processor design
• More GF (Giga Flops)
• More Efficiency
Focus of this
presentation
• No more high frequency
Large ISA (Instruction Set Architecture)
extension is required
High Integration: SoC (System On Chip)
High Reliability
Reuse SPARC64TM VII design when applicable
SPARC64™ VIIIfx
3
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
HPC-ACE
(High Performance Computing - Arithmetic Computational Extensions)
ISA (Instruction Set Architecture) of SPARC64VIIIfx is:
Complies with
• The SPARC-V9 standard
• JPS (Joint Programmer’s Specification): Extension to SPARC-V9
HPC-ACE: Fujitsu’s unique ISA extension for HPC
• Large register sets
• SIMD (single instruction multiple data) instructions
• etc
You can download SPARC64TM VIIIfx ISA documents from
http://jp.fujitsu.com/solutions/hpc/brochures/
The SPARC® Architecture Manual Version 9
SPARC® Joint Programming Specification (JPS1): Commonality
SPARC64 VIIIfx Extensions
SPARC64™ VIIIfx
4
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Large register sets 1/2
Register enhancements from V9
160192 INT registers
32256 FP registers (DP)
• Lower 32 registers are the same
with SPARC-V9’s.
• FP registers are “flat”. Extended FP
registers can be accessed with
non-SIMD instruction as well.
INT Reg
Register
160 Window
FP Reg
V9
32
Ext.
32
V9
SIMD
(basic)
V9
Ext.
32
Ext.
32
224
SIMD
(extended)
Why needed
To extract more parallelism
currently limited by the number of
Arch registers.
Reduce spill/fill overhead
Loop0
Large parallelism
w/ loop unrolling
Loop0
Loop1
LoopN
Loop1
…
Limited parallelism
w/ O-O-O
SPARC64™ VIIIfx
5
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Large register sets 2/2
Instruction format for 256 FP registers
8 bit x 4 (3 read +1 write) register number fields are necessary for
FMA (Floating-point Multiply and Add) instruction.
But SPARC-V9 instruction length is limited (32bits – fixed)
Defined a new prefix instruction (SXAR) to specify upper-3bit of
register numbers of the following two instructions.
SXAR (Set XAR) instruction
XAR: Extended Arithmetic Register
• Set by the SXAR instruction
• Valid bit is cleared once the corresponding subsequent instruction
gets executed.
fv fsimd furd
31
furs1
furs2
furs3 sv ssimd surd
1615
surs1
surs2
First Upper Register Source-1 bits
surs3
0
SXAR
inst1
inst2
SXAR1: set XAR for subsequent one instruction.
SXAR2: set XAR for subsequent two instructions.
SPARC64™ VIIIfx
Upper-3bit x4
6
Lower-5bit x4
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
SIMD
16byte Load x2
SIMD FP operation
1 SIMD FP instruction executes two SP or
DP floating-point operations
extended
#0-31
#128-255
#32-127
SIMD Load/Store
1 SIMD load/store instruction accesses two
contiguous SP or DP data in memory
Data alignment requirement for SIMD-load
(DP) is 8 byte rather than usual 16byte.
Combine two independent FP operation into
one with the Flat FP registers
SXAR instruction
Specifies SIMD operation of the
subsequent two instructions.
No new opcode is required for SIMD
operation
More SW optimization possibility with
SIMD
SPARC64™ VIIIfx
basic
7
A-pipe
SIMD
B-pipe SIMD
C-pipe
D-pipe
Conventional SIMD usage
SIMD-Load (Src1)
SIMD-Load (Src2)
SIMD-FP
SIMD-Store (Dst)
Combine two FP ops with SIMD
Load (Src1-A)
Load (Src1-B)
Load (Src2-A)
Load (Src2-B)
SIMD-FP
Store (Dst-A)
Store (Dst-B)
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
FP Trigonometric Functions
Instructions for fast Trigonometric Function calculation
Taylor series approximation of sin (x)
~= x – 1/3!x3 + 1/5!x5 – 1/7!x7 + 1/9!x9 – 1/11x11 + 1/13x13 - 1/15x15
= x ((((((((0 – 1/15!)x2 + 1/13!)x2 – 1/11!)x2 + 1/9!)x2 – 1/7!)x2 + 1/5!)x2 – 1/3!)x2 + 1)
New instruction - ftrimaddd
Function: rs1 × abs(rs2) + T[index] → rd
Usage:
#S=0
ftrimaddd S, X2, 7, S
ftrimaddd S, X2, 6, S
ftrimaddd S, X2, 5, S
ftrimaddd S, X2 , 4, S
ftrimaddd S, X2 , 3, S
ftrimaddd S, X2 , 2, S
ftrimaddd S, X2 , 1, S
ftrimaddd S, X2 , 0, S
Fmuld
S, X, S
SPARC64™ VIIIfx
# S * X2 – 1/15! S
# S * X2 + 1/13! S
# S * X2 – 1/11! S
# S * X2 + 1/9! S
# S * X2 – 1/7! S
# S * X2 + 1/5! S
# S * X2 – 1/3! S
# S * X2 + 1/1! S
#S*XS
8
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Software controlled Cache
Give SW ability to control cache to optimize performance
while keeping cache coherency
Divide Cache into 2 groups (sectors)
SXAR instruction specifies the sector
• sector 0: Instruction fetch / normal operand access (default)
• sector 1: operand access explicitly specified by SXAR
Sector cache configuration register
Specifies the ratio of sector 0 and 1 at the same index
L2 Cache structure
Ex) Sector0:Sector1=30%:70%
HW Implementation
Keep sector info of each cache line.
Select the way to be replaced to meet the
sector ratio on cache misses
SW specifies a sector depending on
temporal and spatial locality of data
SPARC64™ VIIIfx
9
Way0
Way9
0 0 0 1 1 1 1 1 1 1
index
0 1 1 0 1 1 1 1 0 1
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Other HPC-ACE features
Conditional operation
SIMD
For efficient execution of Loop with ‘if’
Conditional branch can be removed with the
following conditional instructions.
for (i=0; i<00; i++) {
if (A(i)>0)
FPop X(i)
else
FPop Y(i)
}
FP Conditional Compare to Register
Move Selected FP-register on FP-register’s condition
Store FP-register on FP-register’s condition
basic extended
i=0 i=1
i=2 i=3
i=4 i=5
i=6 i=7
time
Compiler can optimize the loop with SW pipeline
FP Reciprocal Approximation of
Divide/Square-root
Calculate reciprocal approximation with
rounding error < 1/256
To achieve higher divide/square-root
performance with pipelined operation
FP Minimum and Maximum
SPARC64™ VIIIfx
10
Usage:
# %f2 / %f10 %f0
frcpad %f10, %f6
fmuld
%f2, %f6, %f2
fnmsubd %f6, %f10, 1.0, %f6
fmuld
%f6, %f6, %f0
fmaddd %f6, %f6, %f6, %f4
fmaddd %f0, %f0, %f6, %f0
fmaddd %f4, %f2, %f2, %f4
fmaddd %f0, %f4, %f2, %f0
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Integrated Multicore Parallel Architecture
SPARC64TM VIIIfx HW
HPC-ACE
Shared L2 cache to avoid false sharing More possibility of optimization:
Parallelize the innermost loop
Hardware barrier for fast inter-core
synchronization
Fujitsu’s compiler technology
SPARC64TM VIIIfx
Automatic parallelization
Vector
Conventional Scalar
DO J=1,N
V DO I=1,M
VA(J)=A(J)+A(I,J+1)*B(I,J)
V END
END
2
3 4 5
SPARC64™ VIIIfx
P
DO J=1,N
DO I=1,M
A(J)=A(J)+A(I,J+1)*B(I,J)
END
END Parallel
I
1
2
3
4
・・・
Time
1
Time
I=
P
J= 1 2 3 4 5
CORE0 CORE2
CORE1 CORE3 11
Parallel
J
1
2
3
4
・・・
Time
Vector
J
1
2
3
4
・・・
DO J=1,N
P DO I=1,M
P A(J)=A(J)+A(I,J+1)*B(I,J)
P END
END
I=
CORE0 CORE1 CORE2
SIMD basic extended
basic extended
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
SPARC64™ VIIIfx Chip Overview
• Architecture Features
•
•
•
•
HSIO
L2$ Data
Core4
MAC
MAC
Core6
L2$ Control
Core1
Core0
Core7
MAC
MAC
Core3
L2$ Data
DDR3 interface
DDR3 interface
Core5
8 cores
Shared 5 MB L2$
Embedded Memory Controller
2 GHz
• Fujitsu 45nm CMOS
• 22.7mm x 22.6mm
• 760M transistors
• 1271 signal pins
• Performance (peak)
Core2
• 128GFlops
• 64GB/s memory throughput
• Power
• 58W (TYP, 30℃)
• Water Cooling – Low leakage
power and High reliability
SPARC64™ VIIIfx
12
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
SPARC64™ VIIIfx Core spec
L1D$
Execution Unit
Instruction Set
Architecture
SPARC-V9/JPS
+ HPC-ACE
#FP-operations 8 (= 4 Floating Multiply and Add)
per clock
FPR + FUB
Instruction control
L1$
control
<integer>
<floating-point>
ALU x2
SHIFT x2
MULT x1
DIVIDE x1
AGEN x2
FMA x4 (2SIMD)
COMPARE x2
#Registers
188 (GPR)
32 (GUB)
256 (FPR)
48 x2 (FUB)
L1$
L1I$ 32KB/2way
L1D$ 32KB/2way
Execution units
L1I$
SPARC64™ VIIIfx
13
DIVIDE x2
VIS x1
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
SPARC64TM VII Pipeline
Issue Dispatch Reg.-Read Execute
Fetch
(4stages)
(4stages) (2stages)
Memory
Commit
(L1$: 3(int)/4(fp)stages) (2stages)
CSE
64Entry
L1 I$
64KB
2Way
Decode
& Issue
RSA
10Entry
GPR
EAGA
16Entry
EAGB
Store
Port
156Registers
x2
EXA
Branch
Target
Address
8Kentry
4Way
RSE
GUB
8x2Entry
48Registers
RSF
8x2Entry
FPR
Fetch
Port
EXB
PC
x2
L1 D$
64KB
2Way
8x2Entry
Control
Registers
x2
Write
Buffer
5x2Entry
FLA
64Registers
x2
FLB
L2$
RSBR
FUB
10Entry
48Registers
6MB
12Way
System Bus
Interface
SPARC64™ VIIIfx
14
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
SPARC64TM VIIIfx Pipeline
Fetch
Issue Dispatch Reg.-Read Execute
(4stages) (3stages)
(4(int)/5(fp) stages)
Commit
Memory
(L1$: 3(int)/4(fp)stages) (2stages)
CSE
48Entry
L1 I$
32KB
2Way
Decode
& Issue
RSA
GPR
10Entry
188Registers
EAGA
20Entry
EAGB
Store
Port
EXA
Branch
Target
Address
1Kentry
2Way
RSE
GUB
10Entry
32Registers
Fetch
Port
EXB
PC
L1 D$
32KB
2Way
8Entry
Control
Registers
Write
Buffer
5Entry
FLA
RSF
FPR
8x2Entry
256Registers
FLB
L2$
FLC
RSBR
FUB
6Entry
48x2Registers
5MB
10Way
FLD
Memory
Controller
SPARC64™ VIIIfx
DIMM
15
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Changes in the pipeline
Issue stage
E
Added “PD” stage to handle SXAR
6-instructions packed into 4 at “PD”
An CSE entry is assigned at “D”
The packed instructions have
Extended register fields
SIMD operation attributes
Execution Stage
16
B SXAR C
A’
B’
C’
D
D’
Decode
basic
extended
FPR FUB
FPR FUB
B2
X
SPARC64™ VIIIfx
A
P
B
“B” stage has been split into two for flat
Floating-point Register access
SXAR
PD
D
4 packed (=6 unpacked) instructions can
be committed at the same time.
Read IB
Bypass
FLA/B
FLC/D
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Performance Results – Core
Execution Time
Relative to SPARC64TM [email protected]
SPARC64TM [email protected]
Hardware measured results
1.4
Performance
= x 4.3
1.2
Benchmark:
Euroben + in-house
Performance
= x 6.8
Performance
= x 1.4
1
0.8
0 commit due to
- $/system
- Execution unit
1 commit
2-3 commit
4 commit
0.6
0.4
0.2
0
VII
VIIIfx
(noSIMD)
VIIIfx
Loop6: y(i) = x1(i) / x2(i)
VII
VIIIfx
(noSIMD)
VIIIfx
VII
y(i) = sin(x1(i))
VIIIfx
(noSIMD)
VIIIfx
Loop14: 9th Degr Polyn
SPARC64TM VIIIfx realizes much higher core performance than
SPARC64TM VII thanks to HPC-ACE despite its lower frequency
SPARC64™ VIIIfx
17
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Performance Results – Chip
Execution Time
Relative to SPARC64TM [email protected]
SPARC64TM [email protected]
Simulated results
1.4
1.2
Performance
= x 2.4
Benchmark:
Scientific Application
- Molecular Dynamics
- Fluid Dynamics
Performance
= x 2.6
1
0.8
0 commit due to
- $/system
- Execution unit
1 commit
2-3 commit
4 commit
0.6
0.4
0.2
0
SPARC64VII
SPARC64VIIIfx
Molecular Dynamics
SPARC64VII
SPARC64VIIIfx
Fluid Dynamics
SPARC64TM VIIIfx shows about x 2.5 performance of SPARC64TM VII
Stall time due to the execution unit busy has been reduced dramatically.
Expect x 3 performance with further compiler optimization
SPARC64™ VIIIfx
18
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
SPARC64™ History and Future
Evolution rather than revolution
Peak Performance & Power
SPARC64TM V (1 core)
• RAS
• Single thread performance
SPARC64TM VI (2 core x 2 VMT)
• Throughput
SPARC64TM VII (4 core x 2 SMT)
• More Throughput
• High Performance Computing
SPARC64 V = 1.0
25
20
Performance (GF)
=x3
15
10
Power (W)
= x 0.5
5
0
V
V+
VI
SPARC64
SPARC64™ VIIIfx
VII
VIIIfx
SPARC64TM VIIIfx (8 core)
• High Performance Computing
• Low Power
• SoC
19
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
SPARC64TM VIIIfx Summary
SPARC64TM VIIIfx has been designed to be used for
Fujitsu’s supercomputer for the PETA-scale computing age.
HPC-ACE instruction sets overcomes the limitation of
conventional SPARC-V9 architecture.
SPARC64TM VIIIfx has combined high performance and low
power.
SPARC64TM VIIIfx chip is up and running in the lab.
Fujitsu will continue to develop SPARC64TM series to meet
the needs of a new era.
SPARC64™ VIIIfx
20
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Abbreviations
• SPARC64TM VIIIfx
–
–
–
–
–
–
–
–
–
–
IB:
Instruction Buffer
RSA: Reservation Station for Address generation
RSE: Reservation Station for Execution
RSF: Reservation Station for Floating-point
RSBR: Reservation Station for Branch
GUB: General Update Buffer
FUB: Floating point Update Buffer
GPR: General Purpose Register
FPR: Floating Point Register
CSE: Commit Stack Entry
SPARC64™ VIIIfx
21
All Rights Reserved,Copyright© FUJITSU LIMITED 2009