Download VIIIfx - Fujitsu

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
SPARC64™ VIIIfx: Fujitsu’s New
Generation Octo Core Processor
for PETA Scale computing
August 25, 2009
Takumi Maruyama
LSI Development Division
Next Generation Technical Computing Unit
Fujitsu Limited
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Fujitsu Processor Development
Tr=760M
CMOS Cu / 45nm
Features
HPC-ACE
System On Chip
SPARC64
™
SPARC64™
Processor
Processor
Hardware Barrier
High Performance
Technology
Multi-core / Multi-thread
L2$ on Die
Non-Blocking $
O-O-O Execution
Super-Scalar
Tr=190M
CMOS Cu
130nm
Tr=30M
CMOS Cu
180nm / 150nm SPARC64
High Reliability
Technology
SPARC64™ VIIIfx
Tr=400M
CMOS Cu
90nm
GS21
SPARC64
V
GS21
900
Tr=46M
CMOS Cu
180nm
GS8800B
GS8800
SPARC64
GS8600
1996
~1997
Mainframe
Mainframe
SPARC64
V+
GS8900
SPARC64
II
UNIX
UNIX
SPARC64
VII
SPARC64
VI
Tr=500M
CMOS Cu
90nm
GS21
600
Store Ahead
Branch History
Prefetch
~1995
Tr=540M
CMOS Cu
90nm
GP
Single-chip CPU
$ ECC
Register/ALU Parity
Instruction Retry
$ Dynamic Degradation
RC/RT/History
Tr=600M
CMOS Cu
65nm
HPC
HPC
SPARC64
VIIIfx
Tr=190M
CMOS Cu
130nm
Mainframe
Mainframe
Tr=30M
CMOS Al
250nm / 220nm
Tr=10M
CMOS Al
350nm
1998
~1999
2
2000
~2003
2004
~2007
2008~
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
SPARC64™ VIIIfx Design Target
 Processor for Fujitsu’s supercomputer for the peta-scale
computing age, which realizes both high performance and
low power
 Unachievable goal with conventional processor design
• More GF (Giga Flops)
• More Efficiency
Focus of this
presentation
• No more high frequency
Large ISA (Instruction Set Architecture)
extension is required
 High Integration: SoC (System On Chip)
 High Reliability
 Reuse SPARC64TM VII design when applicable
SPARC64™ VIIIfx
3
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
HPC-ACE
(High Performance Computing - Arithmetic Computational Extensions)
 ISA (Instruction Set Architecture) of SPARC64VIIIfx is:
 Complies with
• The SPARC-V9 standard
• JPS (Joint Programmer’s Specification): Extension to SPARC-V9
 HPC-ACE: Fujitsu’s unique ISA extension for HPC
• Large register sets
• SIMD (single instruction multiple data) instructions
• etc
 You can download SPARC64TM VIIIfx ISA documents from
http://jp.fujitsu.com/solutions/hpc/brochures/
 The SPARC® Architecture Manual Version 9
 SPARC® Joint Programming Specification (JPS1): Commonality
 SPARC64 VIIIfx Extensions
SPARC64™ VIIIfx
4
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Large register sets 1/2
 Register enhancements from V9
 160192 INT registers
 32256 FP registers (DP)
• Lower 32 registers are the same
with SPARC-V9’s.
• FP registers are “flat”. Extended FP
registers can be accessed with
non-SIMD instruction as well.
INT Reg
Register
160 Window
FP Reg
V9
32
Ext.
32
V9
SIMD
(basic)
V9
Ext.
32
Ext.
32
224
SIMD
(extended)
 Why needed
 To extract more parallelism
currently limited by the number of
Arch registers.
 Reduce spill/fill overhead
Loop0
Large parallelism
w/ loop unrolling
Loop0
Loop1
LoopN
Loop1
…
Limited parallelism
w/ O-O-O
SPARC64™ VIIIfx
5
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Large register sets 2/2
 Instruction format for 256 FP registers
 8 bit x 4 (3 read +1 write) register number fields are necessary for
FMA (Floating-point Multiply and Add) instruction.
 But SPARC-V9 instruction length is limited (32bits – fixed)
 Defined a new prefix instruction (SXAR) to specify upper-3bit of
register numbers of the following two instructions.
SXAR (Set XAR) instruction
 XAR: Extended Arithmetic Register
• Set by the SXAR instruction
• Valid bit is cleared once the corresponding subsequent instruction
gets executed.
fv fsimd furd
31
furs1
furs2
furs3 sv ssimd surd
1615
surs1
surs2
First Upper Register Source-1 bits
surs3
0
SXAR
inst1
inst2
 SXAR1: set XAR for subsequent one instruction.
 SXAR2: set XAR for subsequent two instructions.
SPARC64™ VIIIfx
Upper-3bit x4
6
Lower-5bit x4
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
SIMD
16byte Load x2
 SIMD FP operation
 1 SIMD FP instruction executes two SP or
DP floating-point operations
extended
#0-31
#128-255
#32-127
 SIMD Load/Store
 1 SIMD load/store instruction accesses two
contiguous SP or DP data in memory
 Data alignment requirement for SIMD-load
(DP) is 8 byte rather than usual 16byte.
 Combine two independent FP operation into
one with the Flat FP registers
 SXAR instruction
 Specifies SIMD operation of the
subsequent two instructions.
 No new opcode is required for SIMD
operation
 More SW optimization possibility with
SIMD
SPARC64™ VIIIfx
basic
7
A-pipe
SIMD
B-pipe SIMD
C-pipe
D-pipe
Conventional SIMD usage
SIMD-Load (Src1)
SIMD-Load (Src2)
SIMD-FP
SIMD-Store (Dst)
Combine two FP ops with SIMD
Load (Src1-A)
Load (Src1-B)
Load (Src2-A)
Load (Src2-B)
SIMD-FP
Store (Dst-A)
Store (Dst-B)
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
FP Trigonometric Functions
 Instructions for fast Trigonometric Function calculation
Taylor series approximation of sin (x)
~= x – 1/3!x3 + 1/5!x5 – 1/7!x7 + 1/9!x9 – 1/11x11 + 1/13x13 - 1/15x15
= x ((((((((0 – 1/15!)x2 + 1/13!)x2 – 1/11!)x2 + 1/9!)x2 – 1/7!)x2 + 1/5!)x2 – 1/3!)x2 + 1)
New instruction - ftrimaddd
 Function: rs1 × abs(rs2) + T[index] → rd
 Usage:
#S=0
ftrimaddd S, X2, 7, S
ftrimaddd S, X2, 6, S
ftrimaddd S, X2, 5, S
ftrimaddd S, X2 , 4, S
ftrimaddd S, X2 , 3, S
ftrimaddd S, X2 , 2, S
ftrimaddd S, X2 , 1, S
ftrimaddd S, X2 , 0, S
Fmuld
S, X, S
SPARC64™ VIIIfx
# S * X2 – 1/15!  S
# S * X2 + 1/13!  S
# S * X2 – 1/11!  S
# S * X2 + 1/9!  S
# S * X2 – 1/7!  S
# S * X2 + 1/5!  S
# S * X2 – 1/3!  S
# S * X2 + 1/1!  S
#S*XS
8
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Software controlled Cache
 Give SW ability to control cache to optimize performance
while keeping cache coherency
 Divide Cache into 2 groups (sectors)
 SXAR instruction specifies the sector
• sector 0: Instruction fetch / normal operand access (default)
• sector 1: operand access explicitly specified by SXAR
 Sector cache configuration register
Specifies the ratio of sector 0 and 1 at the same index
L2 Cache structure
Ex) Sector0:Sector1=30%:70%
 HW Implementation
 Keep sector info of each cache line.
 Select the way to be replaced to meet the
sector ratio on cache misses
SW specifies a sector depending on
temporal and spatial locality of data
SPARC64™ VIIIfx
9
Way0
Way9
0 0 0 1 1 1 1 1 1 1
index
0 1 1 0 1 1 1 1 0 1
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Other HPC-ACE features
 Conditional operation
SIMD
 For efficient execution of Loop with ‘if’
 Conditional branch can be removed with the
following conditional instructions.
for (i=0; i<00; i++) {
if (A(i)>0)
FPop X(i)
else
FPop Y(i)
}
 FP Conditional Compare to Register
 Move Selected FP-register on FP-register’s condition
 Store FP-register on FP-register’s condition
basic extended
i=0 i=1
i=2 i=3
i=4 i=5
i=6 i=7
time
Compiler can optimize the loop with SW pipeline
 FP Reciprocal Approximation of
Divide/Square-root
 Calculate reciprocal approximation with
rounding error < 1/256
 To achieve higher divide/square-root
performance with pipelined operation
 FP Minimum and Maximum
SPARC64™ VIIIfx
10
Usage:
# %f2 / %f10  %f0
frcpad %f10, %f6
fmuld
%f2, %f6, %f2
fnmsubd %f6, %f10, 1.0, %f6
fmuld
%f6, %f6, %f0
fmaddd %f6, %f6, %f6, %f4
fmaddd %f0, %f0, %f6, %f0
fmaddd %f4, %f2, %f2, %f4
fmaddd %f0, %f4, %f2, %f0
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Integrated Multicore Parallel Architecture
 SPARC64TM VIIIfx HW
 HPC-ACE
 Shared L2 cache to avoid false sharing  More possibility of optimization:
Parallelize the innermost loop
 Hardware barrier for fast inter-core
synchronization
 Fujitsu’s compiler technology
SPARC64TM VIIIfx
 Automatic parallelization
Vector
Conventional Scalar
DO J=1,N
V DO I=1,M
VA(J)=A(J)+A(I,J+1)*B(I,J)
V END
END
2
3 4 5
SPARC64™ VIIIfx
P
DO J=1,N
DO I=1,M
A(J)=A(J)+A(I,J+1)*B(I,J)
END
END Parallel
I
1
2
3
4
・・・
Time
1
Time
I=
P
J= 1 2 3 4 5
CORE0 CORE2
CORE1 CORE3 11
Parallel
J
1
2
3
4
・・・
Time
Vector
J
1
2
3
4
・・・
DO J=1,N
P DO I=1,M
P A(J)=A(J)+A(I,J+1)*B(I,J)
P END
END
I=
CORE0 CORE1 CORE2
SIMD basic extended
basic extended
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
SPARC64™ VIIIfx Chip Overview
• Architecture Features
•
•
•
•
HSIO
L2$ Data
Core4
MAC
MAC
Core6
L2$ Control
Core1
Core0
Core7
MAC
MAC
Core3
L2$ Data
DDR3 interface
DDR3 interface
Core5
8 cores
Shared 5 MB L2$
Embedded Memory Controller
2 GHz
• Fujitsu 45nm CMOS
• 22.7mm x 22.6mm
• 760M transistors
• 1271 signal pins
• Performance (peak)
Core2
• 128GFlops
• 64GB/s memory throughput
• Power
• 58W (TYP, 30℃)
• Water Cooling – Low leakage
power and High reliability
SPARC64™ VIIIfx
12
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
SPARC64™ VIIIfx Core spec
L1D$
Execution Unit
Instruction Set
Architecture
SPARC-V9/JPS
+ HPC-ACE
#FP-operations 8 (= 4 Floating Multiply and Add)
per clock
FPR + FUB
Instruction control
L1$
control
<integer>
<floating-point>
ALU x2
SHIFT x2
MULT x1
DIVIDE x1
AGEN x2
FMA x4 (2SIMD)
COMPARE x2
#Registers
188 (GPR)
32 (GUB)
256 (FPR)
48 x2 (FUB)
L1$
L1I$ 32KB/2way
L1D$ 32KB/2way
Execution units
L1I$
SPARC64™ VIIIfx
13
DIVIDE x2
VIS x1
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
SPARC64TM VII Pipeline
Issue Dispatch Reg.-Read Execute
Fetch
(4stages)
(4stages) (2stages)
Memory
Commit
(L1$: 3(int)/4(fp)stages) (2stages)
CSE
64Entry
L1 I$
64KB
2Way
Decode
& Issue
RSA
10Entry
GPR
EAGA
16Entry
EAGB
Store
Port
156Registers
x2
EXA
Branch
Target
Address
8Kentry
4Way
RSE
GUB
8x2Entry
48Registers
RSF
8x2Entry
FPR
Fetch
Port
EXB
PC
x2
L1 D$
64KB
2Way
8x2Entry
Control
Registers
x2
Write
Buffer
5x2Entry
FLA
64Registers
x2
FLB
L2$
RSBR
FUB
10Entry
48Registers
6MB
12Way
System Bus
Interface
SPARC64™ VIIIfx
14
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
SPARC64TM VIIIfx Pipeline
Fetch
Issue Dispatch Reg.-Read Execute
(4stages) (3stages)
(4(int)/5(fp) stages)
Commit
Memory
(L1$: 3(int)/4(fp)stages) (2stages)
CSE
48Entry
L1 I$
32KB
2Way
Decode
& Issue
RSA
GPR
10Entry
188Registers
EAGA
20Entry
EAGB
Store
Port
EXA
Branch
Target
Address
1Kentry
2Way
RSE
GUB
10Entry
32Registers
Fetch
Port
EXB
PC
L1 D$
32KB
2Way
8Entry
Control
Registers
Write
Buffer
5Entry
FLA
RSF
FPR
8x2Entry
256Registers
FLB
L2$
FLC
RSBR
FUB
6Entry
48x2Registers
5MB
10Way
FLD
Memory
Controller
SPARC64™ VIIIfx
DIMM
15
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Changes in the pipeline
 Issue stage




E
Added “PD” stage to handle SXAR
6-instructions packed into 4 at “PD”
An CSE entry is assigned at “D”
The packed instructions have
 Extended register fields
 SIMD operation attributes
 Execution Stage
16
B SXAR C
A’
B’
C’
D
D’
Decode
basic
extended
FPR FUB
FPR FUB
B2
X
SPARC64™ VIIIfx
A
P
B
 “B” stage has been split into two for flat
Floating-point Register access
SXAR
PD
D
 4 packed (=6 unpacked) instructions can
be committed at the same time.
Read IB
Bypass
FLA/B
FLC/D
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Performance Results – Core
Execution Time
Relative to SPARC64TM [email protected]
SPARC64TM [email protected]
Hardware measured results
1.4
Performance
= x 4.3
1.2
Benchmark:
Euroben + in-house
Performance
= x 6.8
Performance
= x 1.4
1
0.8
0 commit due to
- $/system
- Execution unit
1 commit
2-3 commit
4 commit
0.6
0.4
0.2
0
VII
VIIIfx
(noSIMD)
VIIIfx
Loop6: y(i) = x1(i) / x2(i)
VII
VIIIfx
(noSIMD)
VIIIfx
VII
y(i) = sin(x1(i))
VIIIfx
(noSIMD)
VIIIfx
Loop14: 9th Degr Polyn
 SPARC64TM VIIIfx realizes much higher core performance than
SPARC64TM VII thanks to HPC-ACE despite its lower frequency
SPARC64™ VIIIfx
17
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Performance Results – Chip
Execution Time
Relative to SPARC64TM [email protected]
SPARC64TM [email protected]
Simulated results
1.4
1.2
Performance
= x 2.4
Benchmark:
Scientific Application
- Molecular Dynamics
- Fluid Dynamics
Performance
= x 2.6
1
0.8
0 commit due to
- $/system
- Execution unit
1 commit
2-3 commit
4 commit
0.6
0.4
0.2
0
SPARC64VII
SPARC64VIIIfx
Molecular Dynamics
SPARC64VII
SPARC64VIIIfx
Fluid Dynamics
 SPARC64TM VIIIfx shows about x 2.5 performance of SPARC64TM VII

Stall time due to the execution unit busy has been reduced dramatically.
 Expect x 3 performance with further compiler optimization
SPARC64™ VIIIfx
18
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
SPARC64™ History and Future
Evolution rather than revolution
Peak Performance & Power
 SPARC64TM V (1 core)
• RAS
• Single thread performance
 SPARC64TM VI (2 core x 2 VMT)
• Throughput
 SPARC64TM VII (4 core x 2 SMT)
• More Throughput
• High Performance Computing
SPARC64 V = 1.0
25
20
Performance (GF)
=x3
15
10
Power (W)
= x 0.5
5
0
V
V+
VI
SPARC64
SPARC64™ VIIIfx
VII
VIIIfx
SPARC64TM VIIIfx (8 core)
• High Performance Computing
• Low Power
• SoC
19
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
SPARC64TM VIIIfx Summary
 SPARC64TM VIIIfx has been designed to be used for
Fujitsu’s supercomputer for the PETA-scale computing age.
 HPC-ACE instruction sets overcomes the limitation of
conventional SPARC-V9 architecture.
 SPARC64TM VIIIfx has combined high performance and low
power.
 SPARC64TM VIIIfx chip is up and running in the lab.
 Fujitsu will continue to develop SPARC64TM series to meet
the needs of a new era.
SPARC64™ VIIIfx
20
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
Abbreviations
• SPARC64TM VIIIfx
–
–
–
–
–
–
–
–
–
–
IB:
Instruction Buffer
RSA: Reservation Station for Address generation
RSE: Reservation Station for Execution
RSF: Reservation Station for Floating-point
RSBR: Reservation Station for Branch
GUB: General Update Buffer
FUB: Floating point Update Buffer
GPR: General Purpose Register
FPR: Floating Point Register
CSE: Commit Stack Entry
SPARC64™ VIIIfx
21
All Rights Reserved,Copyright© FUJITSU LIMITED 2009