Exascale Computing
July 1, 2016
Sung Bae Park
ICR (ISAC CPU Research)
Revisions: 2016-06-29, 2016-06-30, 2016-07-01
Outline
• Trend
• Direction
• Future
Trend
IT Waves
• Contents & Service: Rich, High-Quality Ubiquitous Web & Real-Life VR
• Network & Device: Variety of Topology & Network Infra → 5G
[Figure: four technology waves plotted over 1980-2017]
- Multimedia: 1D → 2D → Color → FHD → 3D → Stereoscopic → Multi-View → Wide MV → Super MV → Hologram (VR)
- Software: Standalone → Web 1.0 (PC) → Web 2.0 (Mobile Web) → Web 3.0 (Smart Ubiquitous Web) → Web 4.0 (Smart Real-World Web), with TV, NotePad, IoT, and Big Data along the way
- Network: AMPS (1G, Analog) → GSM (2G) → CDMA/HSPA (3G) → OFDM/MIMO (4G Phone) → Post-OFDM/UPN*/AI (5G)
- Computing & Memory: Single Core → Multi-Core CPU → Multi CPU/DSP → GPGPU* → Massive Core / HPP*, plus 3D Printer and ADV
* UPN: ULP Personal Networking
* GPGPU: General-Purpose GPU
* HPP: Hybrid Parallel Processor
Exascale Computing
• More than Moore — Moore's law gives ~100x in 10 years (2x/18 months); exascale needs 1000x in 10 years (2x/12 months)
※ 1 GFLOPS ('00 CPU) → 1 TFLOPS ('10 GPU) → 1 PFLOPS ('20) → 1 EFLOPS ('30)
• Crisis in Power, Efficiency and SW
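The two doubling rates above differ more than they look; a quick check of the compounding over the slide's 10-year window:

```python
# Compound growth implied by a doubling period: factor = 2 ** (months / period)
def growth(months, doubling_period_months):
    return 2 ** (months / doubling_period_months)

decade = 120  # months
moore = growth(decade, 18)             # Moore's law: 2x every 18 months
more_than_moore = growth(decade, 12)   # exascale pace: 2x every 12 months

print(round(moore))            # 102, i.e. the slide's "100x in 10 years"
print(round(more_than_moore))  # 1024, i.e. "1000x for 10 years"
```

Shaving the doubling period from 18 to 12 months turns a 100x decade into a 1000x decade, which is exactly the gap between the historical GFLOPS→TFLOPS step and the targeted PFLOPS→EFLOPS step.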
[Figure: peak FLOPS vs. year, 1950-2020 — ENIAC → i4004 → NEC SX-3 → DEC 21264 → CELL → GTX580 → IBM Roadrunner → RIKEN K → China Sunway TaihuLight; eras labeled Supercomputer 1.0 (CPU) and Supercomputer 2.0 (GPU)]

Processor        | SW 26010        | Xeon E5-2600 | P100
Architecture     | CPU             | CPU          | GPU
Company          | Shanghai HPICDC | Intel        | Nvidia
Year             | 2016            | 2015         | 2015
Speed            | 1.45 GHz        | 3.7 GHz      | 1.5 GHz
# of Cores/Chip  | 260             | 18           | 3584
FLOPS            | 2.6T            | 1T           | 10T

※ Overloaded Pipeline: CPU IPC peaked in 1998 with the DEC 21264 — Out-of-Order execution, Non-Blocking Queues, and Precise Branch Prediction are no longer effective for IPC (64 FPU at 0.5% Si budget; 256 x 64b, 2KB RF). The Massive-Thread GPU instead devotes 5% of its Si budget to 1024 FPUs with a 640K x 32b, 2MB RF.
Direction
Crisis in Speed & Power
• 180nm 1 GHz Samsung/DEC EV6 CPU in 1999, yet only 14nm 4 GHz in 2016!
• No power-efficiency improvement with Si scaling, due to unscalable VDD → data-center power & cost crisis
Crisis in Computing Efficiency
• Massive Cores on a Chip
※ Si scaling enables integration-unit changes: TR → Gate/Cell Array → Core Array
• More cores on a chip, less efficiency of computing (PPA, memory)
[Figure: processor spectrum, from specific to general applications]
- Specific Applications → Special Purpose Processor:
  - DSP: Digital Signal Processor
  - GPU: Graphics Processor
  - NPU: Network Processor
  - NMP: Neuromorphic Processor
- Dedicated Hardware (Domain-Specific ISA; Cost):
  - FPGA
  - Reconfigurable Systems
- Massive Cores (Reconfigurable Processor / Programmable Hardware / Array Processor):
  - Massive Cores on a Chip: Massive CPU/GPU/DSP/HWs
  - Run-Time-Reconfigurable
- Homogeneous Many Cores (Compiler): Chip-Multiprocessor
- General Applications → General Purpose Processor (Cost):
  - PC/Server CPU
  - Mobile CPU
- General Purpose Controller (API): MCU
Crisis in SW
• Direct API for Minimum Memory Transfer
• EPIC begging for the CPU-like GPU for Differentiation and Productivity
Programmable GPU: Market Begging

Workload (relative time) | 1-Core CPU | 500-Core GPU | 500-Core FPPA
Parallel                 | 200        | 5.4          | < 6
Sequential               | 200        | 750          | < 220

Easier HW beats Faster HW!
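Treating the table's entries as relative execution times (lower is better), a hypothetical composite workload — one parallel phase followed by one sequential phase, a composition assumed here rather than stated on the slide — shows why the punchline holds:

```python
# Relative execution times from the slide (smaller is faster).
# The FPPA entries are upper bounds ("<6", "<220"); the bounds themselves
# are used here, so its totals are also upper bounds.
times = {
    "1-core CPU":    (200, 200),   # (parallel part, sequential part)
    "500-core GPU":  (5.4, 750),
    "500-core FPPA": (6,   220),
}

# Composite workload: one parallel phase then one sequential phase;
# total time is just the sum of the two entries.
for name, (par, seq) in times.items():
    print(f"{name}: {par + seq}")
```

The CPU wins the sequential phase and the GPU wins the parallel phase, yet the FPPA — merely decent at both — finishes the combined job fastest (≤226 vs. 400 and 755.4).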
Future
Key Enabler for Next Wave
• x86 CPU: Highly Programmable & High Performance, but Power & Price
• ARM+HW SoC: Low Power & Price, Medium Performance, but Programmability
• Massive Cores: Highly Programmable & High Performance, Low Power & Price
[Figure: market waves, 1980-2017 — PC CPU ($50B) → HW SoC ($100B) → Smart SoC ($250B), with an inflection point at each transition]

IBM PC on Intel x86 CPU
- Drivers: x86 binary compatibility; mass infra for IHV/ISV; high performance (3-4 GHz, 6-24 cores)
- Obstacles: power ~100W; price ~$100; memory bottleneck

Nokia Phone on ARM CPU
- Drivers: low-power, low-price CPU; dedicated HW IPs for low-power, low-price data; ARM mass infra for IHV/ISV
- Obstacles: HW IPs have no programmability; CPU/DSP cost x10 in power and price vs. HW IPs, plus the memory bottleneck

Creative Consumer on Massive Cores
- Drivers: extreme P4 (Price, Power, Performance & Programmability) enabled by Massive-Core-based RTR* FPPA; on-chip dynamic compiler; on-chip kernel; system SW such as GCD and MapReduce
* Run-Time-Reconfigurable
Smart SoC based on RTR FPPA
• Run-Time-Reconfigurable Heterogeneous Field Programmable Processor Array
• Run-Time-Reconfigurable 2D/3D Vector-Accessible Memories
Fine-Thread CPU
• x86 / ARM
• On-Chip Dynamic Compiler
• On-Chip Kernel
Mid-Thread DSP
• SIMD / Vector
Massive-Thread GPU
• SIMT
X-Y Stack Register File
• Multi-GHz MB Wide IO
Reconfigurable Buses
• Low Swing Wide I/O
Reconfigurable Memories
• Multi-GHz GB Wide I/O
FPGA for Special IP & I/O
• HDMI, SerDes, ...
Design Methodology
• Structured Custom to SoC
• PM: PG/CG w/ DVFS
Tool Chains
• Integrated Compiler
• System Simulator
Seamless Platform
• Open OS to Std. Drivers
• OpenCL, MPI, GCD
• Total Solutions
Device
• 0.4V 3mA 1pA @14nm
Analog IPs
• Low Swing Bus Drv/Rec
• High-Q PLLs
Package
• 3D Integration
100mW 1TFLOPS for Exascale Computing in 2020
• More than Moore: Challenge to HW-ASIC-level Massive Cores
• Exa Flop @230KW — 1/10,000 power revolution in 10 years: 1/100 from scaling, an additional 1/100 from innovation in HW such as Run-Time-Reconfigurable Computing; Si technology for 0.1V ELV devices & circuits
• Exa Flop @70MW — 1/30 power efficiency in 5 years: 1/3 from computing, 1/10 from scaling; Multi-GHz Multi-GB reconfigurable memory with 2D/3D vectored access; Exa-byte/sec 3D integration
• Peta Flop @2.3MW
[Figure: FLOPS (1G-100T) vs. power (0.1-1000 Watts), 2005-2020, with Mobile CPU, PC CPU, GPGPU, and HW ASIC trend lines; Moore's Law: x2 / 18 months (x10 / 5 years)]
Acknowledgements
The author would like to thank Dan Dobberpuhl (Founder of SiByte, PASemi),
David Ditzel (Founder of Transmeta), Jim Keller (DEC EV6 Chief Architect),
Anantha Chandrakasan (MIT), Dimitri Antoniadis (MIT), Li-Shiuan Peh (MIT),
Shekhar Borkar (Intel Fellow), Le Nguyen (Founder of AIT),
Peter Song (Founder of Montalvo Systems), and Derek Lentz (GPU Architect) for their
valuable comments and advice, which enabled this presentation.
Appendix
Movie Quality Virtual Reality
CPU-like (Programmable, Random, Dynamic) — fully SW pipeline:
- Procedural Primitives
- Traced Deep Shadow
- Physically Plausible Shader
- Organized Point Clouds
- Procedural PQ Illumination, PQ Shader
- Conditional RI Evaluation
- Facevarying Class Specifier
- Ambient Occlusion
- True RiSphere Primitives
- Blobby Implicit Surfaces
GPU + HW (Fixed) → GPGPU-like:
- OpenGL/D3D API
- Polygon Rasterization Modeling
Fixed → Programmable, Regular → Random, Static → Dynamic
Live Computer Vision
[Figure: wave image and Kanizsa Triangle illusion]
0.4V 3.5mA/um MOSFET: Diffusion to Ballistic
Intel 14nm FinFET, 2014 IEDM
Peking Univ. 9nm DG FET, 2015 IEEE EDSSC
0.4V 134W 14nm 42GHz A-CPU

[Table: carrier-transport speed model — v = uE, capped at v @ Ecrit = 3.00E+07 cm/s (Ecrit = 3.00E+04 V/cm); transit time t = L/v; "Design Up" is the speedup from circuit techniques beyond the transport model]

                 | i4004   | Pentium II | EV6     | Skylake | A-CPU (target)
L                | 10 um   | 250 nm     | 180 nm  | 14 nm   | 14 nm
VDD              | 15 V    | 2 V        | 1.65 V  | 1 V     | 0.4 V
Expected speed   | 750 KHz | 100 MHz    | 140 MHz | 1.8 GHz | 1.8 GHz
Real speed       | 750 KHz | 200 MHz    | 1 GHz   | 4.2 GHz | 42 GHz
Design up        | 1.0     | 2.0        | 7.21    | 2.35    |

[Table: switching power P = f*C*V^2 and gate delay t = CV/I, with Ion in mA/um; final vs. practical targets]

                 | i4004   | Pentium II | EV6     | Skylake | A-CPU
Power (fCV^2)    | 0.5 W   | 40 W       | 80 W    | 80 W    | 134 W
W / mm^2         | 0.04    | 0.2        | 0.67    | 0.66    | 1.1
C total          | 3 nF    | 50 nF      | 30 nF   | 20 nF   | 20 nF
C / mm^2         | 0.25 nF | 0.25 nF    | 0.25 nF | 0.16 nF | 0.16 nF
VDD              | 15 V    | 2 V        | 1.65 V  | 1 V     | 0.4 V
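The power column follows the slide's switching-power model P = f·C·V²; plugging in each chip's clock, total switched capacitance, and supply voltage reproduces the listed values (a sanity check of the table, not new data — the slide rounds Skylake's 84 W to 80 W):

```python
# Dynamic switching power: P = f * C * V^2 (f in Hz, C in farads, V in volts).
def switching_power(f_hz, c_farads, vdd):
    return f_hz * c_farads * vdd ** 2

chips = {  # name: (clock, total switched C, VDD), as read from the table
    "i4004":      (750e3,  3e-9, 15),    # ~0.5 W
    "Pentium II": (200e6, 50e-9, 2),     # 40 W
    "EV6":        (1e9,   30e-9, 1.65),  # ~82 W
    "Skylake":    (4.2e9, 20e-9, 1),     # 84 W (slide: 80 W)
    "A-CPU":      (42e9,  20e-9, 0.4),   # ~134 W at only 0.4 V
}
for name, (f, c, v) in chips.items():
    print(f"{name}: {switching_power(f, c, v):.1f} W")
```

The V² term is the whole argument of the slide: at 0.4 V the A-CPU can clock 10x faster than Skylake (42 GHz vs. 4.2 GHz) for roughly the same power class, because the 6.25x drop in V² pays for the frequency.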
0.2V 1mA/um MOSFET: Ballistic to Tunneling
Peking Univ. 9nm DG FET, 2015 IEEE EDSSC
Chenming Hu, 40nm, 2008 VLSI-TSA
FQHE: Zero Resistance → Zero Power
• Fractional Quantum Hall Effect @ certain magnetic field
• "Sharp resonance" as impedance matching and/or superconductor
In 1980, Klaus von Klitzing [103] found that at temperatures of only a
few Kelvin and high magnetic field (3-10 Tesla), the Hall resistance did
not vary linearly with the field. Instead, he found that it varied in a
stepwise fashion. It was also found that where the Hall resistance was flat,
the longitudinal resistance disappeared. This dissipation-free transport
looked very similar to superconductivity. The field at which the
plateaus appeared, or where the longitudinal resistance vanished,
quite surprisingly, was independent of the material, temperature, or
other variables of the experiment, but only depended on a
combination of fundamental constants, ħ/e². The quantization of
resistivity seen in these early experiments came as a grand surprise and
would lead to a new international standard of resistivity, the klitzing,
defined as the Hall resistance measured at the fourth step.
By 1982, semiconductor technology had greatly advanced and it became
possible to produce interfaces of much higher quality than were
available only a few years before. That same year, Horst Stormer and
Dan Tsui [105] repeated Klitzing's earlier experiments with much cleaner
samples and higher magnetic fields. What they found was the same
stepwise behavior as seen previously, but to everyone's surprise, steps
also appeared at fractional filling factors ν = 1/3, 1/5, 2/5... Strongly
correlated systems are notoriously difficult to understand, but in 1983,
Robert Laughlin [106] proposed his now-celebrated ansatz for a
variational wavefunction which contained no free parameters.
[Cooper Pairs to Molecules, J. N. Milstein]
PPA Crisis: Learn from Dedicated HW IP
“Years of research in low-power embedded computing have shown
only one design technique to reduce power: reduce waste.”
- Mark Horowitz, Stanford University & Rambus Inc.
PPA = Performance / (Power x Area); performance is measured in concurrent H.264 streams:

              | CPU    | GPU    | mCPU   | DSP    | HW IP I | HW IP II
Power (W)     | 60     | 80     | 0.6    | 0.24   | 0.12    | 0.015
Performance   | 1      | 2      | 0.1    | 1      | 1       | 1
Area (mm²)    | 200    | 400    | 10     | 3      | 2       | 0.5
PPA           | 8.3E-5 | 6.3E-5 | 1.6E-3 | 1.4    | 4.2     | 133.3
PPA (norm.)   | 1      | 0.75   | 19     | 1.7E4  | 5.0E4   | 1.6E6

Reduce Power: Reduce Waste
- Wasted Si Area
- Wasted Computation
- Wasted Bandwidth
- Wasted Voltage
- Wasted Design Resources
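Recomputing the PPA figure of merit from the raw rows reproduces the slide's printed values for every column except mCPU (which recomputes to 1.7E-2 rather than the printed 1.6E-3, likely a transcription slip) and shows the ~10⁶ gap between a CPU and a dedicated HW IP:

```python
# PPA figure of merit: performance per watt per mm^2 (higher is better).
rows = {  # name: (power in W, performance in H.264 streams, area in mm^2)
    "CPU":      (60,    1,   200),
    "GPU":      (80,    2,   400),
    "mCPU":     (0.6,   0.1, 10),   # recomputes to 1.7E-2, not the slide's 1.6E-3
    "DSP":      (0.24,  1,   3),
    "HW IP I":  (0.12,  1,   2),
    "HW IP II": (0.015, 1,   0.5),
}
ppa = {name: perf / (power * area) for name, (power, perf, area) in rows.items()}
for name, value in ppa.items():
    print(f"{name}: PPA = {value:.2g} ({value / ppa['CPU']:.3g}x vs CPU)")
```

Dividing by both power and area is what makes the HW IPs look so extreme: they give up almost nothing in H.264 throughput while shedding three to four orders of magnitude in each denominator.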
Make HW IP Programmable: Reconfigurable Computing
• Reprogrammable FSM with microcode + domain-specific HW FU with ISA
• Extreme RISC in horizontal control, and extreme CISC in vertical data

[Figure: advanced DSP design driven by workload analysis — a VLIW pipeline (instruction fetch F0-F2 from I$, instruction decode D1-D2, execute E1-E7 with tag match and AGU access, function units FU0-FU2 holding ALU1-ALU3 / MUL1-MUL2 / SHFT, LS pipelines, and writeback) around a central register file, extended to a Coarse-Grain Array (CGA) of FUs with distributed RFs; a smart compiler (C/C++) balances control latency against data throughput]

Run-time-reconfigurable function units per domain ISA:
- Radio ISA RC-FU: 4G/5G modem, channel
- Media ISA RC-FU: AV/image, 3D/ray-tracing/VR
- Intelligence ISA RC-FU: recognition, mining, synthesis
Reconfigurable Memory
• HW IPs' outstanding PPA comes from implicit, distributed, stacked queue memory
• Reconfigurable memory for HW-IP-level PPA

H.264 Luma Inter Prediction Algorithm — worst case, 16x16 mode (position i):
1. Vertical filtering (6-tap filter, 21x16): y = x0 - 5*x1 + 20*x2 + 20*x3 - 5*x4 + x5
   — 21x21 pixels (8-bit) → 21x16 pixels (16-bit)
2. Horizontal filtering (6-tap filter + scaling, 17x16): z = ((y0 - 5*y1 + 20*y2 + 20*y3 - 5*y4 + y5) + 512) >> 10
   — 21x16 pixels (16-bit) → 17x16 pixels (8-bit)
3. ¼-pel (16x16): r = (z0 + z1 + 1) >> 1
   — 17x16 pixels (8-bit) → 16x16 pixels (8-bit)
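The three stages can be sketched directly from the slide's equations (a minimal reference model; the 8-bit clamp after scaling is standard H.264 behavior, assumed here rather than shown on the slide):

```python
def tap6(x0, x1, x2, x3, x4, x5):
    # H.264 6-tap half-pel interpolation kernel, weights (1, -5, 20, 20, -5, 1)
    return x0 - 5 * x1 + 20 * x2 + 20 * x3 - 5 * x4 + x5

def clip8(v):
    # Clamp to the 8-bit pixel range (assumed, per H.264)
    return max(0, min(255, v))

def vertical(col):
    # Stage 1: one column of 8-bit pixels -> 16-bit intermediates y
    return [tap6(*col[i:i + 6]) for i in range(len(col) - 5)]

def horizontal(row):
    # Stage 2: 16-bit intermediates -> scaled 8-bit half-pel samples z
    return [clip8((tap6(*row[i:i + 6]) + 512) >> 10) for i in range(len(row) - 5)]

def quarter_pel(z0, z1):
    # Stage 3: quarter-pel sample r, rounded average of two neighbors
    return (z0 + z1 + 1) >> 1

# Flat input reproduces itself through every stage, since the kernel
# weights sum to 32 and the scaling step divides by 1024 = 32 * 32.
ys = vertical([100] * 21)      # each y = 100 * 32 = 3200
zs = horizontal([3200] * 21)   # each z = (3200 * 32 + 512) >> 10 = 100
print(ys[0], zs[0], quarter_pel(zs[0], zs[0]))
```

Note the data-movement shape that motivates the slide: stage 1 reads columns, stage 2 reads rows of the intermediate result — exactly the X-Y bi-directional access pattern the reconfigurable register file provides for free.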
2D Reconfigurable Memory
• No need to calculate addresses: implicit/local/distributed 32x32x64-bit RF
• X-Y bi-directional random access for extreme spatial locality in bit/pixel stream applications

H.264 FHD decoding — Luma Interpolation (loop example: II=2, loop count=16):
- Conventional VLIW (total cycles = 170): data load & address generation (repeated Load64b + add), data shuffling (repeated CAT_WIN), data computation (FirFilter64b, Round Sat, Avg64b), then data store & address generation (Store64b + add)
- X-Y Stack (total cycles ≈ 18): LOAD from stack → data computation (FirFilter64b, Round Sat, Avg64b for vertical filtering, horizontal filtering, and ¼-pel) → STORE to stack, with no explicit address generation or shuffling
3D Reconfigurable Memory
• Computer vision, virtual reality, and AI all need depth processing:
pixel-to-voxel processing
Run-Time-Reconfigurable Computing
• On-chip run-time compiler & kernel
• One of the biggest overheads has been reconfiguration-memory PPA
• 3D XPoint can accelerate the RTR FPPA (Run-Time-Reconfigurable Field
Programmable Processor Array)
New York Univ., 2011 IEEE CVPR
GPU Case
• 5120 cores x 1 GHz x 2 FLOP/core = ~10 TFLOPS
Hide the Memory Latency
• Massive threading in the GPU hides long latency, but in a much more limited way!
• Good for massive-thread applications, but very damaging for
CSP and/or random control flow and/or less massively parallel
thread applications
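The peak number is just cores x clock x FLOPs per cycle; the catch is that reaching it requires enough independent work in flight to cover memory latency (a Little's-law estimate — the 400-cycle latency below is illustrative, not from the slide):

```python
# Peak throughput: cores x clock x FLOPs issued per core per cycle.
cores, clock_hz, flops_per_cycle = 5120, 1e9, 2
peak_flops = cores * clock_hz * flops_per_cycle
print(peak_flops / 1e12, "TFLOPS")  # 10.24, the slide's "10 TFLOPS"

# Little's law: operations in flight = issue rate x latency to be hidden.
# Illustrative assumption: 400-cycle memory latency, one outstanding
# operation per thread.
latency_cycles = 400
threads_needed = cores * latency_cycles
print(threads_needed)  # ~2M independent operations to keep the chip busy
```

That in-flight requirement is exactly why the slide calls the mechanism "much limited": a workload with random control flow or only modest parallelism can never populate millions of independent slots, and the machine idles.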
Crisis in Memory Latency
• CPU-like work: small threads
- Detailed, sophisticated rendering
- Ray tracing
- Procedural graphics
※ Random-dynamic-irregular control flow & data access
※ Severe slowdown on PC/mobile GPUs
• Mobile-GPU-like work: medium threads
- Some repetitive, some detail
- Mid-size regular patterns
※ Half fixed-static-regular, half random-dynamic-irregular control flow & data access
※ Optimized for mobile GPUs
• PC-GPU-like work: massive threads
- Repetitive, simple rendering
- Large chunks of regular patterns
- Rasterized modeling
※ Fixed-static-regular control flow & data access
※ Outstanding speedup on PC GPUs
[Figure: random thread variations]
On-Chip DRAM
• GHz random-access on-chip DRAM, but 5x larger Si area than commercial DRAM
Samsung 40nm 2Gb, 2011 ISSCC: 25Mb/mm²
IBM 45nm SOI eDRAM, 2010 CICC: 5Mb/mm²
Direct API: 10 → 1 File Copies for 10x PPA
• Minimize memory accesses with a freeway to compute:
Algorithm → Applications → UI → Many APIs → OS → Drivers → Cores,
along with an on-chip API MMU to minimize file copy and transfer ('13.9 AMD Mantle GPU)