Parallel Processing Research at the Point of Inflection

HTMT-class Latency Tolerant
Parallel Architecture for
Petaflops-scale Computation
Dr. Thomas Sterling
California Institute of Technology
and
NASA Jet Propulsion Laboratory
October 1, 1999
5/23/2017
Dr. Thomas Sterling - HTMT
Petaflops Architecture
3
[Figure: application domains built on "Basic Algorithms & Numerical Methods" (ODE, PDE, Monte Carlo, n-body, Fourier Methods, Graph Theoretic, Discrete Events, Fields, Symbolic Processing, Pattern Matching, Raster Graphics, Transport). Domains shown: Rational Drug Design; Nanotechnology; Biomolecular Dynamics; Fracture Mechanics; Crystallography; Diffraction Inversion Problems; Atomic Scattering; Condensed Matter Electronic Structure; Population Genetics; Transportation Systems; Plasma Processing; Chemical Reactors; Cloud Physics; Neutron Transport; Boilers; Multimedia Collaboration Tools; Scientific Visualization; Structural Mechanics; Weather and Climate; Seismic Processing; Multibody Dynamics; Geophysical Fluids; Aerodynamics; Ecosystems; Economics Models; Orbital Mechanics; Astrophysics; Electromagnetics; Intelligent Search; Computer Algebra; Databases; Magnet Design; Data Mining; CAD; Intelligent Agents; Automated Deduction; CVD; Multiphase Flow; Cryptography; Computer Vision; Virtual Prototypes; Genome Processing; Virtual Reality; Reaction-Diffusion; CFD; Nuclear Structure; Radiation Transport; Air Traffic Control; Computational Steering; Flow in Porous Media; Pipeline Flows; Economics; Electrical Grids; Signal Processing; Reservoir Modelling; Biosphere/Geosphere; Distribution Networks; VLSI Design; QCD; Neural Networks; Combustion; Quantum Chemistry; Manufacturing Systems; Military Logistics; Data Assimilation; Electronic Structure; Actinide Chemistry; Cosmology; Phylogenetic Trees; MRI Imaging; Molecular Modelling; Chemical Dynamics; Tomographic Reconstruction; Number Theory.]
A 10 Gflops Beowulf
172 Intel Pentium Pro microprocessors
Center for Advanced Computing Research
California Institute of Technology
Emergence of Beowulf Clusters
1st printing: May, 1999
2nd printing: Aug. 1999
MIT Press
Beowulf
Scalability
COTS PetaFlop System
[Diagram: INTEGRATED SMP - WDM. Block labels: DRAM - 4 GBYTES - HIGHLY INTERLEAVED; MULTI-LAMBDA AON; CROSS BAR, coherence, 640 GBYTES/SEC; 2nd LEVEL CACHE, 96 MBYTES, 64 bytes wide, 160 GBYTES/sec; VLIW/RISC CORE, 24 GFLOPS, 6 GHz. Multi-Die Multi-Processor boxes numbered 1 to 64 (128 die/box, 4 CPU/die) and I/O attach to an ALL-OPTICAL SWITCH. 10 meters = 50 ns delay.]
COTS PetaFlops System
• 8192 Dies (4 CPU/die minimum)
• Each Die is 120 GFlops
• 1 PetaFlop Peak
• Power 8192 x 200 Watts = 1.6 MegaWatts
• Extra Main Memory >3 MegaWatts (512 TBytes)
• 15.36 TFlops/Rack (128 die)
• 30 KWatts/Rack - thus 64 racks - 30 inch
• Common System I/O
• 2 Level Main Memory
• Optical Interconnect
– OC768 Channels (40 GHz)
– 128 Channels per Die (DWDM) - 5.12 THz
– ALL Optical Switching
• Bisection Bandwidth of 50 TBytes/sec
– 15 TFlops/rack * .1 bytes/flop/sec * 32 racks
• Rack Bandwidth - 15 TFlops * .1 = 12 THz
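The COTS figures above can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check on the slide's numbers; the function and key names are illustrative, and it assumes that "12 THz" means 1.5 TBytes/sec expressed in bits and that the slide rounds 15.36 TFlops/rack down to 15 for the bandwidth estimates.

```python
# Figures quoted on the slide; names are illustrative only.
DIES = 8192
GFLOPS_PER_DIE = 120
WATTS_PER_DIE = 200
DIES_PER_RACK = 128
BYTES_PER_FLOP = 0.1    # bandwidth ratio quoted on the slide
BISECTION_RACKS = 32    # half of the 64 racks


def cots_budget():
    """Recompute the slide's headline numbers from its per-die figures."""
    rack_bw_tbytes = 15 * BYTES_PER_FLOP  # slide uses the rounded 15 TFlops/rack
    return {
        "racks": DIES // DIES_PER_RACK,
        "peak_pflops": DIES * GFLOPS_PER_DIE / 1e6,
        "cpu_power_mw": DIES * WATTS_PER_DIE / 1e6,
        "rack_tflops": DIES_PER_RACK * GFLOPS_PER_DIE / 1e3,
        "rack_tbits_per_s": rack_bw_tbytes * 8,
        "bisection_tbytes_per_s": rack_bw_tbytes * BISECTION_RACKS,
    }


if __name__ == "__main__":
    for key, value in cots_budget().items():
        print(f"{key}: {value:g}")
```

Under these assumptions the peak comes out just under 1 PetaFlop and the bisection bandwidth at 48 TBytes/sec, which the slide rounds to 50.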
The SIA CMOS Roadmap
[Chart: MB per DRAM Chip, Logic Transistors per Chip (M), and uP Clock (MHz) on a log scale from 1 to 100,000, plotted against Year of Technology Availability (1997, 1999, 2001, 2003, 2006, 2009, 2012).]
Requirements for High End Systems
• Bulk capabilities
– performance
– storage capacities
– throughput/bandwidth
– cost, power, complexity
• Efficiency
– overhead
– latency
– contention
– starvation/parallelism
• Usability
– generality
– programmability
– reliability
Points of Inflection in the History of Computing
• Heroic Era (1950)
– technology: vacuum tubes, mercury delay lines, pulse transformers
– architecture: accumulator based
– model: von-Neumann, sequential instruction execution
– examples: Whirlwind, EDSAC
• Mainframe (1960)
– technology: transistors, core memory, disk drives
– architecture: register bank based
– model: virtual memory
– examples: IBM 7090, PDP-1
Points of Inflection in the History of Computing
• Supercomputers (1980)
– technology: ECL, semiconductor integration, RAM
– architecture: pipelined
– model: vector
– example: Cray-1
• Massively Parallel Processing (1990)
– technology: VLSI, microprocessor
– architecture: MIMD
– model: Communicating Sequential Processes, Message passing
– examples: TMC CM-5, Intel Paragon
• ? (2000)
HTMT Objectives
• Scalable architecture with high sustained
performance in the presence of disparate cycle times
and latencies
• Exploit diverse device technologies to achieve
substantially superior operating point
• Execution model to simplify parallel system
programming and expand generality and applicability
Hybrid Technology MultiThreaded Architecture
[Block diagram. Components: I/O FARM, 3D Mem, DRAM PIM, OPTICAL SWITCH, SRAM PIM, RSFQ Nodes. Functions annotated on the I/O and memory subsystems:]
• Compress/Decompress
• Spectral Transforms
• Data Structure Initializations
• "In the Memory" Operations
• ECC/Redundancy
• Routing
• RSFQ Thread Management
• Context Percolation
• Scatter/Gather Indexing
• Pointer chasing
• Push/Pull Closures
• Synchronization Activities
Storage Capacity by Subsystem
2007 Design Point
Subsystem        Unit Storage   # of Units   Total Storage
CRAM             32 KB          16 K         512 MB
SRAM             64 MB          16 K         1 TB
DRAM             512 MB         32 K         16 TB
HRAM             10 GB          128 K        1 PB
Primary Disk     100 GB         100 K        10 PB
Secondary Disk   100 GB         100 K        10 PB
Tape             1 TB           6Kx20        120 PB
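The totals in the storage table are unit size times unit count. The sketch below recomputes them; it assumes binary prefixes throughout (K = 1024, KB = 1024 bytes), which the slide does not state, and all names are illustrative.

```python
# Assumption: binary prefixes (the slide does not say which convention it uses).
KB, MB, GB, TB, PB = (1024 ** n for n in range(1, 6))
K = 1024

# (subsystem, unit storage in bytes, number of units) as listed on the slide
LEVELS = [
    ("CRAM", 32 * KB, 16 * K),
    ("SRAM", 64 * MB, 16 * K),
    ("DRAM", 512 * MB, 32 * K),
    ("HRAM", 10 * GB, 128 * K),
    ("Tape", 1 * TB, 6 * K * 20),
]


def format_size(nbytes):
    """Render a byte count using the largest binary prefix that fits."""
    for name, scale in (("PB", PB), ("TB", TB), ("GB", GB), ("MB", MB), ("KB", KB)):
        if nbytes >= scale:
            return f"{nbytes / scale:g} {name}"
    return f"{nbytes} B"


def totals():
    """Map each subsystem to its formatted total capacity."""
    return {name: format_size(unit * count) for name, unit, count in LEVELS}


if __name__ == "__main__":
    for name, total in totals().items():
        print(f"{name}: {total}")
```

With binary prefixes the HRAM product is 1.25 PB, so the table's "1 PB" reads as a rounded figure; the other rows shown here match exactly.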
HTMT Strategy
• High performance
– Superconductor RSFQ logic
– Data Vortex optical interconnect network
– PIM smart memory
• Low power
– Superconductor RSFQ logic
– Optical holographic storage
– PIM smart memory
HTMT Strategy (cont)
• Low cost
– reduce wire count through chip-to-chip fiber
– reduce processor count through x100 clock speed
– reduce memory chips by 3-2 holographic memory layer
• Efficiency
– processor level multithreading
– smart memory managed second stage context pushing
multithreading
– fine grain regular & irregular data parallelism exploited in memory
– high memory bandwidth and low latency ops through PIM
– memory to memory interactions without processor intervention
– hardware mechanisms for synchronization, scheduling,
data/context migration, gather/scatter
HTMT Strategy (cont)
• Programmability
– Global shared name space
– hierarchical parallel thread flow control model
• no explicit processor naming
– automatic latency management
• automatic processor load balancing
• runtime fine grain multithreading
• automatic context pushing for process migration (percolation)
– configuration transparent, runtime scalable
RSFQ Roadmap (VLSI Circuit Clock Frequency)
[Chart: clock frequency on a log scale (100 MHz, 1 GHz, 10 GHz, 100 GHz, 1 THz) versus Year (1998, 2001, 2004, 2007, 2010). RSFQ points: low-Tc (4-5 K) at 3.5 um, 1.5 um, 0.8 um, 0.4 um, and 0.25 um, marked for optical lithography and e-beam lithography; high-Tc (65-77 K) marked "??". CMOS (SIA Forecast) points at 0.25 um, 0.13 um, and 0.07 um.]
RSFQ Building Block
[Circuit diagram: inductor L1 with Josephson junctions JJ1 and JJ2.]
Advantages
• X100 clock speeds achievable
• X100 power efficiency advantage
• Easier fabrication
• Leverage semiconductor fabrication tools
• First technology to encounter ultra-high speed operation
Superconductor Processor
• 100 GHz clock, 33 GHz inter-chip
• 0.8 micron Niobium on Silicon
• 100K gates per chip
• 0.05 watts per processor
• 100 kW per Petaflops
FUNCTIONALITY AND CAPABILITY
(1 petaflops machine, Yr. 2006, design COOL-0)
1. Technology Assumptions
(a) chip
Min JJ size                 0.8 µm
Min runner width            1.5 µm
Nb layers                   8+1 (4 wires)
Junction density            1M/cm² logic, 3M/cm² memory
Runner pitch (in 1 layer)   5 µm
Chip size                   2×2 cm²
Contact Pin Pitch           100×100 µm²
(b) CMCM
Size                        20×20 cm²
Nb layers                   4+1 (2 wires)
Runner width                3 µm
Runner pitch (in 1 layer)   8 µm
(c) CPCB
Size                        54 cm (max diam)
Metallic layers             10+1 (5 wires)
Runner pitch                100 µm
6. COOL 0 System as a Whole
SPELLs Total          4K             12K chips   40 BJJs
CRAM Total            4 Gbytes       16K chips   160 BJJs
CNET Total            24,576 nodes   2K chips    8 BJJs
COOL 0 Grand Total    512 CMCMs      160 CPCBs   1.0 Pflops
I/O Bandwidth         4 Pbytes/s
Physical Size         0.5 m³
Dissipated Power      250 W @ 4 K
Refrigeration Power   100 kW
Data Vortex Optical
Interconnect
DATA VORTEX LATENCY DISTRIBUTION
[Chart: number of messages (0 to 120x10^3) versus number of hops (0 to 100) for network height = 1024, with one curve for 22% active input ports and one for 100% active input ports.]
Single-mode rib waveguides on silicon-on-insulator wafers‡
• Hybrid sources and detectors
• Mix of CMOS-like and 'micromachining'-type processes for fabrication
[Diagram: rib waveguide cross-section showing the optical mode, SiO2 cladding, Si, buried oxide, and Si substrate.]
‡ e.g.:
R A Soref, J Schmidtchen & K Petermann, IEEE J. Quantum Electron. 27 p.1971 (1991)
A Rickman, G T Reed, B L Weiss & F Namavar, IEEE Photonics Technol. Lett. 4 p.633 (1992)
B Jalali, P D Trinh, S Yegnanarayanan & F Coppinger, IEE Proc. Optoelectron. 143 p.307 (1996)
PIM Provides Smart Memory
• Merge logic and memory
• Integrate multiple logic/mem stacks on single chip
• Exposes high intrinsic memory bandwidth
• Reduction of memory access latency
• Low overhead for memory oriented operations
• Manages data structure manipulation, context coordination and percolation
[Diagram: a Single Chip tiled with Basic Silicon Macros, each containing Memory Stacks, Sense Amps, Decode, and Node Logic.]
Multithreaded PIM DRAM
Multithreaded Control of PIM Functions
• multiple operation sequences with low context switching overhead
• maximize memory utilization and efficiency
• maximize processor and I/O utilization
• multiple banks of row buffers to hold data, instructions, and addr
• data parallel basic operations at row buffer
• manages shared resources such as FP
Direct PIM to PIM Interaction
• memory communicates with memory within and across chip boundaries without external control processor intervention by "parcels"
• exposes fine grain parallelism intrinsic to vector and irregular data structures
• e.g. pointer chasing, block moves, synchronization, data balancing
[Diagram: Memory Stack, Row Registers, Row Buffers, Boolean ALU, GP - ALU, Context Registers, Node Logic, two FP units, Memory Bus I/F (PCI), Hi Speed Links (Firewire).]
Silicon Budget for HTMT DRAM PIM
• Designed to provide proper balance of memory & support for fiber bandwidth
– Different Vortex configurations => different #s
• In 2004, 16 TB = 4096 groups of 64 chips
• Each Chip:
[Chip diagram: four 32MB memory blocks, FtPt and ASAP units, SuperScalar Core, HRAM & Vortex Output Interface, Fiber WDM Optical Receiver.]
[Pie chart, By Area: segments of 15.9%, 50.8%, and 33.3%; labels include Logic and Memory.]
Holographic 3/2 Memory
Performance Scaling
                    1998     2001      2004
Module capacity     1 Gbit   1 GB      10 GB
Number of modules   -        10^5      10^5
Access time         1 ms     100 µs    10 µs
Readout bandwidth   1 Gb/s   .1 PB/s   1 PB/s
Record bandwidth    1 Mb/s   1 GB/s    .1 PB/s
Advantages
• petabyte memory
• competitive cost
• 10 µsec access time
• low power
• efficient interface to DRAM
Disadvantages
• recording rate is slower than the readout rate for LiNbO3
• recording must be done in GB chunks
• long term trend favors DRAM unless new materials and lasers are used
[Diagram, SIDE VIEW: 77 K and 4 K stages; 50 W; Fiber/Wire Interconnects. Dimension labels: 1.4 m, 1 m, 0.3 m, 1 m, 3 m, 0.5 m.]
[Diagram, SIDE VIEW: Nitrogen (77 K) and Helium (4 K) stages; 50 W; Fiber/Wire Interconnects; Hard Disk Array (40 cabinets); Tape Silo Array (400 Silos); Front End Computer Server; Console; Cable Tray Assembly; two Generators (220 Volts each); WDM Source; 980 nm Pumps; Optical Amplifiers (20 cabinets). Dimension labels: 3 m, 3 m, 0.5 m.]
HTMT Facility (Top View)
[Floor plan including the Cryogenics Refrigeration Room. Dimension labels: 15 m, 27 m, 27 m, 25 m.]
Floor Area
1. HTMT                   1,000
2. Server                 250
3. Pump/MG                3,000
4. Laser 980              1,000
5. Disk Farm (80)         1,600
6. Tape Robot Farm (20)   4,000
7. Operator Room          1,000
TOTAL = 11,850 sq ft
Power Dissipation by Subsystem
Petaflops Design Point
Subsystem          Unit Type   Unit Power   # of Units   Total Power
Cryostat/Cooling   System      400 kW       1            400 kW
SRAM               PIM         5 W          16 K         80 kW
WDM source/amps    Port        15 W         4 K          62 kW
Data Vortex        Subnet      2 kW         128          258 kW
DRAM               PIM         625 mW       32 K         20 kW
HRAM               HRAM        100 mW       128 K        13 kW
Primary Disk       Disk        15 W         100 K        1500 kW
Tape               Silo        1 kW         20           20 kW
Server             Machine     100 kW       1            100 kW
TOTAL                                                    2.4 MW
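A quick way to read the power table is to sum the per-subsystem totals and see which row dominates. The sketch below uses the "Total Power" column exactly as printed on the slide; the function names are illustrative.

```python
# "Total Power" column from the slide, in kW.
TOTAL_POWER_KW = {
    "Cryostat/Cooling": 400,
    "SRAM": 80,
    "WDM source/amps": 62,
    "Data Vortex": 258,
    "DRAM": 20,
    "HRAM": 13,
    "Primary Disk": 1500,
    "Tape": 20,
    "Server": 100,
}


def system_power_kw():
    """Sum of all subsystem totals, in kW."""
    return sum(TOTAL_POWER_KW.values())


def largest_consumer():
    """Return (subsystem, share of total) for the biggest row."""
    name = max(TOTAL_POWER_KW, key=TOTAL_POWER_KW.get)
    return name, TOTAL_POWER_KW[name] / system_power_kw()


if __name__ == "__main__":
    name, share = largest_consumer()
    print(f"total: {system_power_kw() / 1000:.2f} MW")
    print(f"largest: {name} ({share:.0%})")
```

The rows sum to 2,453 kW, consistent with the quoted 2.4 MW, and the primary disks account for roughly 60% of it, far more than the cryogenic plant.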
Subsystem Interfaces
2007 Design Point
Subsystem     Interface to   Wires/Port   Speed/Wire (bps)   #ports     Aggregate BW (Byte/s)   Wire count   type of IF
RSFQ          SRAM           16000        20.0E+9            512        20.5E+15                8.2E+6       wire
SRAM          RSFQ           1000         2.0E+9             8000       2.0E+15                 8.0E+6       TBD
SRAM          Data Vortex    1000         2.0E+9             8000       2.0E+15                 8.0E+6       wire
Data Vortex   SRAM           1            640.0E+9           2048       163.8E+12               2.0E+3       fiber
Data Vortex   DRAM           1            640.0E+9           2048       163.8E+12               2.0E+3       fiber
DRAM          Data Vortex    1000         1.0E+9             33000      4.1E+15                 33.0E+6      wire
DRAM          HRAM           1000         1.0E+9             33000      4.1E+15                 33.0E+6      wire
DRAM          Server         1            800.0E+6           1000       100.0E+9                1.0E+3       wire
Server        DRAM           1            800.0E+6           1000       100.0E+9                1.0E+3       (fiber channel)
Server        Disk           1            800.0E+6           1000       100.0E+9                1.0E+3       (fiber channel)
Server        Tape           1            800.0E+6           200        20.0E+9                 200.0E+0     (fiber channel)
HRAM          DRAM           800          100.0E+6           1.00E+05   1.0E+15                 80.0E+6      wire
• Same colors indicate a connection between subsystems
• Horizontal lines group interfaces within a subsystem
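The derived columns of the interface table follow from the first three: aggregate bandwidth in bytes per second is wires per port times bits per second per wire times number of ports, divided by 8, and wire count is wires per port times number of ports. A minimal sketch, with illustrative function names, that reproduces a few rows:

```python
def aggregate_bw_bytes(wires_per_port, bps_per_wire, ports):
    """Aggregate bandwidth in bytes/s across all ports of an interface."""
    return wires_per_port * bps_per_wire * ports / 8


def wire_count(wires_per_port, ports):
    """Total number of wires (or fibers) for an interface."""
    return wires_per_port * ports


# (subsystem, interface to, wires/port, speed/wire in bps, #ports) from the table
ROWS = [
    ("RSFQ", "SRAM", 16000, 20.0e9, 512),
    ("Data Vortex", "SRAM", 1, 640.0e9, 2048),
    ("DRAM", "HRAM", 1000, 1.0e9, 33000),
    ("HRAM", "DRAM", 800, 100.0e6, 100000),
]

if __name__ == "__main__":
    for src, dst, wires, speed, ports in ROWS:
        bw = aggregate_bw_bytes(wires, speed, ports)
        print(f"{src} -> {dst}: {bw:.3e} Byte/s over {wire_count(wires, ports):.1e} wires")
```

The same formula also shows that the two directions of the RSFQ/SRAM link are specified differently: 20.5E+15 Byte/s on the RSFQ side against 2.0E+15 Byte/s on the SRAM side.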
Getting Efficiency
• Contention:
– hardware for bandwidth, logic throughput, hardware arbitration
• Latency:
– multithreaded processor with hardware context switching
– “percolation” for proactive prestaging of executables
• PIM-DRAM & PIM-SRAM provide smart data oriented mechanisms
• Overhead:
– hardware context switching
– in PIM smart synchronization and context management
– proactive percolation performed in PIM
• Starvation:
– dynamic load balancing
– high speed processor for reduced parallelism
– expose/exploit fine grain parallelism
Multilevel Multithreaded
Execution Model
•Extend latency hiding of multithreading
•Hierarchy of logical thread
•Delineates threads and thread ensembles
•Action sequences, state, and precedence constraints
•Fine grain single cycle thread switching
•Processor level, hides pipeline and time of flight latency
•Coarse grain context "percolation"
•Memory level, in memory synchronization
•Ready contexts move toward processors, pending
contexts towards big memory
Tera MTA Friends
Percolation of Active Tasks
• Multiple stage latency management
methodology
• Augmented multithreaded resource
scheduling
• Hierarchy of task contexts
• Coarse-grain contexts coordinate in
PIM memory
• Ready contexts migrate to SRAM
under PIM control releasing threads
for scheduling
• Threads pushed into SRAM/CRAM
frame buffers
• Strands loaded in register banks on
space available basis
[Diagram: Strands Stored in Regs; Threads Stored in SRAM; Contexts Stored in DRAM.]
HTMT Percolation Model
[Diagram: the Run Time System in SRAM-PIM feeds the CRYOGENIC AREA. Stages: Parcel Dispatcher & Dispenser, Parcel Assembly & Disassembly, Parcel Invocation & Termination, and Re-Use, connected by the A-Queue, I-Queue, D-Queue, and T-Queue and a C-Buffer. Other labels: DMA to CRAM, start, done, Split-Phase Synchronization to SRAM, DMA to DRAM-PIM.]
HTMT Execution Model
[Diagram: Data Structures in DRAM PIMs; "Contexts" in SRAM (SRAM PIMs); "Contexts" in CRAM; SPELL; connected through the VORTEX and CNET networks.]
DRAM PIM Functions
• Initialize data structures
• Stride thru regular data structures, transferring
to/from SRAM
• Pointer chase thru linked data structures
• “Join-like” operations
• Reorderings
• Prefix operations
• I/O transfer management
– DMA, compress/decompress, ...
SRAM PIM Functions
• Initiate Gather/Scatter to/from DRAM
• Recognize when sufficient operands arrive in SRAM
context block
• Enqueue/Dequeue SRAM block addresses
• Initiate DMA transfers to/from CRAM context block
• Signal SPELL re task initiation
• Prefix operations like Flt Pt Sum
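Two of the SRAM PIM functions listed above, recognizing when sufficient operands have arrived in a context block and enqueueing the block address for dispatch, can be illustrated with a toy model. This is a sketch only: the class, its methods, and the dictionary layout are hypothetical and do not describe the HTMT hardware or runtime interface.

```python
from collections import deque


class SramPimSketch:
    """Toy model of operand counting and ready-queue handling (hypothetical)."""

    def __init__(self):
        self.blocks = {}      # block address -> operand count needed and values so far
        self.ready = deque()  # addresses of context blocks ready for dispatch

    def allocate(self, addr, needed):
        """Reserve a context block that becomes ready after `needed` operands."""
        self.blocks[addr] = {"needed": needed, "operands": []}

    def deliver(self, addr, value):
        """Record one arriving operand; enqueue the block once it is complete."""
        block = self.blocks[addr]
        block["operands"].append(value)
        if len(block["operands"]) == block["needed"]:
            self.ready.append(addr)

    def dispatch(self):
        """Dequeue the oldest ready block, or return None if nothing is ready."""
        if not self.ready:
            return None
        addr = self.ready.popleft()
        return addr, self.blocks.pop(addr)["operands"]


if __name__ == "__main__":
    pim = SramPimSketch()
    pim.allocate(0x100, needed=2)
    pim.deliver(0x100, 1.5)
    pim.deliver(0x100, 2.5)
    print(pim.dispatch())
```

Blocks are dispatched in the order they become complete, not the order they were allocated, so work that is waiting on operands never holds up work that is ready.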
StrawMan Prototype for Phase 4
Subsystem     Unit Capability   Number of Units   Total Capability
Processors    100 Gflops        128               10 Tflops
CRAM          8 Kbytes          512               4 Mbytes
SRAM          1 Mbyte/1 proc.   16K               16 Gbytes
Data Vortex   4 Gbits/s/8       4K in             128 Tbits/s
DRAM          8 Mbyte/4 proc.   64K               512 Gbytes
HRAM          1 Gbyte           8K                8 Tbytes