3-D Graph Processor
William S. Song, Jeremy Kepner, Huy T. Nguyen, Joshua
I. Kramer, Vitaliy Gleyzer, James R. Mann, Albert H.
Horst, Larry L. Retherford, Robert A. Bond, Nadya T.
Bliss, Eric I. Robinson, Sanjeev Mohindra, Julie Mullen
HPEC 2010
16 September 2010
This work was sponsored by DARPA under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

MIT Lincoln Laboratory
999999-1
XYZ 11/3/2010
Outline
• Introduction
  – Graph algorithm applications (commercial, DoD/intelligence)
  – Performance gap using conventional processors
• 3-D graph processor architecture
• Simulation and performance projection
• Summary
Graph Algorithms and Ubiquitous Commercial Applications

[figure: example graph representation with nodes 1-7]

Applications:
• Finding shortest or fastest paths on maps
• Communication, transportation, water supply, electricity, and traffic network optimization
• Optimizing paths for mail/package delivery, garbage collection, snow plowing, and street cleaning
• Planning for hospital, firehouse, police station, warehouse, shop, market, office and other building placements
• Routing robots
• Analyzing DNA and studying molecules in chemistry and physics
• Corporate scheduling, transaction processing, and resource allocation
• Social network analysis
DoD Graph Algorithm Applications

• Intelligence information analysis: analysis of email, phone calls, financial transactions, travel, etc.
  – Very large data set analysis
  – Established applications
• ISR sensor data analysis: post-detection data analysis for knowledge extraction and decision support
  – Large data set
  – Real time operations
  – Small processor size, weight, power, and cost
  – New applications
Sparse Matrix Representation of Graph Algorithms

[figure: graph with nodes 1-7 alongside its adjacency matrix A; one breadth-first-search step shown as the sparse product ATx]

• Many graph algorithms may be represented and solved with sparse matrix algorithms
  – Similar processing speed
  – Data flow is identical
• Makes a good graph processing instruction set
  – Useful tool for visualizing parallel operations
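The ATx idea on this slide can be sketched in a few lines of plain Python. This is an illustration of the concept, not the processor's code; the edge list below is an assumed example roughly matching the slide's 7-node graph.

```python
# Sketch (not the processor's actual code): one breadth-first-search step
# expressed as the sparse product y = A^T x, where A is the adjacency
# matrix and x marks the current search frontier.
edges = [(1, 2), (1, 4), (2, 7), (4, 3), (4, 6), (6, 5), (7, 5)]  # (from, to), illustrative

def spmv_transpose(edges, x):
    """y = A^T x over the Boolean (OR, AND) semiring:
    y[j] is set when some edge (i, j) leaves a node i set in x."""
    y = {}
    for i, j in edges:
        if x.get(i):
            y[j] = True
    return y

frontier = {1: True}               # start the search at node 1
frontier = spmv_transpose(edges, frontier)
print(sorted(frontier))            # → [2, 4], nodes one hop from node 1
```

The point of the representation is exactly what the slide says: the data flow of the graph traversal and of the sparse matrix-vector product is identical.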
Dense and Sparse Matrix Multiplication on Commercial Processors

[figures: sparse/dense matrix multiply performance (FLOP/s) on one PowerPC vs. number of non-zeros per column/row; sparse matrix multiply or graph operations/sec vs. processor count (1-1000) on COTS parallel multiprocessors. Systems: PowerPC 1.5GHz, Intel 3.2GHz, IBM P5-570 (fixed problem size), Custom HPC2; COTS cluster model curves for Beff = 0.1 and 1 GB/s]

• Commercial microprocessors are roughly 1000 times inefficient in graph/sparse matrix operations, in part due to poorly matched processing flow.
• Communication bandwidth limitations and other inefficiencies limit the performance improvements in COTS parallel processors.
Outline
• Introduction
• 3-D graph processor architecture
  – Architecture
  – Enabling technologies
• Simulation and performance projection
• Summary
3-D Graph Processor Enabling Technology Developments

• Cache-less memory: optimized for sparse matrix processing access patterns
• High bandwidth 3-D communication network: 3D interconnect (3x), randomized routing (6x), and parallel paths (8x) for 144x combined bandwidth while maintaining low power
• Accelerator based architecture: dedicated VLSI computation modules and systolic sorting technology for 20x-100x throughput
• Custom low power circuits: full custom design for critical circuitry (>5x power efficiency)
• Data/algorithm dependent multi-processor mapping: efficient load balancing and memory usage
• Sparse matrix based instruction set: reduction in programming complexity via parallelizable array data structure

3-D GRAPH PROCESSOR
• 1024 nodes
• 75,000 MSOPS*
• 10 MSOPS/Watt
*Million Sparse Operations Per Second
Sparse Matrix Operations

• C = A +.* B
  – Distributions: works with all supported matrix distributions
  – Matrix multiply is the throughput driver for many important benchmark graph algorithms; the processor architecture is highly optimized for this operation.
• C = A .± B, C = A .* B, C = A ./ B
  – Distributions: A, B, and C must have identical distributions
  – Dot operations are performed within local memory.
• B = op(k,A)
  – Distributions: works with all supported matrix distributions
  – Operation with a matrix and a constant; can also be used to redistribute a matrix and sum columns or rows.

• The +, -, *, and / operations can be replaced with any arithmetic or logical operators
  – e.g. max, min, AND, OR, XOR, …
• Instruction set can efficiently support most graph algorithms
  – Other peripheral operations performed by the node controller
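The operator-substitution idea above (replacing + and * with, say, min and +) is what lets one matrix kernel serve many graph algorithms. A minimal sketch in plain Python, with an assumed example graph and weights (the slide gives none): a min-plus product step that advances single-source shortest paths.

```python
# Sketch of "+.* with the operators replaced": a min-plus relaxation
# d'[j] = min(d[j], min_i d[i] + w(i, j)) is one step of single-source
# shortest paths. Graph and weights below are illustrative.
INF = float("inf")
# sparse weighted adjacency: A[i] = list of (j, weight) edges out of i
A = {1: [(2, 4.0), (4, 1.0)], 4: [(3, 2.0), (6, 5.0)], 2: [(7, 1.0)]}

def min_plus_step(A, d):
    """One min-plus matrix-vector step over the (min, +) semiring."""
    out = dict(d)
    for i, nbrs in A.items():
        for j, w in nbrs:
            cand = d.get(i, INF) + w
            if cand < out.get(j, INF):
                out[j] = cand
    return out

d = {1: 0.0}                    # distances from source node 1
for _ in range(3):              # iterate to propagate across 3 hops
    d = min_plus_step(A, d)
print(d)                        # shortest distances to reached nodes
```

The data flow is the same as an ordinary sparse matrix multiply; only the scalar operators change, which is why a single optimized hardware kernel covers such a wide class of algorithms.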
Node Processor Architecture

[figure: node processor block diagram: matrix reader, row/column reader, matrix writer, matrix ALU, sorter, and node controller attached to the node processor memory via communication, memory, and control buses, with connections to the global control bus and the global communication network]

Accelerator based architecture for high throughput
High Performance Systolic k-way Merge Sorter

[figure: 4-way merge sorting example; systolic merge sorter built from a multiplexer network, comparators and logic, and RS/RB register stages]

• Sorting constitutes >90% of graph processing using sparse matrix algebra
  – For sorting indexes and identifying elements for accumulation
• Systolic k-way merge sorter can increase sorter throughput >5x over a conventional merge sorter
  – 20x-100x throughput over microprocessor based sorting
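A software analogue of the k-way merge step may clarify why the sorter matters: merging k sorted runs in one pass streams elements out in order, so duplicate indices can be spotted and accumulated on the fly. The runs below are illustrative; the hardware does this with a systolic array rather than a heap.

```python
# Software analogue of the k-way merge (the hardware is systolic; this
# only illustrates the merge-and-accumulate pattern).
import heapq

runs = [[1, 3, 5, 9], [1, 2, 6], [2, 4, 6, 12], [2, 8, 9]]  # sorted runs

merged = list(heapq.merge(*runs))     # single k-way merge pass

# Accumulate duplicates as they stream out, the way a sparse ALU would
# sum matrix elements that share an index:
accum = []
for v in merged:
    if accum and accum[-1][0] == v:
        accum[-1][1] += 1
    else:
        accum.append([v, 1])
print(accum)  # → [[1, 2], [2, 3], [3, 1], [4, 1], [5, 1], [6, 2], [8, 1], [9, 2], [12, 1]]
```

Because accumulation happens in the same pass as the merge, no second scan over the data is needed, which is where the throughput advantage of a wide hardware merge comes from.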
Communication Performance Comparison Between 2-D and 3-D Architectures

Normalized Bisection Bandwidth
P:        2    10    100    1,000    10,000    100,000    1,000,000
2-D:      1    3.2   10     32       100       320        1,000
3-D:      1    4.6   22     100      460       2,200      10,000
3-D/2-D:  1    1.4   2.2    3.1      4.6       6.9        10

• Significant bisection bandwidth advantage for the 3-D architecture at large processor counts
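The ratio row in the table is consistent with the standard grid scaling laws: bisection bandwidth grows roughly as sqrt(P) for a 2-D layout and as P^(2/3) for a 3-D layout, so the 3-D advantage grows as P^(1/6). The slide does not state this model explicitly, so treat the snippet as an assumed reading of the numbers.

```python
# Assumed model behind the table: 2-D bisection ~ sqrt(P), 3-D ~ P**(2/3),
# so the 3-D/2-D ratio ~ P**(1/6). The printed values track the table's
# ratio row (1.4, 2.2, 3.1, 4.6, 6.9, 10) to within rounding.
for P in [10, 100, 1_000, 10_000, 100_000, 1_000_000]:
    ratio = P ** (1 / 6)
    print(f"P={P:>9,}  3-D/2-D ~ {ratio:.1f}")
```

At a million processors the model gives exactly 10, matching the table, which is why the deck argues the 3-D advantage matters most at large processor counts.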
3-D Packaging

[figure: stacked processor boards with coupling connectors in a 3-D processor chassis; board cross-section showing cold plate, TX/RX, insulator, routing layer, processor IC, heat removal layer, and coupler]

• Electro-magnetic coupling communication in the vertical dimension
  – Enables 3-D packaging and short routing distances
• 8x8x16 planned for initial prototype
2-D and 3-D High Speed Low Power Communication Link Technology

[figure: 2-D communication link with current mode TX/RX and 3-D communication link with inductive couplers, each with simulated signal amplitude vs. time]

• High speed 2-D and 3-D communication achievable with low power
  – >1000 Gbps per processor node, compared to a COTS-typical 1-20 Gbps
  – Only 1-2 Watts per node communication power consumption
  – 2-D and 3-D links have the same bit rate and similar power consumption; power is dominated by frequency multiplication and phase locking, for which 2-D and 3-D links use common circuitry
• Test chip under fabrication
Randomized-Destination 3-D Toroidal Grid Communication Network

[figure: 3-D toroidal grid; node router with communications arbitration engine, X/Y/Z standby FIFOs, and 2:1/3:1/5:1/7:1 FIFO stages; simulation plot of input queue size vs. total network throughput (packets/cycle)]

Simulation results:
• Randomized destination: randomized destination packet sequence; 87% full network efficiency achieved
• Unique destination: unique destination for all packets from one source; 15% full network efficiency achieved

• Randomizing the destinations of packets from source nodes dramatically increases network efficiency
  – Reduces contention
  – Algorithms developed to break up and randomize any localities in communication patterns
• 6x network efficiency achieved over typical COTS multiprocessor networks with the same link bandwidths
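The contention effect behind the 87% vs. 15% result can be shown with a toy model. This is an assumption-laden sketch: the contention metric and the all-sources-aligned worst case below are illustrative, not the deck's simulation.

```python
# Toy model of destination randomization: if every source streams its
# packets to destinations in the same order, all traffic piles onto one
# destination per cycle; independent random orders spread it out.
import random

def peak_contention(send_orders, n_dest):
    """Max number of sources hitting the same destination in any cycle."""
    peak = 0
    for cycle in range(n_dest):
        counts = [0] * n_dest
        for order in send_orders:
            counts[order[cycle]] += 1
        peak = max(peak, max(counts))
    return peak

random.seed(0)
n_src = n_dest = 64
in_order = [list(range(n_dest)) for _ in range(n_src)]        # all aligned
shuffled = [random.sample(range(n_dest), n_dest) for _ in range(n_src)]

print(peak_contention(in_order, n_dest))   # 64: every source collides every cycle
print(peak_contention(shuffled, n_dest))   # far smaller peak load
```

In the aligned case the network serializes completely; with randomized orders the peak per-destination load stays near the balls-in-bins average, which is the intuition behind the 6x efficiency claim.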
Hardware Supported Optimized Row/Column Segment Based Mapping

[figure: matrix elements distributed across a processor array]

• Efficient distribution of matrix elements necessary
  – To balance processing load and memory usage
• Analytically optimized mapping algorithm developed
  – Provides very well balanced processing loads when matrices or matrix distributions are known in advance
  – Provides relatively robust load balancing when matrix distributions deviate from expected
  – Complex mapping scheme enabled by hardware support
• Can use the 3-D Graph Processor itself to optimize mapping in real time
  – Fast computing of optimized mapping possible
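Why segment-based mapping helps can be seen with a toy comparison. The round-robin segment rule below is a stand-in for the deck's analytically optimized algorithm, which is not described in detail; the skewed row-degree data is invented for illustration.

```python
# Toy comparison (not the slide's algorithm): assigning whole rows to
# processors strands a heavy "hub" row on one processor, while splitting
# each row into segments spreads it out.
def whole_row_map(nnz_by_row, n_proc):
    load = [0] * n_proc
    for r, nnz in enumerate(nnz_by_row):
        load[r % n_proc] += nnz          # whole row to one processor
    return load

def segment_map(nnz_by_row, n_proc, seg=4):
    load = [0] * n_proc
    p = 0
    for nnz in nnz_by_row:
        per_seg, rem = divmod(nnz, seg)  # split row into `seg` segments
        for s in range(seg):
            load[p] += per_seg + (1 if s < rem else 0)
            p = (p + 1) % n_proc         # round-robin the segments
    return load

rows = [1000] + [2] * 63                 # one hub row, many light rows
print(max(whole_row_map(rows, 8)))       # hub lands on one processor
print(max(segment_map(rows, 8)))         # hub split across processors
```

Real graph matrices have exactly this kind of skew (a few very dense rows/columns), which is why row/column segmenting, backed by hardware support, is needed for balanced loads.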
Programming, Control, and I/O Hardware and Software

[figure: hardware system architecture (host and controller driving node arrays of node processors) alongside the software stack: user application, application optimization, kernel optimization, middleware libraries, host API, co-processor device driver, command interface, node command interface, master execution kernel, node execution kernel, microcode interface, specialized engines]
Outline
• Introduction
• 3-D graph processor architecture
• Simulation and performance projection
  – Simulation
  – Performance projection (computational throughput, power efficiency)
• Summary
Simulation and Verification

[figure: host and controller driving node arrays of node processors]

• Bit level accurate simulation of the 1024-node system used for functional verification of sparse matrix processing
• Memory performance verified with a commercial IP simulator
• Computational module performance verified with process migration projection from existing circuitries
• 3-D and 2-D high speed communication performance verified with circuit simulation and test chip design
Performance Estimates
(Scaled Problem Size, 4GB/processor)

[figures: computational throughput (sparse matrix multiply or graph operations/sec) and performance per Watt (operations/sec/Watt) vs. processor count (1-1000). Systems: PowerPC 1.5GHz, Intel 3.2GHz, IBM P5-570 (fixed problem size), Custom HPC2; COTS cluster model curves for Beff = 0.1 and 1 GB/s; the current estimate sits ~20x above the Phase 0 design goal]

Close to 1000x graph algorithm computational throughput and 100,000x power efficiency projected at 1024 processing nodes, compared to the best COTS processor custom designed for graph processing.
Summary

• Graph processing critical to many commercial, DoD, and intelligence applications
• Conventional processors perform poorly on graph algorithms
  – Architecture poorly matched to computational flow
• MIT LL has developed a novel 3-D processor architecture well suited to graph processing
  – Numerous innovations enabling efficient graph computing:
    Sparse matrix based instruction set
    Cache-less accelerator based architecture
    High speed systolic sorter processor
    Randomized routing
    3-D coupler based interconnect
    High-speed low-power custom circuitry
    Efficient mapping for computational load/memory balancing
  – Orders of magnitude higher performance projected/simulated