Power-Efficient Medical Image Processing using PUMA
Ganesh Dasika, Kevin Fan¹, Scott Mahlke
University of Michigan
Advanced Computer Architecture Laboratory
¹Parakinetics, Inc.
University of Michigan
Electrical Engineering and Computer Science
The Advent of the GPGPU
• Increasingly popular substrate for HPC
– Astrophysics
– Weather Prediction
– EDA
– Financial instrument pricing
– Medical Imaging
Advantages of GPGPUs
• High degree of parallelism
– Data-level
– Thread-level
• High bandwidth
• Commodity products
• Increasingly programmable
Disadvantages of GPGPUs
• Gap between computation and bandwidth
– 933 GFLOPS : 142 GB/s bandwidth
(0.15B of data per FLOP, ~26:1 Compute:Mem Ratio)
• Very high power consumption
– Graphics-specific hardware
– Multiple thread contexts
– Large register files and memories
– Fully general datapath
[Callout: Inefficiencies in all general-purpose architectures]
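The compute-to-bandwidth gap quoted on this slide can be checked with a few lines of arithmetic. This is a sketch, not the authors' calculation: the function names are hypothetical, and the 4-byte word size is an assumption (it matches 32-bit floating point and reproduces the slide's ~26:1 figure).

```python
def bytes_per_flop(gflops, bandwidth_gb_s):
    """Bytes of off-chip data available per floating-point operation."""
    return bandwidth_gb_s / gflops


def compute_to_mem_ratio(gflops, bandwidth_gb_s, word_bytes=4):
    """FLOPs that can be issued per word fetched (word size is an assumption)."""
    words_per_s = bandwidth_gb_s / word_bytes  # giga-words per second
    return gflops / words_per_s


# Figures from the slide: 933 GFLOPS peak, 142 GB/s memory bandwidth.
PEAK_GFLOPS = 933.0
BANDWIDTH_GB_S = 142.0

b_per_flop = bytes_per_flop(PEAK_GFLOPS, BANDWIDTH_GB_S)
ratio = compute_to_mem_ratio(PEAK_GFLOPS, BANDWIDTH_GB_S)
print(f"{b_per_flop:.2f} B/FLOP, about {ratio:.0f}:1 compute:mem")
```

With the slide's figures this works out to roughly 0.15 bytes per FLOP and about 26 FLOPs per 4-byte word.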
Programmability vs Efficiency?
[Figure: design space with axes Flexibility and Efficiency. Labeled points: FPGAs; General Purpose Processors; DSPs; Domain-specific Accelerators, GPGPUs; Loop Accelerators, ASICs. A "???" marks the open region: highly efficient, some programmability.]
Medical Image Reconstruction
• Compute intensive loops
– 32-bit floating point code
– High data/bandwidth requirements
• Increased demand for portability, low power
• Much current research focuses on using GPGPUs for this domain
CT Image reconstruction
• X-ray emitters and receptors on opposite sides of the patient
• Received X-ray intensity corresponds to tissue density
• Multiple scans (“slices”) taken around the patient are put together to reconstruct one 2D image
Projection & Sinogram
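Projection: all ray-sums in a direction
Sinogram: all projections
[Figure: X-rays pass through the object f(x,y) in the x-y plane and form the projection P(t) along the detector axis t; a second panel shows the sinogram, with t as one axis.]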
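The two definitions on this slide can be sketched in a few lines. This is an illustrative parallel-beam model, not the authors' code: `project`, `sinogram`, the nearest-bin rounding, and the detector padding are all assumptions chosen to keep the example short.

```python
import math


def project(image, theta):
    """Ray-sums of a square image along direction theta (radians).

    Each pixel is added to the detector bin nearest to
    t = x*cos(theta) + y*sin(theta), measured from the image center.
    """
    n = len(image)
    c = (n - 1) / 2.0      # image center in pixel units
    pad = n // 2           # extra bins so diagonal rays stay in range
    bins = [0.0] * (n + 2 * pad)
    for row in range(n):
        for col in range(n):
            x, y = col - c, c - row          # y axis points up
            t = x * math.cos(theta) + y * math.sin(theta)
            bins[round(t + c) + pad] += image[row][col]
    return bins


def sinogram(image, angles):
    """Stack one projection per angle: the sinogram is all projections."""
    return [project(image, theta) for theta in angles]


# Hypothetical 4x4 test object.
image = [
    [0, 0, 0, 0],
    [0, 5, 1, 0],
    [0, 2, 3, 0],
    [0, 0, 0, 0],
]
angles = [0.0, math.pi / 4, math.pi / 2]
sino = sinogram(image, angles)
for row in sino:
    print(row)
```

At 0 radians the projection reduces to the column sums of the image and at π/2 to its row sums; every projection preserves the total intensity of the image.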
Example: Backprojection
[Figure panels: Sinogram; Backprojected Image]
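Backprojection smears each projection back across the image along the direction it was taken from. The sketch below is a minimal illustration under stated assumptions: the name `backproject`, the nearest-bin geometry, and the hand-written two-view sinogram are all hypothetical. It applies no filter, so a single bright pixel comes back with streaks along the two viewing directions.

```python
import math


def backproject(sino, angles, n):
    """Unfiltered backprojection onto an n x n grid.

    Every pixel accumulates, for each angle, the value of the detector bin
    its ray lands in: t = x*cos(theta) + y*sin(theta), nearest bin.
    """
    c = (n - 1) / 2.0                 # image center in pixel units
    pad = (len(sino[0]) - n) // 2     # detector bins beyond the image edge
    image = [[0.0] * n for _ in range(n)]
    for proj, theta in zip(sino, angles):
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        for row in range(n):
            for col in range(n):
                x, y = col - c, c - row      # y axis points up
                t = x * cos_t + y * sin_t
                image[row][col] += proj[round(t + c) + pad]
    return image


# Hand-written sinogram of a single bright pixel in the top-right corner
# of a 3x3 image, seen from 0 and 90 degrees (5 detector bins per view).
angles = [0.0, math.pi / 2]
sino = [
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
]
recon = backproject(sino, angles, 3)
for row in recon:
    print(row)
```

The brightest value lands on the original pixel, but its whole row and column are also lit. Suppressing that blur is the role of the filtering step in filtered backprojection.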
Example: Filtered Backprojection
[Figure panels: Filtered Sinogram; Reconstructed Image]
Reconstruction: Solve for m’s
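[Figure: an X-Ray Emitter illuminates a 4×4 grid of unknown densities (the “Human Body”), with detector values around the grid.]
Densities:
m11 m12 m13 m14
m21 m22 m23 m24
m31 m32 m33 m34
m41 m42 m43 m44
Detector values (in extracted order): 16, 22, 11, 10, 22, 12, 10, 15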
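One way to "solve for the m's" is to treat each detector value as a linear equation over the pixels its ray crosses, then correct the estimate one ray at a time. The sketch below does this on a hypothetical 2×2 grid with made-up densities. The update rule is the Kaczmarz (algebraic reconstruction) iteration, used here as an illustration and not as the method the slides describe; the function name and ray layout are assumptions.

```python
def kaczmarz(rays, readings, n_pixels, sweeps=200):
    """Solve for pixel densities from ray-sums, one ray at a time.

    rays[i] lists the pixel indices crossed by ray i; readings[i] is the
    measured sum along that ray.
    """
    m = [0.0] * n_pixels
    for _ in range(sweeps):
        for pixels, measured in zip(rays, readings):
            residual = measured - sum(m[p] for p in pixels)
            correction = residual / len(pixels)   # spread the error evenly
            for p in pixels:
                m[p] += correction
    return m


# Hypothetical 2x2 "body", pixels numbered row by row: m11, m12, m21, m22.
true_m = [1.0, 2.0, 3.0, 4.0]
rays = [
    [0, 1], [2, 3],      # horizontal rays
    [0, 2], [1, 3],      # vertical rays
    [0, 3], [1, 2],      # diagonal rays
]
readings = [sum(true_m[p] for p in ray) for ray in rays]

estimate = kaczmarz(rays, readings, n_pixels=4)
print([round(v, 3) for v in estimate])
```

With horizontal, vertical, and diagonal rays the 2×2 system has a unique solution, so the estimate recovers the made-up densities. A real scan has far more unknowns than this toy case, which is why the cost of each sweep matters.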
Real Reconstruction Problem
• Intensity measured
• Rays transmitted through multiple “pixels”
• Find individual “pixel” values from transmission data
[Figure: grid of unknown (“?”) pixel values, labeled “512 values” along two sides, crossed by 100’s of diagonals @ 100’s of angles; sample detector readings: 712, 199, 255, 534, 417, 364, 555, 501, 355.]
Medical Imaging Applications
Benchmark             Inner-loop %Scalar/Vector            Outer-loop TLP   Compute:Mem ratio
Segmentation          Fully vectorizable                   Do-all           4:1
Laplacian Filtering   Fully vectorizable                   Do-all           3:1
Gaussian Convolution  Fully vectorizable with predicates   Do-all           6:1
MRI FH Vector         Fully vectorizable                   Do-all           6:1
MRI Q Vector          Fully vectorizable                   Do-all           5.5:1
• Image reconstruction for MRI/CT/PET scans
• Large amounts of Vector/Thread-level parallelism
• FP-intensive kernels
– Often requiring math library functions
• Data-intensive (~5:1 compute:mem ratio)
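To make "fully vectorizable" and "do-all" concrete, here is a sketch of a Laplacian filter. The 4-neighbor stencil and the function name are assumptions for illustration; the benchmark's actual kernel is not given in the slides. Each output pixel depends only on input pixels, so the inner loop has no loop-carried dependence and the outer iterations can run independently.

```python
def laplacian(image):
    """4-neighbor Laplacian of a 2D list; border pixels are left at zero."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):          # outer loop: do-all, rows are independent
        for c in range(1, w - 1):      # inner loop: no loop-carried dependence
            out[r][c] = (image[r - 1][c] + image[r + 1][c]
                         + image[r][c - 1] + image[r][c + 1]
                         - 4 * image[r][c])
    return out


# Hypothetical inputs: a flat image and a single bright pixel.
flat = [[7] * 5 for _ in range(5)]
spike = [[0] * 5 for _ in range(5)]
spike[2][2] = 1

flat_out = laplacian(flat)
spike_out = laplacian(spike)
for row in spike_out:
    print(row)
```

The flat image filters to all zeros, and the spike produces -4 at its own position with +1 at each of its four neighbors.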
Current Concerns: Portability/Power
• Currently, most scans require moving patient to imaging room
– Consumes time
– Stress on patient
• Studies show benefits of portable, bed-side scanners:
– 86% increase in patients suitable for post-stroke thrombolytic therapy [Weinreb et al., RSNA]
– 80-100% drop in scan-related complications [Gunnarsson et al., J. of Neurosurgery]
• New X-Ray emitters push for mAs of current use
Current Concerns: Performance
• High-accuracy CT algorithms take too long
– Iterative forward/backward projection
– ~Hours on modern CT scanners instead of minutes
• Interventional radiology
– Scans currently take minutes, but should take seconds
• CT-Fluoroscopy
– Several scans done in succession
Flexibility
• Software algorithms change over time
• NRE
• Time-to-market
PUMA
• Tiled architecture
• Bandwidth-matched for improved efficiency
• Each tile is a “Programmable Loop Accelerator”
[Figure: array of tiles with an external interface (“Extern. Interface”), alongside CPU, Mem, Disk, …]
Programmable Loop Accelerator
• Generalize accelerator without losing efficiency
[Figure: design space with axes Flexibility and “Efficiency, Performance”. Labeled points: FPGAs; General Purpose Processors; DSPs; Domain-specific Accelerators, GPGPUs; Loop Accelerators, ASICs. The “???” region carries the label Programmable Loop Accelerators.]
Designing Loop Accelerators
[Figure: a loop in C code is turned into hardware. Labels: Loop; C Code; Hardware; function units +, &, <<, *, MEM, BR; Point-to-point Connections; CRF; Local Mem.]
Loop Accelerator Architecture
[Figure: loop accelerator datapath. Labels: function units +, &, MEM, BR; Point-to-point Connections; CRF; FSM producing control signals; Local Mem.]
Hardware realization of modulo scheduled loop
Parameterized hardware:
• Static Control
• FUs
• Point-to-point Interconnect
• Shift Register Files
Programmable Loop-Accelerator Architecture
[Figure: PLA datapath. Labels: function units +, +/-, &, &/|, MEM, BR; register files marked SRF and RR; Point-to-point Connections; Ring; CRF; Literals; Control Memory; FSM producing control signals; Local Mem.]
               LA                      PLA
Functionality  Custom FU set           Generalized FUs + MOVs
Storage        Limited size, no addr.  Rotating Reg. Files
Connectivity   Point-to-point          Ring + Port-swapping
Control        Hardwired Control       Lit. Reg. File + Control Mem
MRI.FH PLA
• ~0.6 mm² per tile
• 38 FUs
• 128 32-bit registers
• Inter-FU BW 1 TB/sec
FU Type    #
FP-ADDSUB  6
FP-MPY     9
I-ADDSUB   8
MEM        9
I-MPY      1
Other      5
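The FU counts listed for this tile bound how tightly a loop can be modulo scheduled on it. A common lower bound is the resource-constrained minimum initiation interval (II): for each FU type, divide the operations per iteration by the number of FUs of that type and round up, then take the maximum. The sketch below applies that bound to the slide's FU counts. The operation mixes are hypothetical, the function name is an assumption, and the bound ignores recurrences and interconnect limits.

```python
import math

# FU counts per tile, taken from the MRI.FH PLA slide.
FU_COUNTS = {
    "FP-ADDSUB": 6,
    "FP-MPY": 9,
    "I-ADDSUB": 8,
    "MEM": 9,
    "I-MPY": 1,
    "Other": 5,
}


def res_mii(op_counts, fu_counts):
    """Resource-constrained lower bound on the initiation interval (cycles)."""
    return max(
        math.ceil(ops / fu_counts[fu_type])
        for fu_type, ops in op_counts.items()
        if ops > 0
    )


# Hypothetical operation mixes for one loop iteration (not from the slides).
loop_ops = {"FP-ADDSUB": 6, "FP-MPY": 9, "I-ADDSUB": 4, "MEM": 9}
other_loop_ops = {"FP-ADDSUB": 10, "FP-MPY": 4, "MEM": 6}

ii_fit = res_mii(loop_ops, FU_COUNTS)          # every FU type suffices
ii_over = res_mii(other_loop_ops, FU_COUNTS)   # FP adders oversubscribed
print(ii_fit, ii_over)
```

A loop whose II goes from 1 to 2 starts a new iteration half as often, so its steady-state throughput on the tile is halved.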
Performance on MRI.FH PLA
[Figure: bar chart of normalized performance (axis 0.0 to 1.0) for MRI.FH, MRI.Q, CT.segment, CT.laplace, and CT.gauss; series: Non-Generalized, Generalized; annotations: II preserved, II doubled, Unschedulable.]
Efficiency on MRI.FH PLA
[Figure: bar chart of normalized perf/power efficiency (axis 0.0 to 1.0) for MRI.FH, MRI.Q, CT.segment, CT.laplace, CT.gauss, and the mean; series: Non-Generalized, Generalized.]
PUMA System Design
• 5 systems designed around 5 benchmarks
• Each composed of identical tiles
• Assume same B/W as GTX280 (142 GB/s)
• # Tiles based on B/W requirements of benchmark
[Figure: array of tiles with an external interface (“Extern. Interface”), alongside CPU, Mem, Disk, …]
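The tile count on this slide follows from dividing the available memory bandwidth among identical tiles. A minimal sketch, assuming each tile needs a fixed sustained bandwidth: the function name is hypothetical and the per-tile figure is a made-up placeholder, since the slides do not list per-benchmark bandwidth requirements.

```python
GTX280_BANDWIDTH_GB_S = 142.0   # system bandwidth assumed on the slide


def tiles_for_bandwidth(system_bw_gb_s, tile_bw_gb_s):
    """Largest number of identical tiles the memory system can keep fed."""
    if tile_bw_gb_s <= 0:
        raise ValueError("per-tile bandwidth must be positive")
    return int(system_bw_gb_s // tile_bw_gb_s)


# Hypothetical per-tile demand in GB/s (not from the slides).
hypothetical_tile_bw = 4.0
n_tiles = tiles_for_bandwidth(GTX280_BANDWIDTH_GB_S, hypothetical_tile_bw)
print(n_tiles)
```

A benchmark with a lower compute:mem ratio needs more bandwidth per tile, so under this model it would be served by fewer tiles.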
System Performance
[Figure: bar chart of GOPs/sec (axis 0 to 160) for MRI.FH, MRI.Q, CT.segment, CT.laplace, and CT.gauss; series: Theoretical, Realized; power labels, in extracted order: 4W, 3W, 2.8W, 2.3W, 2.7W.]
Performance vs. GPGPU
[Figure: bar chart of TOPs/sec (axis 0.0 to 2.0) for PUMA, GTS 250, GTX 260, GTX 280, GTX 285, and GTX 295; series: Theoretical, Realized; annotations: 2X performance of GTS 250; 63% performance of GTX 295.]
Efficiency vs. GPGPU
[Figure: bar chart of PUMA perf/power efficiency over GPU (axis 0 to 60) for MRI.FH, MRI.Q, CT.segment, CT.laplace, and CT.gauss against GTS 250, GTX 260, GTX 280, GTX 285, and GTX 295; annotated values: 22X and 54X.]
Conclusions
• Power-efficient accelerator for medical imaging
• ASIC-like efficiency with programmability
• 63-201% of GPU performance
• 22-54X GPU Performance/Power efficiency
Thank you!!
Questions?