Power-Efficient Medical Image Processing using PUMA

Ganesh Dasika, Kevin Fan¹, Scott Mahlke
University of Michigan, Advanced Computer Architecture Laboratory
¹Parakinetics, Inc.
University of Michigan Electrical Engineering and Computer Science


The Advent of the GPGPU
• Increasingly popular substrate for HPC
  – Astrophysics
  – Weather Prediction
  – EDA
  – Financial instrument pricing
  – Medical Imaging


Advantages of GPGPUs
• High degree of parallelism
  – Data-level
  – Thread-level
• High bandwidth
• Commodity products
• Increasingly programmable


Disadvantages of GPGPUs
• Gap between computation and bandwidth
  – 933 GFLOPS : 142 GB/s bandwidth (0.15 B of data per FLOP, ~26:1 Compute:Mem Ratio)
• Very high power consumption
  – Graphics-specific hardware
  – Multiple thread contexts
  – Large register files and memories
  – Fully general datapath
Inefficiencies in all general-purpose architectures


Programmability vs Efficiency?
[Figure: design spectrum with axes labeled Flexibility and Efficiency. Entries, in source order: FPGAs; General Purpose Processors; DSPs; Domain-specific Accelerators, GPGPUs; "???" (highly efficient, some programmability); Loop Accelerators, ASICs]


Medical Image Reconstruction
• Compute intensive loops
  – 32-bit floating point code
  – High data/bandwidth requirements
• Increased demand for portability, low power
• Much current research focuses on using GPGPUs for this domain


CT Image Reconstruction
• X-ray emitters and receptors on opposite sides of patients
• Received X-ray intensity corresponds to tissue density
• Multiple scans (“slices”) taken around the patient are put together to reconstruct one 2D image


Projection & Sinogram
• Projection: All ray-sums in a direction
• Sinogram: All projections
[Figure: X-rays passing through an object f(x,y) in the x-y plane, producing a projection P(t) along axis t; the projections are stacked into a sinogram]


Example: Backprojection
[Figure: Sinogram and Backprojected Image]


Example: Filtered Backprojection
[Figure: Filtered Sinogram and Reconstructed Image]


Reconstruction: Solve for m’s
Densities (“Human Body”):
  m11 m12 m13 m14
  m21 m22 m23 m24
  m31 m32 m33 m34
  m41 m42 m43 m44
[Figure: the density grid sits between an X-Ray Emitter and a detector; detector values shown: 16, 22, 11, 10, 22, 12, 10, 15]


Real Reconstruction Problem
• Intensity measured
• Rays transmitted through multiple “pixels”
• Find individual “pixel” values from transmission data
• 100’s of diagonals @ 100’s of angles
[Figure: grid of unknown pixel values (“?”), each axis labeled “512 values”, surrounded by measured values such as 712, 199, 255, 534, 417, 364, 555, 501, 355]


Medical Imaging Applications

Benchmark             Inner-loop %Scalar/Vector            Outer-loop TLP  Compute:Mem ratio
Segmentation          Fully vectorizable                   Do-all          4:1
Laplacian Filtering   Fully vectorizable                   Do-all          3:1
Gaussian Convolution  Fully vectorizable with predicates   Do-all          6:1
MRI FH Vector         Fully vectorizable                   Do-all          6:1
MRI Q Vector          Fully vectorizable                   Do-all          5.5:1

• Image reconstruction for MRI/CT/PET scans
• Large amounts of Vector/Thread-level parallelism
• FP-intensive kernels
  – Often requiring math library functions
• Data-intensive (~5:1 compute:mem ratio)


Current Concerns: Portability/Power
• Currently, most scans require moving the patient to an imaging room
  – Consumes time
  – Stress on patient
• Studies show benefits of portable, bed-side scanners:
  – 86% increase in patients suitable for post-stroke thrombolytic therapy [Weinreb et al, RSNA]
  – 80-100% drop in scan-related complications [Gunnarsson et al, J. of Neurosurgery]
• New X-Ray emitters push for mAs of current use


Current Concerns: Performance
• High-accuracy CT algorithms take too long
  – Iterative forward/backward projection
  – ~Hours on modern CT scanners instead of minutes
• Interventional radiology
  – Scans currently take minutes, but should take seconds
• CT-Fluoroscopy
  – Several scans done in succession


Flexibility
• Software algorithms change over time
• NRE
• Time-to-market


PUMA
• Tiled architecture
• Bandwidth-matched for improved efficiency
• Each tile is a “Programmable Loop Accelerator”
[Figure: array of tiles with an external interface to CPU, Mem, Disk]


Programmable Loop Accelerator
• Generalize accelerator without losing efficiency
[Figure: the Programmability vs Efficiency spectrum again, with axes labeled Flexibility and Efficiency, Performance; Programmable Loop Accelerators placed at the “???” slot, next to Loop Accelerators, ASICs]


Designing Loop Accelerators
[Figure: a loop in C code is mapped to hardware: function units (+, &, <<, *, BR, MEM) joined by point-to-point connections, with a CRF and local memories]


Loop Accelerator Architecture
• Hardware realization of modulo scheduled loop
• Parameterized hardware:
  – Static Control
  – FUs
  – Point-to-point Interconnect
  – Shift Register Files
[Figure: FSM issuing control signals to FUs (BR, +, &, MEM) joined by point-to-point connections, with CRF and Local Mem]


Programmable Loop-Accelerator Architecture

               LA                       PLA
Functionality  Custom FU set            Generalized FUs + MOVs
Storage        Limited size, no addr.   Rotating Reg. Files
Connectivity   Point-to-point           Ring + Port-swapping
Control        Hardwired Control        Lit. Reg. File + Control Mem

[Figure: PLA datapath showing CRF, literals, control memory and FSM with control signals; FUs BR, + (generalized to +/-), & (generalized to &/|), MEM; SRF and RR register files; point-to-point connections plus a ring; Local Mem]


MRI.FH PLA
• ~0.6 mm² per tile
• 38 FUs
• 128 32-bit registers
• Inter-FU BW 1 TB/sec

FU Type    #
FP-ADDSUB  6
FP-MPY     9
I-ADDSUB   8
MEM        9
I-MPY      1
Other      5


Performance on MRI.FH PLA
[Figure: bar chart of normalized performance (0.0 to 1.0) for MRI.FH, MRI.Q, CT.segment, CT.laplace, CT.gauss; series: Non-Generalized, Generalized; annotations: Unschedulable, II preserved, II doubled]


Efficiency on MRI.FH PLA
[Figure: bar chart of normalized Perf/Power efficiency (0.0 to 1.0) for MRI.FH, MRI.Q, CT.segment, CT.laplace, CT.gauss, mean; series: Non-Generalized, Generalized]


PUMA System Design
• 5 systems designed around 5 benchmarks
• Each composed of identical tiles
• Assume same B/W as GTX280 (142 GB/s)
• # Tiles based on B/W requirements of benchmark
[Figure: array of tiles with an external interface to CPU, Mem, Disk]


System Performance
[Figure: bar chart of GOPs/sec (0 to 160) for MRI.FH, MRI.Q, CT.segment, CT.laplace, CT.gauss; series: Theoretical, Realized; power labels in source order: 4W, 3W, 2.8W, 2.3W, 2.7W]


Performance vs. GPGPU
[Figure: bar chart of TOPs/sec (0.0 to 2.0) for PUMA (Theoretical, Realized), GTS 250, GTX 260, GTX 280, GTX 285, GTX 295; annotations: 2X performance of GTS 250; 63% performance of GTX 295]


Efficiency vs. GPGPU
[Figure: bar chart of PUMA Perf/Power efficiency over GPU (0 to 60) for MRI.FH, MRI.Q, CT.segment, CT.laplace, CT.gauss against GTS 250, GTX 260, GTX 280, GTX 285, GTX 295; annotations: 22X, 54X]


Conclusions
• Power-efficient accelerator for medical imaging
• ASIC-like efficiency with programmability
• 63-201% of GPU performance
• 22-54X GPU Performance/Power efficiency


Thank you!! Questions?
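
Backup: checking the bandwidth arithmetic

The figures quoted under "Disadvantages of GPGPUs" (0.15 B of data per FLOP, ~26:1 Compute:Mem Ratio) can be reproduced with a few lines of Python. This is only a sketch: it assumes that the 26:1 ratio counts floating-point operations per 32-bit (4-byte) word fetched, which is an interpretation that happens to match the numbers and is not stated in the deck. The function names are made up for illustration, and the per-tile bandwidth passed to `tiles_for_bandwidth` is a hypothetical placeholder, not a PUMA measurement.

```python
# Sketch of the compute-to-bandwidth arithmetic quoted in the deck.
GPU_PEAK_GFLOPS = 933.0   # from the slide
GPU_BW_GB_PER_S = 142.0   # from the slide (same B/W assumed for PUMA systems)
BYTES_PER_WORD = 4        # ASSUMPTION: ratio is counted per 32-bit operand


def bytes_per_flop(gflops, gb_per_s):
    """Bytes of off-chip data available per floating-point operation."""
    return gb_per_s / gflops


def flops_per_word(gflops, gb_per_s, bytes_per_word=BYTES_PER_WORD):
    """Operations that must be performed per word fetched to stay compute-bound."""
    return gflops / gb_per_s * bytes_per_word


def tiles_for_bandwidth(system_gb_per_s, tile_gb_per_s):
    """Largest whole number of tiles the memory system can keep fed.

    tile_gb_per_s is the bandwidth one tile consumes on a given benchmark;
    the deck does not list these values, so any number passed in is hypothetical.
    """
    return int(system_gb_per_s // tile_gb_per_s)


print(round(bytes_per_flop(GPU_PEAK_GFLOPS, GPU_BW_GB_PER_S), 2))   # 0.15
print(round(flops_per_word(GPU_PEAK_GFLOPS, GPU_BW_GB_PER_S), 1))   # 26.3
print(tiles_for_bandwidth(GPU_BW_GB_PER_S, 2.0))  # 71, with a HYPOTHETICAL 2.0 GB/s per tile
```

Under that assumption, a kernel with the ~5:1 compute:mem ratio listed for the medical imaging benchmarks would be bandwidth-bound on such a GPU, which is the motivation the deck gives for sizing each PUMA system by bandwidth rather than by peak compute.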