Floating-Point Reuse in an FPGA Implementation of a Ray-Triangle Intersection Algorithm

Adrian Javelo (UCLA) and Craig Ulmer (Sandia National Laboratories/CA)
[email protected]
June 27, 2006

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Ray-Triangle Intersection Algorithm
• Möller and Trumbore algorithm (1997)
  – Computes the (t, u, v) intersection point
• Modified to remove division
  – 24 adds, 26 multiplies, 4 compares, 15 delays
  – 17 inputs, 2 outputs (plus 4 bits)
• Goal: build for a V2P20
  – Catch: do it in 32-bit floating-point
  – Assume: 5 adders, 6 multipliers, 4 comparators

Outline
• Overview
  – Reusing floating-point hardware
• Adapting the Algorithm
  – Operation Scheduling
  – Mapping Operations to Units
  – Intermediate Data Values
• Performance Observations
  – Ongoing Work: Automation
• Summary

Floating-Point and FPGAs
• Floating-point has long been a weakness of FPGAs
• Recent high-quality FP libraries
  – SNL: Keith Underwood & K. Scott Hemmert
  – USC, ENS Lyon, Nallatech, SRC, Xilinx
• FP units are still challenging to work with
  – Deeply pipelined
  – Require sizable resources

  Single-Precision Function | Pipeline Stages | Max in a V2P20
  --------------------------|-----------------|---------------
  Add                       | 10              | 14
  Multiply                  | 11              | 18
  Multiply (no denormals)   | 6               | 22
  Divide                    | 31              | 4

Implementing a Computational Kernel
• Desirable approach: a full pipeline
  – One FP unit per operation
  – Issue a new iteration every cycle
• Problems
  – Rapidly runs out of chip space
  – Input bandwidth
  – Low utilization on "one-time" ops
• Need to consider techniques for reusing units

Our Approach: Recycling Architecture
• Build a wrapper around an array of FP units
  – Apply traditional compiler techniques
  – Customize the hardware data path
• [Figure: inputs pass through an input-selection stage, under control logic, into the FP-unit array; results go to intermediate buffering, which feeds back into input selection or out as outputs]

Operation Scheduling
• Sequence execution on the FP array
• Extract the Data Flow Graph (DFG)
  – Wide and shallow
  – Need more parallelism
• Loop unrolling / strip mining
  – Pad FP units out to latency P
  – Work on P iterations at a time
  – Sequentially issue a strip of P iterations
  – Thus: ignore FP latency in scheduling

Step-by-Step Scheduling
• [Figure: step-by-step occupancy of the adders and multipliers for single-strip and double-strip schedules]

  Schedule                   | Adds | Multiplies
  ---------------------------|------|-----------
  One strip                  | 40%  | 36%
  One strip, back-to-back    | 53%  | 48%
  Double strip               | 64%  | 57%
  Double strip, back-to-back | 80%  | 72%

Mapping Operations to Units
• Assign each operation in the schedule to a specific unit
  – Assignments affect the input selection unit's hardware
• Two strategies: First-Come-First-Serve and a heuristic
• [Figure: the input selection unit routes external inputs and intermediate buffers into the adders and multipliers]

Mapping Effects
• [Figure: multiplexers required (MUX3–MUX7) at each FP-unit input for the First-Come-First-Serve and heuristic mappings; the heuristic needs fewer, narrower multiplexers]

Buffering Intermediate Values
• Necessary for holding values between stages
  – Input vs. output buffering
  – Block RAM (BRAM) vs. registers
• Focus on output buffering with registers
  – A "delay pipe" of P registers houses a strip of P values (instead of a dual-port BRAM)

Two Strategies
• Independently-writable delay blocks
  – Minimize the number of buffers
  – 40 memories, 40 MUXs
• Input-chaining delay blocks
  – Minimize control logic
  – 81 memories, 0 MUXs
• Chaining: 6% faster, 19% smaller, and 400% faster to build!
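The strip-mined scheduling step above can be sketched in a few lines. This is an illustrative toy (the DFG, function, and variable names are my own, not the authors' tool): because each scheduled operation issues a strip of P independent iterations back-to-back through a deeply pipelined unit, the scheduler can ignore FP latency and simply pack ready operations into steps, bounded by the number of adders and multipliers.

```python
def schedule(ops, units):
    """Greedy list scheduler. ops = [(name, kind, [deps])]; units = {kind: n}.
    With strip mining, each scheduled op occupies its unit for P back-to-back
    cycles (one per iteration in the strip), so FP latency is ignored here:
    a dependency only has to be scheduled in an earlier step."""
    done, steps = set(), []
    while len(done) < len(ops):
        free = dict(units)          # units still available this step
        step = []
        for name, kind, deps in ops:
            if name not in done and free[kind] > 0 and all(d in done for d in deps):
                step.append(name)
                free[kind] -= 1
        if not step:                # guard against a malformed (cyclic) DFG
            raise ValueError("DFG has a cycle or an unschedulable op")
        done.update(step)
        steps.append(step)
    return steps

# Toy DFG: three products reduced by two adds, then a final multiply.
toy_dfg = [
    ('a', 'mul', []), ('b', 'mul', []), ('c', 'mul', []),
    ('d', 'add', ['a', 'b']), ('e', 'add', ['b', 'c']),
    ('f', 'mul', ['d', 'e']),
]
steps = schedule(toy_dfg, {'add': 2, 'mul': 2})

# Utilization per unit type across the schedule, as in the slides' tables:
kinds = {n: k for n, k, _ in toy_dfg}
util = {k: sum(kinds[n] == k for s in steps for n in s) / (len(steps) * c)
        for k, c in {'add': 2, 'mul': 2}.items()}
```

Issuing a second strip back-to-back (the slides' "double strip") fills idle slots at the head and tail of the schedule, which is where the utilization gains in the table come from.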
Performance
• Implemented three designs: single-strip, double-strip, and full-pipeline (the full pipeline requires a V2P50)

  Design        | V2P20 Area | Clock Rate | GFLOPS | Input Bandwidth (bytes/clock)
  --------------|------------|------------|--------|------------------------------
  Single-strip  | 70%        | 155 MHz    | 0.9    | 7.6
  Double-strip  | 79%        | 148 MHz    | 1.2    | 11.3
  Full pipeline | 199%       | 142 MHz    | 7.1    | 68

Ongoing Work: Automation
• A natural fit for automation
  – Built our own tools for DFG analysis and VHDL generation
• Experiment
  – No strip mining
  – Change the number of FP units (1–32)
  – Change the number of operations per block (1–16)
  – Find the number of clocks needed for 128 operations
• [Figure: cycles for 128 operations versus number of units, for 1–16 operations per block]

Concluding Remarks
• Reusing FP units enables FPGAs to process larger kernels
  – Apply traditional scheduling tricks to increase utilization
  – Algorithm shape affects performance
  – Simplicity wins
• Simple DFG tools go a long way
  – Easy to adjust parameters and generate hardware
  – Focus on kernels instead of complete systems
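The GFLOPS column follows directly from busy units times clock rate. As a back-of-the-envelope check (my own sketch; the implied utilizations for the reuse designs are inferred from the table, not stated on the slides, and comparators are not counted as FLOPs): the full pipeline has one unit per operation, so 50 units (24 adds + 26 multiplies) at 142 MHz gives exactly 7.1 GFLOPS.

```python
# Back-of-the-envelope check of the GFLOPS table: peak = FP units x clock,
# and reported GFLOPS / peak gives the implied utilization of the array.
designs = {
    # name: (FP units, clock in MHz, reported GFLOPS)
    'single-strip':  (11, 155, 0.9),   # 5 adders + 6 multipliers, reused
    'double-strip':  (11, 148, 1.2),
    'full-pipeline': (50, 142, 7.1),   # 24 adders + 26 multipliers
}
utilization = {}
for name, (units, mhz, gflops) in designs.items():
    peak = units * mhz / 1000.0        # peak GFLOPS of the array
    utilization[name] = gflops / peak
```

The full pipeline comes out at 100% by construction, while the single-strip design lands near the roughly 50% back-to-back utilization reported on the scheduling slide, which is consistent with the two sets of numbers describing the same design.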