Field-programmable Gate Array Architectures and Algorithms Optimized for Implementing Datapath Circuits
Andy Gean Ye
University of Toronto

Motivation: Datapath Regularity
• Larger FPGAs
  – Larger applications on FPGAs
  – More datapath logic in larger applications
  – Datapath logic is highly regular
• In custom ASICs, regularity is routinely utilized to increase logic density
• Can regularity also be utilized to improve the logic density of FPGAs?

Previous Work
• Datapath-FPGA (DP-FPGA) study [cher96]
  – Yes, datapath regularity can be utilized to reduce FPGA area by as much as 50%
  – Based on a partially specified FPGA architecture
• Major simplifying assumptions
  – All transistors are minimum width
  – Datapaths are completely regular
  – No inefficiency from the CAD tools

This Work – An In-depth Study on Datapath Regularity
• Designed a new datapath-oriented FPGA architecture
  – With detailed architectural specifications
  – With correctly sized transistors
• Utilized realistic datapath benchmarks
  – From the Pico-java processor from Sun
• Created a complete set of CAD tools to support the new architecture
  – Taking CAD inefficiency into account

Multi-bit FPGA (MB-FPGA)
• Architected to utilize datapath regularity to generate area savings
• Architectural features
  – Capture regularity using special logic blocks called super-clusters
  – Increase logic density through configuration-memory-sharing (conf. mem. shar.) routing resources

MB-FPGA – Overview
[Figure: MB-FPGA overview. Labels: logic blocks (L), switch blocks (S), routing channels, conf. mem. shar. routing tracks, conventional routing tracks]

MB-FPGA – Logic Block
• Cluster = Bit-Slice
[Figure: a logic block containing Cluster 1 to Cluster 4; each cluster holds BLEs and a local routing network (LRN). Inset: a basic logic element (BLE) with a LUT, a DFF, and a MUX. Other label: M]

Capturing Datapath Regularity
[Figure: BLEs grouped into Bit-Slice 1, Bit-Slice 2, and Bit-Slice 3]

MB-FPGA – Routing Architecture
[Figure: routing architecture. Labels: Cluster, M, Conventional Routing, Conf. Mem. Shar. Routing, Switch Block, Logic Block]

Utilizing Datapath Regularity to Save Area
[Figure: logic blocks (L) connected through conf. mem. shar. tracks. Other label: M]

Area Estimation Using Correct Transistor Sizing
• Based on the fully specified MB-FPGA architecture
• Detailed assumptions
  – SRAM transistors are min. width
  – Tri-state buffers are 5x min. width
  – 75% of FPGA area is routing area
• Simplifying assumptions
  – Datapaths are completely regular (all tracks are conf. mem. shar. tracks)
  – No inefficiency from the CAD tools

Area Estimation Using Correct Transistor Sizing
• Datapath regularity can only be used to reduce the MB-FPGA area by 25%
• Down from the 50% area savings predicted by the DP-FPGA study [cher96]

Benchmark Regularity
• Fifteen benchmark circuits
  – From the Pico-java processor
  – Implemented on the MB-FPGA
• Measurements after synthesis
  – Logic regularity
  – Net regularity

Logic Regularity
• Classify LUTs and DFFs into two types
  – Irregular type
    • LUTs and DFFs that do not belong to any 4-bit wide datapath component
  – Regular type
    • LUTs or DFFs that belong to a 4-bit wide datapath component
• More LUTs and DFFs of the regular type means
  – More regular nets
  – Greater area savings

A Datapath Component
• A datapath component is a group of 4 identical LUTs or DFFs
[Figure: four identical LUTs or DFFs, one in each of bit-slices S1 to S4]

Logic Regularity

Circuit        #LUT + #DFF   #LUT + #DFF in Datapath Components   % LUT & DFF in Datapath Components
dcu_dpath             1254                                  1188                                  95%
ex_dpath              2917                                  2740                                  94%
icu_dpath             3590                                  3460                                  96%
imdr_dpath            1388                                  1292                                  93%
pipe_dpath             689                                   575                                  83%
smu_dpath              683                                   618                                  90%
ucode_dat             1528                                  1448                                  95%
ucode_reg              156                                   132                                  85%
code_seq_dp            439                                   204                                  46%
exponent_dp            565                                   384                                  68%
incmod                 939                                   836                                  89%
mantissa_dp           1070                                   964                                  90%
multmod_dp            1827                                  1540                                  84%
prils_dp               388                                   324                                  84%
rsadd_dp               305                                   281                                  92%
Total                17738                                 15986                                  90%

Net Regularity
• Classify two-terminal connections in each circuit into three types
  – Regular 4-bit wide buses
  – Regular 4-bit wide control groups
  – Irregular
    • Two-terminal connections that do not belong to either a bus or a control group

Definition – Net Regularity
[Figure: a 4-bit wide bus connecting bit-slices S1 to S4 to bit-slices S1 to S4, and a 4-bit wide control group fanning out to bit-slices S1 to S4]
Note: Only 4-bit wide buses can be used to increase the area efficiency of the MB-FPGA through conf. mem. shar. routing tracks

Net Regularity

Circuit        Total Two-Terminal Connections   % in 4-Bit Wide Buses   % in Fan-Out 4 Groups
dcu_dpath                                2232                     49%                     43%
ex_dpath                                 6547                     52%                     39%
icu_dpath                                8047                     47%                     36%
imdr_dpath                               3100                     50%                     36%
pipe_dpath                               1049                     48%                     42%
smu_dpath                                1167                     48%                     25%
ucode_dat                                3143                     52%                     41%
ucode_reg                                 194                     72%                     21%
code_seq_dp                               799                     58%                     18%
exponent_dp                              1362                     32%                     23%
incmod                                   2013                     42%                     33%
mantissa_dp                              2533                     47%                     36%
multmod_dp                               3380                     39%                     25%
prils_dp                                  864                     41%                     32%
rsadd_dp                                  722                     52%                     27%
Total                                   37152                     48%                     35%

Area Estimation Based on Correct Net Regularity
• Assumptions
  – SRAM transistors are min. width
  – Tri-state buffers are 5x min. width
  – 75% of FPGA area is routing area
  – 50% of routing tracks are conf. mem. shar.
  – No inefficiency from the CAD tools
• Result
  – Datapath regularity can be utilized to reduce FPGA area by 12% (again down, this time from 25%)

Datapath-oriented CAD Flow – Overview
[Figure: CAD flow diagram. Labels: Enhanced Module Compaction, Synthesis, Coarse-grain Node Graph, Packing, Multi-bit FPGA, Placement, Coarse-grain Resource, Routing]

Can Regularity Be Utilized to Improve Logic Density?
• To achieve the best area
  – What is the best number of clusters per logic block?
  – What is the best number of conf. mem. shar. routing tracks per routing channel?
• What is the performance of this datapath-oriented FPGA?

Experiments
• Fifteen benchmark circuits
  – From the Pico-java processor
  – Implemented on the MB-FPGA
• Experiments
  – Granularity (the number of clusters per logic block) vs. area
  – % conf. mem. shar. tracks vs. area
  – % conf. mem. shar. tracks vs. performance

Granularity Vs. Area
• Explored a 2-D architectural space
  – First vary granularity
  – For each granularity, vary the % of conf. mem. shar. routing tracks per routing channel
• For each architecture, find the average area required to implement the benchmark circuits
• Plot the best area for each granularity

Granularity Vs. Area
[Plot: avg. area (10e6), y-axis from 1.4 to 3, versus granularity, x-axis values 2, 4, 8, 12, 16, 20, 24, 28, 32]

% C.M.S. Tracks Vs. Area
• Assume four clusters per logic block for the MB-FPGA
• For each circuit
  – Set a fixed number of conf. mem. shar. tracks
  – Search for the minimum number of additional conv. tracks
• Classify the results into eight percentage ranges of conf. mem. shar. tracks
• Use the minimum area obtainable for each circuit to calculate the average area

% C.M.S. Tracks Vs. Area
• Also implement the same benchmarks on a comparable conventional FPGA
• MB-FPGA area is normalized against the conventional FPGA area

% C.M.S. Tracks Vs. Area
[Plot: normalized avg. area, y-axis from 0.88 to 1.00, versus % conf. mem. shar. tracks, x-axis ranges 0%, 0%-10%, 10%-20%, 20%-30%, 30%-40%, 40%-50%, 50%-60%, 60%-70%. Annotation: 10%]

Area (40%-50% of Tracks Are C.M.S.)

Circuit        Conventional FPGA Area (10e5)   Datapath-oriented FPGA Area (10e5)   Datapath-oriented FPGA Area (Normalized)
icu_dpath                               56.0                                 48.9                                       0.87
ex_dpath                                50.8                                 38.8                                       0.76
multmod_dp                              22.4                                 25.0                                       1.10
imdr_dpath                              20.0                                 17.2                                       0.86
ucode_dat                               18.6                                 16.1                                       0.86
mantissa_dp                             15.5                                 14.8                                       0.96
dcu_dpath                               13.2                                 11.5                                       0.87
incmod                                  9.89                                 11.6                                       1.17
exponent_dp                             7.02                                 7.66                                       1.10
smu_dpath                               6.72                                 6.69                                       1.00
pipe_dpath                              5.37                                 5.19                                       0.97
prils_dp                                4.77                                 4.67                                       0.98
code_seq_dp                             4.77                                 4.51                                       0.95
rsadd_dp                                4.16                                 3.56                                       0.86
ucode_reg                               1.00                                 1.04                                       1.00
Avg. Area                               16.0                                 14.5                                       0.90

Performance (Crit. Path Delay)
• Assume carry network delay is equal to local routing network delay
  – Carry delay is over-estimated
  – Results are pessimistic
• Average crit. path delay over the 15 benchmark circuits is normalized with respect to the conventional FPGA

% C.M.S. Tracks Vs. Crit. Path
[Plot: normalized avg. delay, y-axis from 1.08 to 1.14, versus % conf. mem. shar. tracks, x-axis ranges 0%, 0%-10%, 10%-20%, 20%-30%, 30%-40%, 40%-50%, 50%-60%, 60%-70%]

Crit. Path Delay (40%-50% of Tracks Are C.M.S.)

Circuit        Conv. FPGA Crit. Path Delay (ns)   D.P. FPGA Crit. Path Delay (ns)   D.P. FPGA Crit. Path Delay (Normalized)
code_seq_dp                                16.7                              15.5                                      0.93
dcu_dpath                                  8.27                              10.5                                      1.3
ex_dpath                                   42.1                              47.6                                      1.1
exponent_dp                                21.0                              22.5                                      1.1
icu_dpath                                  24.9                              34.7                                      1.4
imdr_dpath                                 46.0                              47.2                                      1.0
incmod                                     45.5                              42.8                                      0.94
mantissa_dp                                13.9                              12.3                                      0.88
multmod_dp                                 33.3                              36.1                                      1.1
pipe_dpath                                 13.4                              17.2                                      1.3
prils_dp                                   23.0                              25.5                                      1.1
rsadd_dp                                   35.8                              39.6                                      1.1
smu_dpath                                  29.3                              35.3                                      1.2
ucode_dat                                  11.3                              11.6                                      1.0
ucode_reg                                  3.58                              4.25                                      1.2
Avg.                                       20.2                              22.3                                      1.1

Conclusions
• Investigated the question
  – Can regularity be effectively utilized to improve logic density?
• Presented
  – A datapath-oriented FPGA architecture
    • Fully specified to the level of transistor sizing
  – An analysis of datapath regularity
  – A brief description of the CAD flow for the architecture

Conclusions
• Detailed architectural specification and CAD implementation are very important
• Best MB-FPGA architecture
  – Granularity = 4
  – 40%-50% of tracks are C.M.S.
• Architectural results
  – 10% smaller in area than the conv. FPGA
  – Much less than the 50% area savings predicted in [cher96]
  – Has a 10% performance penalty

Discussions
• Under what circumstances will the MB-FPGA be more area efficient?
  – Applications with more buses than our benchmarks
  – Wider datapath applications
  – Larger than 1x min. width transistors in SRAM cells
  – Smaller than 5x min. width transistors in tri-state buffers
  – SRAM reduction is more important than area reduction

Future Work
• Architecture
  – Sharing configuration memory in logic
  – Improve performance
• CAD tools
  – Proper modeling of carry network delay
  – Improve performance
  – Power modeling

Detailed Datapath-oriented CAD Implementation Issues
Andy Gean Ye
University of Toronto

Datapath-oriented CAD Flow – Overview
[Figure: CAD flow diagram. Labels: Enhanced Module Compaction, Synthesis, Coarse-grain Node Graph, Packing, Multi-bit FPGA, Placement, Coarse-grain Resource, Routing]

Input to CAD Flow
• Netlists of datapath components in Verilog or VHDL
• From a pre-defined library
  – Arithmetic operators
  – Logic operators
  – Multiplexers
• Datapath regularity of the input is preserved throughout the CAD flow

An Example Input Datapath Circuit
[Figure: a 4-bit example circuit with four mux blocks and four adders (+). Signal labels: a0-a3, b0-b3, c0-c3, d0-d3, sel, cin, s0-s3, cout]

Synthesis
• Synopsys FPGA Compiler has 38% area inflation when instructed to preserve datapath regularity
• Two major causes of area inflation
  – Duplicated logic across bit-slices
  – Bit-slices are too small
• Augmented FPGA Compiler with new algorithms
  – Reduced the area inflation to 3%

Packing
• Based on the T-VPACK [betz99] algorithm
• Like T-VPACK, it is timing driven
• New feature: the ability to preserve datapath regularity

After Synthesis and Packing
[Figure: the example circuit mapped to BLEs. Signal labels: a0-a3, b0-b3, c0-c3, sel, d0-d3, cin, s0-s3, cout; a bus connects the two groups of BLEs]

Placement and Routing
• Based on the VPR tools [betz99]
  – Placer: simulated annealing [kirk83]
  – Router: congestion-negotiation-based Pathfinder [ebel95]
• New features of the placer
  – Moves individual clusters if they do not contain datapath logic
  – Moves an entire logic block if it contains datapath logic, to preserve datapath regularity

Router
• Contains a new set of expansion cost functions
  – Designed to ease the task of comparing the cost of using conv. tracks against the cost of using conf. mem. shar. tracks
  – Composed of delay and congestion metrics (similar to the conventional expansion cost)

Overall Routing Flow
[Flow diagram with steps: Route Buses; Route Non-bus Signals; Update Cost Functions]

Routing Buses
• Route the entire bus through conf. mem. shar. routing tracks
• Route the first bit through conv. routing tracks
  – Test for delay and congestion
• Compare the expansion costs
• Select the option with the lowest expansion cost

Routing Non-bus Signals
• Consider the options of routing the signal through conv. as well as conf. mem. shar. tracks
• Compare the expansion costs
• Select the option with the lowest expansion cost
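
The selection step shared by the two routing slides above (compare the expansion costs, then pick the cheapest option) can be sketched in Python. This is an illustration only, not the router described in the slides: the `RouteOption` type, the `expansion_cost` formula (a criticality-weighted mix of delay and congestion), and all numbers are assumptions. The slides say only that the cost is composed of delay and congestion metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteOption:
    """One candidate route for a signal or bus (hypothetical type)."""
    track_type: str    # "cms" (conf. mem. shar.) or "conv" (conventional)
    delay: float       # estimated delay of the candidate route
    congestion: float  # estimated congestion penalty of the route


def expansion_cost(option: RouteOption, criticality: float) -> float:
    # Assumed form: timing-critical signals weight delay more heavily,
    # non-critical signals weight congestion more heavily.
    return criticality * option.delay + (1.0 - criticality) * option.congestion


def select_route(options: list[RouteOption], criticality: float) -> RouteOption:
    # Select the option with the lowest expansion cost.
    return min(options, key=lambda o: expansion_cost(o, criticality))


# Made-up numbers: a bus routed on conf. mem. shar. tracks versus its
# first bit routed on conventional tracks.
bus_via_cms = RouteOption("cms", delay=2.0, congestion=1.0)
first_bit_via_conv = RouteOption("conv", delay=1.5, congestion=3.0)

choice = select_route([bus_via_cms, first_bit_via_conv], criticality=0.5)
print(choice.track_type)
```

With these made-up numbers, the conf. mem. shar. option wins at moderate criticality because it is less congested. The conventional option wins when criticality is 1.0 because it is faster.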