Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Power Modeling and Architecture Evaluation for FPGA with Novel Circuits for Vdd Programmability Yan Lin, Fei Li and Lei He EE Department, UCLA [email protected] Partially supported by NSF. Overview FPGA architecture evaluation Area and delay [Rose et al, JSSC’90] Power [Poon et al, FPLA’02][Li et al, FPGA’03] Vdd programmability for power reduction Concept in [FPGA’03] Application to logic [FPGA’04][DAC’04] Application to interconnects [ICCAD’04][Anderson et al, ICCAD’04] Novel circuits and Architecture evaluation for FPGAs with Vdd-programmability Reduce power by 50% with 17% area and 3% delay increase Outline Power modeling and architecture evaluation methodology FPGA Circuits for Vdd Programmability Architecture Evaluation with Vdd programmability Conclusions and Ongoing Work Framework fpgaEva-LP Benchmark circuits Logic Optimization(SIS) Tech-Mapping (RASP) Arch Spec Parasitic Extraction Timing-Driven Packing (TV-Pack) Placement & Routing (VPR) Area Delay Cycle-accurate Power Simulator Power FPGA Structure and Models Cluster-based Island Style FPGA Structure Area and delay models similar to [Betz-RoseMarquardt] 100% buffered interconnects, subset switch block input fc = 50%, output fc = 25% But based on layout and SPICE for 100nm and below Mixed-level power model from [FPGA’03] Dynamic power Capacitive power Short-circuit power ( transition time) Capacitive power Functional switch Glitch Static Power Sub-threshold leakage Reverse biased leakage Gate leakage New Power Model in fpgaEva-LP2 Short-circuit power switching time * switching power fpgaEva-LP used average signal transition time fpgaEva-LP2 calculates transition time for each buffer as tr tbuffer, the buffer delay is NOT a constant 2 as in literature due to input slew is pre-characterized by SPICE buffer delay <0.012 ns < 0.03 ns >0.03 ns α 2 4.4 7 Validation Using SPICE Validate by comparison for each power-component High fidelity with average absolute error of 8% 0.0025 SPICE simulation fpgaEVA-LP fpgaEVA-LP2 0.002 FPGA Power (watt) 0.0015 0.001 0.0005 0 b1 parity cm138a Benchmark Circuits z4ml decode Impact of Random Seeds in VPR 5.6 circuit: s38584 FPGA Energy (nJ/cycle) 5.55 8 10 5.5 3 5.45 6 +5% 5.4 5 2 7 5.35 4 5.3 5.25 10.2 9 +12% 1 10.4 10.6 10.8 11 11.2 11.4 11.6 11.8 12 Critical Path Delay (ns) 12% delay variation and 5% energy variation Min-delay solution among 10 runs is used Total FPGA Energy (nJ/cycle) Evaluation of Single-Vdd FPGAs 9 (12, 7) 8 7 (10, 7) (12, 6) 6 (8, 7) (6, 7) (6, 6) (10, 5) (8, 5) (12, 4) 5 4 (6, 3) (6, 5) (6, 4) (10, 4) (8, 4) 10 11 12 13 14 15 16 17 Critical Path Delay (ns) Cluster size N = {6, 8, 10, 12} LUT size k = {3, 4, 5, 6, 7} Energy-delay (ED) dominant architectures (12, 5) Architectures explored (8, 3) 3 9 (10, 6) (8, 6) (10, 3) (12, 3) Architecture with smaller delay or less energy (compared to any other architecture) Relaxed ED dominant set may be also valuable Energy versus Delay For 100nm ITRS technology Min-Energy arch (N,k)=(10,4) or (8.4) Min-Delay arch (N,k)=(8,7) 0.8x delay but 1.7x power 9 Total FPGA Energy (nJ/cycle) Current commercial architecture (12, 7) 8 7 (10, 7) (12, 6) 6 (8, 7) (10, 6) (8, 6) (6, 7) (6, 6) (10, 5) (8, 5) (12, 4) 5 4 (12, 3) (10, 3) (8, 3) (12, 5) (6, 3) (6, 5) (6, 4) (10, 4) (8, 4) 3 9 10 11 12 13 14 Critical Path Delay (ns) 15 16 17 Outline Power modeling and evaluation methodology FPGA Circuits for Vdd Programmability Architecture Evaluation with Vdd programmability Conclusions and Ongoing Work Vdd-programmable FPGA [DAC’04][ICCAD’04] Vdd-programmable logic block Vdd selection Power-gating unused blocks Vdd-programmable FPGA [FPGA’04][ICCAD’04] Vdd-programmable logic block Vdd selection Power-gating unused blocks Vdd-programmable switch Vdd-level conversion is needed when VddL drives VddH To avoid excessive leakage Vdd-programmable Routing Switch Conventional routing switch Vdd-programmable routing switch Brute-force design [ICCAD’04] Two extra SRAM cells for each routing switch New design One extra SRAM cell NAND2 gate –- minimum size & high-Vt transistor Vdd-Programmable Interconnect Connection Block Brute-force design [ICCAD’04] 2n extra SRAM cells for n connection switches New design Only TWO extra SRAM cells for n connection switches Control logic includes 2n NAND2 and a decoder Power and Delay Vdd-programmable switch uses Compared to conventional switch 4X PMOS power transistor for 7X routing switch 1X PMOS power transistor for 4X connection switch 1000X less leakage power Connection box is 28% faster and has 18% less dynamic power By moving mux from critical path of connection box Switch delay (ns) Energy per switch (Joule) (Vdd=1.3v) Type w/o power transistor w/ power transistor w/o power transistor w/ power transistor Routing 5.9E-11 6.5E-11(+11%) 3.3E-14 3.2E-14 (-2%) Connection 2.9E-10 2.1E-10(-28%) 3.8E-14 3.1E-14(-18%) Vdd-gateable Routing Switch Conventional Vdd-gateable two states Normal Vdd or Power-gating Enable power-gating capability w/o extra SRAM cells Power transitor Can be replaced by tri-state buffer Vdd-gateable Connection Block Conventional Vdd-gateable Enable power-gating capability w/ only one extra SRAM and a low leakage decoder Outline Power modeling and evaluation methodology FPGA Circuits for Vdd Programmability Architecture Evaluation with Vdd programmability Conclusions and Ongoing Work FPGA Architecture Classes Architecture Class Logic Block Interconnect Class0 (baseline) single-Vdd single-Vdd Class1 programmable dual-Vdd programmable dual-Vdd, level converters in routing Class2 programmable dual-Vdd VddH and Vdd-gateable Class3 programmable dual-Vdd Class 1, but no level converters in routing High-Vt is applied to configuration SRAM cells for all the classes Vdd-level Converters Class3 removes Vdd-level converters from interconnects in Class1 With constraints that no VddL drives VddH We developed a routing that one routing tree has a single Vdd level But trees with different Vdd-levels can share the same wire track Alternative approaches: Combined vdd-level converter and buffer [Anderson et al, ICCAD’04] Our new work [DAC’05] allows dual vdd in a tree with a chip level time slack budgeting for extra power reduction Energy versus Delay Total FPGA Energy/Cycle (nJ) LUT 7 6 High Performance Class 0 Class 1 Class 2 Class 3 (8, 7) 5.5 (6, 7) 5 4.5 (12, 4) 4 (8, 7) (6, 7) 3.5 (8,7) (6,7) (8,7)(6,7) 3 2.5 LUT 4 Low Energy (6, 6) (8, 6) (10, 5) (8, 5) 2 (6, 6) (6, 5) (8, 4)(6, 4) (10, 4) (10, 5) (8, 5) (12, 4) (10,6) (6,6) (8,6) (10,5) (8,5) (10,6) (6,6) (12,4) (8,6) (10,5) (8,5) (12,4) (8, 4) (6, 4) 1.5 10 10.5 11 11.5 12 12.5 13 Critical Path Delay (ns) ED-product reduction 20% by Class1 (Vdd-programmable interconnects w/ level converters) 45% by Class2 (Vdd-gateable interconnects) 50% by Class3 (class1 minus level converters) Performance degrades 3% due to Vdd programmability Energy versus Area Class0 6 Class1 Class2 Class3 Total FPGA Energy/Cycle (nJ) (8,7) (6,7) (10,5) 5 (8,5) 4 (6,5) (6,4) (8,4) (10,4) (12,4) (8,7) 3 (6,7) (8,7) (8,6) (8,4) (12,4) (10,4) (8,7) (10,6) (6,6) (8,5) (10,5) (10,4) 2 Min-area Min-energy (8,6) (6,6) (8,4) (6,7) (8,6) (8,5) (10,5) (8,4) (6,4) (6,7) (10,5) (12,4) (6,6) (8,5) (10,6) (6,6) (12,4) 1 6.00E+06 8.00E+06 1.00E+07 1.20E+07 1.40E+07 1.60E+07 1.80E+07 2.00E+07 2.20E+07 2.40E+07 2.60E+07 Total FPGA Device Area Average area overhead 118% for Class1 (Vdd-programmable interconnects w/ level converters) 17% for Class2 (Vdd-gateable interconnects) 52% by Class3 (Vdd-programmable interconnects w/o level converters) Class2 is the best considering both energy and area Energy Breakdown Total FPGA Energy (nJ/Cycle) 4.5 2.94% 3.71% 4 16.03% 3.5 8.09% 3 2.70% 3.04% Logic Leakage Energy Logic Dynamic Energy Local Interconnect Leakage Energy Local Interconnect Dynamic Energy Global Interconnect Leakage Energy Global Interconnect Dynamic Energy 2.5 26.22% 2 4.07% 3.92% 7.43% 4.40% 4.32% 49.89% 1.5 39.69% 1 42.93% 9.81% 42.84% 10.81% 5.85% 4.88% 0.5 0 19.33% Class0 37.62% 17.77% Class1 31.70% Class2 Class3 FPGA Architecture (N,k) = (12,4) Class2 and Class3 dramatically reduce global interconnect leakage But class1 fails due to leakage in Vdd-level converters Area Overhead 20% 18% 1.39% Power Transistors & SRAMs (CLBs) 1.80% Vdd-level Converters (CLBs) 4.82% Control (Connection Blocks) 16% Logic Blocks 3.19% 14% FPGA Area Overhead 12% 10% Connection Blocks 10.38% 8% 4.96% Power Transistors (Connection Blocks) 0.60% SRAMs (Connection Blocks) 6% 4% 2% Routing Switches 3.87% 3.87% Power Transistors (Routing Switches) 0% Class2: Vdd-gateable interconnects + Vdd-programmable CLBs(12, 4) 17% = 9% for power transistors + 5% for control + 2% for SRAM Conclusions and New Results Field programmability is needed for fine-grained dual-vdd and Vdd-gating in FPGA Vdd-gating offers a better area-power tradeoff than Vddselection 45% energy-delay product reduction with 17% area overhead Architecture with Vdd-programmability LUT size 4 low energy and area LUT size 7 best performance New results [dac’05] Time slack allocation for Vdd-programmable interconnects Device and architecture co-optimization for 77% energydelay reduction References and Download All references and tools at http://eda.ee.ucla.edu Results in the slides have been updated compared to the paper in ISFPGA’05