NATURE: Non-Volatile Nanotube RAM based Field-Programmable Gate Arrays
A Hybrid CMOS/NAnoTUbe REconfigurable Architecture

Wei Zhang†, Niraj K. Jha† and Li Shang‡
† Dept. of Electrical Engineering, Princeton University
‡ Dept. of Electrical and Computer Engineering, Queen’s University

Outline
- Motivation
- Background on CNT and NRAM
- Architecture of NATURE
- Logic Folding
- Experimental Results
- Conclusions

Motivation
- Moore’s Law: What’s Next?
  - Carbon nanotubes (CNTs)
  - Nanowires
  - Single electron devices
  - ...
- Challenges in nano-circuits/architectures
  - Lack of a mature fabrication process
  - Defects and run-time failures
- Reconfigurable architectures, such as an FPGA, favored
  - Regular structures ease fabrication
  - Fault tolerance through reconfiguration

Motivation (Contd.)
- Problems of existing reconfigurable architectures
  - High reconfiguration time overhead
  - Low area efficiency
- Some recent works on programmable nanofabrics
  - Molecular logic array (Goldstein et al. [ICCAD 2002])
  - Nanowire PLA (Dehon et al. [FPGA 2004])
  - CMOS/nanowire hybrid architecture CMOL (Strukov et al. [Nanotechnology 2005])
- Fabrication problem not yet solved

Advantages of NATURE
[Diagram: NATURE at the center, surrounded by its properties: CMOS fabrication compatible, NRAM-based, run-time reconfiguration, temporal logic folding, design flexibility, logic density]
- Hybrid design leverages beneficial aspects of both CMOS and CNT technologies
- NRAMs are distributed in NATURE to store multicontext reconfiguration bits
  - Fine-grain reconfiguration (even cycle-by-cycle)
  - Enables temporal logic folding
- Logic density
  - Flexibility to perform area-performance trade-offs
  - One-to-two orders of magnitude increase in logic density

Background
Carbon nanotube (CNT)
- Metallic or semiconducting
- Single-wall or multi-wall
- Diameter: 1-100 nm
- Length: up to millimeters
- Ballistic transport
- Excellent thermal conductivity
- Very high current density
- High chemical stability
- Robust to environment
[Figure: carbon nanotube. Source: Euronanotrade]

Background (Contd.)
Non-volatile nanotube random-access memory (NRAM)
- Whether the nanotube is mechanically bent or not determines the bistable on/off states
- Fully CMOS-compatible manufacturing process
- Prototype chip: 10 Gbit NRAM
- Will be ready for the market in the near future
[Figure: NRAM cell. Source: Nantero]

NRAMs
- Properties of NRAMs
  - Non-volatile
  - Similar speed to SRAM
  - Similar density to DRAM
  - Chemically and mechanically stable
- NATURE is not tied to NRAMs; other non-volatile memories could be used:
  - Phase change RAM
  - Magnetoresistive RAM
  - Ferroelectric RAM

Architecture of NATURE
- Island-style logic blocks (LBs) connected by various levels of interconnects
- S1: switch box between length-1 wires
- S2: switch box between length-4 wires
- An LB contains a super macroblock (SMB) and a local switch matrix
- Switch matrix: local routing network
[Figure: array of LBs with connection blocks, switch blocks (S1, S2), length-1 wires, length-4 wires, long wires and direct links; inset shows an LB made of an SMB and a switch matrix]

Architecture of a Super Macroblock (SMB)
- n1 macroblocks (MBs) comprise an SMB; here n1 = 4
[Figure: four MBs, each with its own NRAM, SRAM bits and a 48-to-16 crossbar fed from the switch matrix; CLK, global signals and reconfiguration bits are distributed to every MB; the SMB has 32 outputs]

Architecture of a Macroblock (MB)
- n2 logic elements (LEs) comprise an MB; here n2 = 4
[Figure: four LEs, each with its own NRAM, 48 SRAM bits and a 12-to-4 crossbar fed from the inputs to the MB; CLK, global signals and reconfiguration bits are distributed to every LE; the MB has 8 outputs]

Logic Element and Interconnect
- An LE implements a computation and contains:
  - SRAM cells
  - An m-input look-up table (LUT)
  - A flip-flop
  - A pass transistor
- Interconnect
  - Mixed wire segment scheme: 25%, 50% and 25% distribution for length-1, length-4 and long wires
  - Direct links from one LB to its 4 neighbors
[Figure: (a) LE with an m-input LUT, DFF and CLK; SMB interconnect with length-1 wires (64 tracks), length-4 wires (128 tracks), long wires (64 tracks) and direct links (128 tracks)]

Support for Reconfiguration
[Figure: NRAM structure with bit line decoder, word line decoder, read voltage, electrode, SRAM cell and pulldown resistor]
- Reconfiguration time is short: 160 ps
- Area overhead of NRAMs
  - k: number of reconfiguration sets per NRAM; assume k = 16
  - Area overhead: 20.5% per LB, assuming 100 nm technology for CMOS logic and nanotube length
  - Logic density = k (configuration copies) x area per configuration = 16 x (1 - 0.205) ≈ 12.7
  - Appropriate value for k obtained through design space exploration

Temporal Logic Folding
- Basic idea: one can use NRAM-enabled run-time reconfiguration to realize different Boolean functions in the same logic element (LE) every few cycles
[Figure: a three-LUT circuit folded onto one LE whose configurations are stored in NRAM]
- Cycle 1: the LE is configured as LUT 1 (inputs a, b, c): i = abc’
- Cycle 2: the LE is configured as LUT 2 (inputs i, e, f, h): l = (i’ + e’ + f’)h’
- Cycle 3: the LE is configured as LUT 3 (inputs d, g, l): OUT = d’g’ + l

Example
- Without logic folding
  - Number of LEs = 6
  - Delay = 4 LE delays + interconnect delay
- With logic folding
  - Number of LEs = 2
  - Delay = 4 x clock period
  - Clock period = LE delay + reconfiguration delay + interconnect delay
[Figure: a four-level LUT network with inputs x0-x3 and y0-y3 and intermediate signals a0, b0, c0, mapped onto LE1-LE6 without folding and onto LE1-LE2, reconfigured between levels, with folding]

Folding Levels
- Logic folding can be performed at different levels of granularity, providing flexibility to perform area-performance trade-offs
- A level-p folding implies reconfiguration of the LE after the execution of p LUT computations
[Figure: a LUT network mapped onto macroblocks with a reconfiguration between stages: (a) level-1 folding, (b) level-2 folding]

Choosing the Folding Level
- As the folding level increases:
  - Clock period increases: routing delay increases
  - Number of clock cycles decreases
  - Reconfiguration time decreases
  - Number of LEs increases
  - Total delay typically decreases
  - Area increases
- Advantages of logic folding
  - Significant flexibility for performing area-performance trade-offs
  - Ability to map much larger circuits using the same number of LEs
  - Significant improvement in the area/circuit delay product
  - Reduction in the need for global routing

Experimental Setup
- Instance of architecture: 4 MBs in an SMB, 4 LEs in an MB, and each LE contains a 4-input LUT
- Number of reconfiguration copies k varied in order to compare implementations corresponding to selected folding levels: level-1, level-2, level-4 and no logic folding
- Results based on 100 nm CMOS technology parameters

Experimental Results
[Figure: two bar charts, "#LEs * Delay for different folding levels" and "Delay (ns) for different folding levels", both normalized to level-1, with series Level-1, Level-2, Level-4 and No-folding for the benchmarks alu2, 9symml, ldd, lal, cordic, poler8, cc, z4ml, cm163a, sct and pm1]
- Average area-time product advantage = 2X
- Maximum area-time product advantage = 3X

Experimental Results (Contd.)
- 16-RCA: 16-bit ripple carry adder
- 16-CLA: 16-bit carry lookahead adder
- 16-CSA: 16-bit carry select adder
- 8-MUL: 8-bit multiplier
[Figure: two bar charts, "#LEs * Delay for different folding levels" and "Delay (ns) for different folding levels", both normalized to level-1, with series Level-1, Level-2, Level-4 and No-folding for 16/32/64-RCA, 16/32/64-CLA, 16/32/64-CSA and 8/16/32-MUL]
- Average area-time product advantage = 13X
- Maximum area-time product advantage = 35X

Experimental Results (Contd.)
- Flexibility in performing area-performance trade-offs
- For the area-time (AT) product, the larger the circuit depth, the greater the advantage of level-1 folding relative to no folding
  - For the 64-bit ripple-carry adder, this advantage is about 35X
- LE utilization and logic density are very high, with a reduced need for a deep interconnect hierarchy

Conclusions
- NATURE: a novel high-performance run-time reconfigurable architecture
- Introduction of NRAMs into the architecture enables cycle-by-cycle reconfiguration and logic folding
- Choice of different folding levels allows the flexibility of performing area-performance trade-offs
- Logic density and area-time product improved significantly
- Can be very useful for cost-conscious embedded systems and future FPGA improvement
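
Illustration: temporal logic folding on a single LE

The three-cycle example from the Temporal Logic Folding slide (i = abc’, l = (i’ + e’ + f’)h’, OUT = d’g’ + l) can be mimicked in a few lines of Python. This is only a behavioral sketch under stated assumptions, not the authors' implementation: the class name FoldedLE, the CONFIGS list (standing in for the configuration sets held in NRAM) and the helper functions are hypothetical, and timing, routing and the 160 ps reconfiguration delay are not modeled. It assumes one flip-flop is enough to carry each intermediate result into the next cycle, which holds here because every LUT only needs the result of the cycle just before it.

```python
from itertools import product


class FoldedLE:
    """Toy model of one logic element reused across cycles (hypothetical)."""

    def __init__(self, configs):
        self.configs = configs  # stand-in for the NRAM configuration sets
        self.ff = False         # flip-flop holding the previous cycle's result

    def run(self, x):
        """Apply one configuration per cycle; x maps input names to bools."""
        self.ff = False
        for configure in self.configs:
            # "Reconfigure" the LUT, evaluate it, and latch the result.
            self.ff = configure(self.ff, x)
        return self.ff


# One entry per cycle; `prev` is the value latched in the flip-flop.
CONFIGS = [
    lambda prev, x: x["a"] and x["b"] and not x["c"],                      # i
    lambda prev, x: (not prev or not x["e"] or not x["f"]) and not x["h"],  # l
    lambda prev, x: (not x["d"] and not x["g"]) or prev,                   # OUT
]


def flat(x):
    """Reference: the same circuit built from three separate LUTs."""
    i = x["a"] and x["b"] and not x["c"]
    l = (not i or not x["e"] or not x["f"]) and not x["h"]
    return (not x["d"] and not x["g"]) or l


def check_equivalence():
    """Compare folded and unfolded results over all 2**8 input vectors."""
    names = "abcdefgh"
    for bits in product([False, True], repeat=len(names)):
        x = dict(zip(names, bits))
        if FoldedLE(CONFIGS).run(x) != flat(x):
            return False
    return True


if __name__ == "__main__":
    print(check_equivalence())
```

The folded version uses one LE for three cycles where the unfolded version uses three LUTs for one pass, which is the area-for-time exchange that the folding level controls.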