Design and Demonstration of Micro-Electro-Mechanical Relay Multipliers

Hossein Fariborzi (1), Fred Chen (1), Rhesa Nathanael (2), Jaeseok Jeon (2), Tsu-Jae King Liu (2), and Vladimir Stojanovic (1)
(1) Massachusetts Institute of Technology, Cambridge, MA, USA
(2) University of California, Berkeley, CA, USA
RQE Meeting, 5/26/2011

CMOS Scaling and Power Consumption
- Since 2000, VDD and VT have not scaled well
- Power/area increases with each new process generation
- Parallelism allows slower, more efficient units
  - Helps improve performance for fixed power
  - Enables lower power for the same throughput
[Figure: number of cores; labels: Intel 48 core, Xeon]

Subthreshold Leakage: Game-Over for CMOS
- Leakage and sub-threshold slope define minimum energy/op for CMOS
- Parallelism cannot reduce energy/op if already operating at minimum energy
[Figure: normalized energy/op vs. VDD (V); curves: Etotal, Edynamic, Eleakage]
[Figure: normalized energy/op vs. 1/throughput for 1x, 2x, and 8x parallelism; annotation: "More parallelism does not help"]

MEM Relays Offer Zero Leakage
- MEM relays show zero leakage and sharp sub-threshold slope
- Could potentially enable reduced VDD and E/op with scaling
[Figure: measured MEM relay I-V curve]
[Figure: MEM relay energy vs. VDD; Etotal = Edynamic]
R. Nathanael et al., IEDM 2009

MEM Relay Structure and Operation
- OFF switch: |Vgb| < Vpo (pull-out voltage)
- ON switch: |Vgb| > Vpi (pull-in voltage)
[Figure: relay structure; labels: tungsten channel, tungsten body, poly-SiGe gate, poly-SiGe anchor, poly-SiGe beam/flexure, tungsten source/drain]

MEM Relay as a Logic Element
- 4-terminal design (gate, source, body, drain) mimics MOSFET operation
- Electrostatic actuation is ambipolar
  - Non-inverting logic is possible
- Actuation independent of source/drain voltages

Digital Circuit Design with MEM Relays
- CMOS: 4 gate delays, 18 transistors
- Relays: 1 mechanical delay, 8 relays
[Figure: CMOS and relay circuits with inputs A0 to A3 and output "out"]
F. Chen et al., ICCAD08
[Figure: electrical vs. mechanical delay for MEM relays; delay (s) vs. stack length; curves for VGB = VPI, 2VPI, 3VPI, and 5VPI; legend: Elec. delay (current), Mech. delay (current), Elec. delay (predictive model), Mech. delay (predictive model)]

MEM Relay Multiplier: Basic Concept

           101010      Multiplicand
      x      1011      Multiplier
           101010      Partial products (PP)
          101010
         000000
      + 101010
        111001110      Result

- Multiplier design tradeoff: number of stages (i.e., mechanical delays) vs. area
- Example: 32-bit relay multiplier
  - CMOS-style design: 19 mech. delays, 26k relays
  - Optimized relay designs: 5-6 mech. delays, 11k-20k relays
- 2 possible scenarios
  - Using full- and half-adders along the columns
  - Using larger compressors along the columns to achieve higher compression

MEM Relay Multiplier: 6-bit Example
- Approach 1: full- and half-adders: 3 mech. delays, 430 relays
- Approach 2: large compressors: 2 mech. delays, 595 relays
[Figure: for each approach, the partial product generation matrix (multiplier by multiplicand) feeding the column compressors that produce the 12-bit multiplication result; labels: FA, HA, (N:3) compressor, LSB, mech. delay, mech. propagation, elec. propagation; column values N: 4, 5, 6, 7, 6, 5]

(7:3) Compressor: CMOS Pass-gate Style Design
Truth table: $Y_2 Y_1 Y_0 = \sum_{i=0}^{N-1} A_i$ (the binary output code counts the inputs that are 1)
[Figure: (7:3) compressor, full Y2 sub-circuit, CMOS style (1), with inputs A0 to A6]
- Relay implementation of CMOS pass-gate compressor:
  - No propagation path; mechanical delays accumulate in stages
  - 19 mech. delays for a 32-bit multiplier
(1) Song et al., JSSC '91

Optimized Relay Compressor Design: Y2 Sub-circuit
Truth table: $Y_2 Y_1 Y_0 = \sum_{i=0}^{N-1} A_i$
[Figure: (5:3) compressor propagation paths for Y2; (5:3) compressor full Y2 with kill and generate branches; (7:3) compressor full Y2]
- Optimized relay design:
  - 0 mech. delay from A0 to Y2
  - 5 mech. delays for a 32-bit multiplier
  - 27% of CMOS-style design

Optimized Relay Compressor Design: Y1 Sub-circuit
Truth table: $Y_2 Y_1 Y_0 = \sum_{i=0}^{N-1} A_i$
[Figure: (5:3) compressor propagation paths for Y1; (5:3) compressor full Y1 with kill and generate branches; (7:3) compressor full Y1]

Optimized Relay Compressor Design: Y0 Sub-circuit
Truth table: $Y_2 Y_1 Y_0 = \sum_{i=0}^{N-1} A_i$
[Figure: two-input XOR/XNOR gate; (7:3) compressor full Y0]

MEM Relay Multiplier: Booth Encoding
- Radix-4 Booth encoding
  - Halves the number of PPs
  - Adds one mech. delay to PP generation
  - Reduces area and complexity for larger multipliers

Booth Encoding Table
    Multiplier bits block
    N(2i+1) N(2i) N(2i-1)      Partial product
    000                         0
    001                        +1 * Multiplicand
    010                        +1 * Multiplicand
    011                        +2 * Multiplicand
    100                        -2 * Multiplicand
    101                        -1 * Multiplicand
    110                        -1 * Multiplicand
    111                         0

[Figure: simple AND network generating PPij from Ni and Mj]
[Figure: radix-4 Booth encoded network generating PPij from N(2i-1), N(2i), N(2i+1), Mj, and Mj-1]

MEM-Relay Multiplier Delay/Area Tradeoffs
Delay and relay count comparison
(A) AND network PP generation; (B) Booth-enabled PP generation

                                       8 bit          16 bit         32 bit         64 bit
    Multiplier                         A      B       A      B       A      B       A      B
    Half- and full-adders
      Mech. delay                      4      5       5      6       6      7       7      8
      Relay count                      1.1k   1.4k    3.5k   3.4k    12k    9k      29k    19k
    (7:3) compressor (and smaller)
      Mech. delay                      4      4       4      5       5      5       5      6
      Relay count                      1.4k   1.5k    5.5k   4.4k    26k    15k     57k    30k

- Smaller compressors offer area efficiency but make the multiplier slower
- Larger compressors offer higher performance, especially for big multipliers
- Booth encoding reduces the area for larger multipliers

(7:3) Compressor in MEM-Relay Test Chip
[Figure: test chip layout of the (7:3) compressor with the Y2, Y1, and Y0 sub-circuits and inputs A0 to A6; relay terminals labeled D, G, B, S; scale labels: 150 µm, 9 mm]

(7:3) Relay Compressor Test Results
[Figure: (7:3) compressor sub-circuit outputs; input code and Y0, Y1, Y2 (V) vs. time (s)]
[Figure: (7:3) compressor input code vs. output code; correct result vs. experiment]
- Largest working MEM-relay circuit reported to date (98 relays)

16-bit Multiplier Delay/Energy Tradeoffs
[Figure: energy/op (fJ) vs. delay (ns); curves: scaled MEM relay, CMOS OTCT (90nm), CMOS Dadda/HC (45nm); annotations: "~5-10x", "16X Parallel", VDD scaled between 0.5V and 1V]
- Relay simulations: predictive model of 90nm-equivalent relay
- CMOS simulations: voltage scaling in 0.7-1.3V range
  - Dadda/HC: placed-and-routed using Nangate 45nm Standard Cell Library
  - OTCT: Intel's Optimally Tiled Compressor Tree (Hsu et al., JSSC 2006)

Conclusions
- MEM relays offer zero leakage and the potential for lower minimum E/op than CMOS
- With circuit optimization techniques, relay energy benefits extend to most complex arithmetic blocks
  - Relay multiplier topologies trade off area and delay
- Demonstrated largest functional relay circuit to date
- Next steps: scaling and improving device design
  - Realize the predicted ~5-10x energy-efficiency benefits over CMOS in the ~10 MOPS performance range

Acknowledgements
- Relay team: Matthew Spencer, Elad Alon, Hei Kam, Vincent Pott, Chencheng Wang, Kevin Dwan, and Dejan Marković
- DARPA NEMS program
- BWRC Sponsors
- MIT CICS
- Focus Center Research Program (C2S2, MSD)
- NSF Infrastructure Grant #0403427
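
Supplementary sketch for the MEM Relay Structure and Operation slide. The switching rule there (ON when |Vgb| > Vpi, OFF when |Vgb| < Vpo) can be modeled in a few lines of Python. This is only an idealized behavioral model, not the authors' device model: the class name `Relay`, the voltage values, and the assumption that the relay holds its previous state when |Vgb| lies between Vpo and Vpi are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class Relay:
    """Idealized 4-terminal relay (hypothetical model for illustration)."""
    v_pi: float       # pull-in voltage (illustrative value supplied by caller)
    v_po: float       # pull-out voltage, assumed lower than v_pi
    on: bool = False  # current switch state

    def apply(self, v_gate: float, v_body: float) -> bool:
        """Update and return the switch state for a given gate and body voltage."""
        v_gb = abs(v_gate - v_body)  # ambipolar actuation: only |Vgb| matters
        if v_gb > self.v_pi:
            self.on = True           # ON switch: |Vgb| > Vpi
        elif v_gb < self.v_po:
            self.on = False          # OFF switch: |Vgb| < Vpo
        # Assumption: between Vpo and Vpi the previous state is held.
        return self.on


relay = Relay(v_pi=1.0, v_po=0.6)
states = [relay.apply(vg, 0.0) for vg in (0.0, 0.8, 1.2, 0.8, 0.5, -1.2)]
print(states)  # [False, False, True, True, False, True]
```

The last sample (a gate voltage of -1.2) turns the relay on again, which illustrates the ambipolar actuation noted on the MEM Relay as a Logic Element slide. Source and drain voltages do not appear in the model because actuation is independent of them.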
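
Supplementary sketch for the compressor slides. The (7:3) compressor truth table states that the output code Y2 Y1 Y0 equals the number of inputs that are 1. The Python below is a purely functional check of that definition, assuming nothing about the relay topology: the function name `compressor_7_3` and the Boolean conditions used for Y2 and Y1 are illustrative, and only Y0 as a chain of two-input XORs follows the Y0 sub-circuit slide.

```python
from functools import reduce
from itertools import product
from operator import xor


def compressor_7_3(bits):
    """Return (y2, y1, y0) with 4*y2 + 2*y1 + y0 == number of 1s in bits."""
    assert len(bits) == 7 and all(b in (0, 1) for b in bits)
    ones = sum(bits)
    y0 = reduce(xor, bits)          # parity: a chain of two-input XORs
    y1 = 1 if ones % 4 >= 2 else 0  # weight-2 output: counts 2, 3, 6, 7
    y2 = 1 if ones >= 4 else 0      # weight-4 output: counts 4 to 7
    return y2, y1, y0


# Exhaustive check over all 128 input codes against the truth-table definition.
all_ok = all(
    4 * y2 + 2 * y1 + y0 == sum(bits)
    for bits in product((0, 1), repeat=7)
    for (y2, y1, y0) in [compressor_7_3(bits)]
)
print(all_ok)  # True
```

The exhaustive sweep covers the same 128 input codes as the input-versus-output code plot on the test results slide.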
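
Supplementary sketch for the Booth Encoding slide. The radix-4 Booth table can be exercised in software to see how it halves the number of partial products. The Python below is a minimal sketch, not the relay implementation: the function name `booth_partial_products`, the two's-complement interpretation of the multiplier, and the convention N(-1) = 0 are assumptions made for illustration.

```python
# Booth table from the slide: (N[2i+1], N[2i], N[2i-1]) -> multiple of the multiplicand.
BOOTH = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_partial_products(multiplicand, multiplier, n_bits):
    """Radix-4 Booth partial products for an n_bits two's-complement multiplier."""
    assert n_bits % 2 == 0, "assumes an even multiplier width"
    assert -(1 << (n_bits - 1)) <= multiplier < (1 << (n_bits - 1))

    def bit(k):
        # Assumption: N[-1] = 0. Python's >> sign-extends negative integers.
        return 0 if k < 0 else (multiplier >> k) & 1

    pps = []
    for i in range(n_bits // 2):
        block = (bit(2 * i + 1), bit(2 * i), bit(2 * i - 1))
        # Each partial product carries weight 4**i, i.e. a left shift by 2*i.
        pps.append((BOOTH[block] * multiplicand) << (2 * i))
    return pps


# Operands from the Basic Concept slide: 101010 (42) times 1011 (11),
# with the multiplier zero-extended to 6 bits.
pps = booth_partial_products(42, 11, 6)
print(pps, sum(pps))  # [-42, -168, 672] 462
```

With the multiplier held in 6 bits, three Booth partial products replace six AND-network rows, and their sum 462 matches the result 111001110 on the Basic Concept slide.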