Design and Demonstration of Micro-Electro-Mechanical Relay Multipliers
Hossein Fariborzi (1), Fred Chen (1), Rhesa Nathanael (2), Jaeseok Jeon (2), Tsu-Jae King Liu (2), and Vladimir Stojanovic (1)
(1) Massachusetts Institute of Technology, Cambridge, MA, USA
(2) University of California, Berkeley, CA, USA
RQE Meeting 5/26/2011

CMOS Scaling and Power Consumption
[Figure: number of cores; labels include "Intel 48 core" and "Xeon"]
• Since 2000, VDD and VT have not scaled well
  - Power/area increases with each new process generation
• Parallelism allows slower, more efficient units
  - Helps improve performance for fixed power
  - Enables lower power for the same throughput

Subthreshold Leakage: Game-Over for CMOS
[Figure, left: normalized energy/op vs. VDD (V) from 0.1 to 0.5, with curves Etotal, Edynamic, and Eleakage]
[Figure, right: normalized energy/op vs. 1/throughput for 1x, 2x, and 8x parallelism; annotation: "More parallelism does not help"]
• Leakage and sub-threshold slope define minimum energy/op for CMOS
• Parallelism cannot reduce energy/op if already operating at minimum energy
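
The minimum-energy point named on this slide can be illustrated with a toy model. The sketch below is not from the talk. It assumes dynamic energy proportional to VDD squared and a leakage energy per operation that grows as exp(-VDD / n_vth) relative to dynamic energy, because gate delay stretches exponentially in the subthreshold region. The parameters `leak_factor`, `n_vth`, and `c_eff` are hypothetical and chosen only so the minimum falls inside the sweep.

```python
import math

def cmos_energy_per_op(vdd, leak_factor=1e4, n_vth=0.04, c_eff=1.0):
    """Toy model: E_total = E_dynamic + E_leakage (normalized units).

    Assumptions (not from the talk): the on/off current ratio grows as
    exp(vdd / n_vth), so leakage energy per operation scales as
    leak_factor * exp(-vdd / n_vth) times the dynamic energy.
    """
    e_dynamic = c_eff * vdd ** 2
    e_leakage = e_dynamic * leak_factor * math.exp(-vdd / n_vth)
    return e_dynamic, e_leakage, e_dynamic + e_leakage

def min_energy_point(v_lo=0.2, v_hi=1.0, steps=801):
    """Sweep VDD and return (vdd, e_total) at the lowest total energy."""
    best = None
    for k in range(steps):
        vdd = v_lo + (v_hi - v_lo) * k / (steps - 1)
        e_total = cmos_energy_per_op(vdd)[2]
        if best is None or e_total < best[1]:
            best = (vdd, e_total)
    return best

vdd_min, e_min = min_energy_point()
print(f"minimum-energy VDD is about {vdd_min:.2f} V in this toy model")
```

Setting `leak_factor=0` in this model removes the leakage term, so total energy keeps falling with VDD, which is the zero-leakage case the talk argues relays offer.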

MEM Relays Offer Zero Leakage
[Figure, left: measured MEM relay I-V curve]
[Figure, right: MEM relay energy vs. VDD, where Etotal = Edynamic]
• MEM relays show zero leakage and a sharp sub-threshold slope
• Could enable reduced VDD and E/op with scaling
R. Nathanael et al., IEDM 2009

MEM Relay Structure and Operation
[Figure: relay structure with labeled parts: tungsten channel, tungsten body, poly-SiGe gate, poly-SiGe anchor, poly-SiGe beam/flexure, tungsten source/drain]
• OFF switch: |Vgb| < Vpo (pull-out voltage)
• ON switch: |Vgb| > Vpi (pull-in voltage)
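
The two switching conditions can be captured in a small behavioral model. The sketch below is illustrative only. The class name `Relay` and the threshold values are hypothetical, and it assumes the relay holds its previous state when |Vgb| lies between Vpo and Vpi, which the slide does not state explicitly.

```python
class Relay:
    """Idealized relay switch driven by the gate-to-body voltage."""

    def __init__(self, v_pi=1.0, v_po=0.6):
        # Hypothetical thresholds; pull-out is assumed to lie below pull-in.
        if not 0 <= v_po < v_pi:
            raise ValueError("expected 0 <= v_po < v_pi")
        self.v_pi = v_pi
        self.v_po = v_po
        self.on = False

    def apply(self, v_gate, v_body):
        """Update and return the switch state for the given terminal voltages."""
        v_gb = abs(v_gate - v_body)   # only the magnitude of Vgb matters
        if v_gb > self.v_pi:
            self.on = True            # pulled in: source and drain connected
        elif v_gb < self.v_po:
            self.on = False           # pulled out: channel open
        # Between v_po and v_pi the previous state is kept (assumption).
        return self.on

relay = Relay()
states = [relay.apply(vg, 0.0) for vg in (0.0, 0.8, 1.2, 0.8, 0.5)]
print(states)
```

Because the model uses only |Vgb|, driving the body instead of the gate switches the relay in the same way.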

MEM Relay as a Logic Element
[Figure: relay symbol with terminals gate, source, body, and drain]
• 4-terminal design mimics MOSFET operation
• Electrostatic actuation is ambipolar
  - Non-inverting logic is possible
  - Actuation independent of source/drain voltages

Digital Circuit Design with MEM Relays
[Figure: the same logic function with inputs A0 to A3 and output "out", implemented in CMOS and with relays]
• CMOS: 4 gate delays, 18 transistors
• Relays: 1 mechanical delay, 8 relays
F. Chen et al., ICCAD08
[Figure: electrical vs. mechanical delay for MEM relays; delay (s) vs. stack length on log axes; curves for electrical and mechanical delay (current and predictive models), labeled VGB = VPI, 2VPI, 3VPI, and 5VPI]
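
A rough way to compare the two delay components is to model a stack of closed relays as a uniform RC ladder. The sketch below is an assumption, not the predictive model used in the talk. It uses the Elmore delay of an N-segment ladder, and the values of `T_MECH`, `R_ON`, and `C_NODE` are hypothetical.

```python
def elmore_stack_delay(stack_length, r_on, c_node):
    """Elmore delay of a uniform RC ladder: R * C * N * (N + 1) / 2."""
    return r_on * c_node * stack_length * (stack_length + 1) / 2

def crossover_stack_length(t_mech, r_on, c_node, max_length=10_000):
    """Smallest stack length whose electrical delay exceeds one mechanical delay."""
    for n in range(1, max_length + 1):
        if elmore_stack_delay(n, r_on, c_node) > t_mech:
            return n
    return None  # electrical delay never dominates within max_length

# Hypothetical values, chosen only for illustration.
T_MECH = 10e-9    # mechanical switching delay in seconds
R_ON = 1e3        # on-resistance per relay in ohms
C_NODE = 1e-15    # capacitance per internal node in farads

n_cross = crossover_stack_length(T_MECH, R_ON, C_NODE)
print(f"electrical delay exceeds one mechanical delay at stack length {n_cross}")
```

With these placeholder numbers the electrical delay stays below one mechanical delay for stacks far longer than the eight-relay example, which is the motivation for designs where all relays switch at once and the signal then propagates electrically.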

MEM Relay Multiplier: Basic Concept
          1 0 1 0 1 0    Multiplicand
  x           1 0 1 1    Multiplier
          1 0 1 0 1 0    Partial products (PP)
        1 0 1 0 1 0
      0 0 0 0 0 0
  + 1 0 1 0 1 0
    1 1 1 0 0 1 1 1 0    Result
• Design tradeoff: Number of stages (i.e., mechanical delays) vs. area
• Example: 32 bit relay multiplier
  - CMOS style design: 19 mech. delay, 26k relays
  - Optimized relay designs: 5-6 mech. delay, 11k-20k relays
• 2 possible scenarios
  - Using full- and half-adders along the columns
  - Using larger compressors along the columns to achieve higher compression
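
The worked example on this slide can be checked in a few lines. The sketch below is illustrative. The function names are hypothetical, and it models only the arithmetic of partial-product generation and column stacking, not any relay circuit.

```python
def partial_products(multiplicand, multiplier, multiplier_bits):
    """One shifted copy of the multiplicand (or zero) per multiplier bit, LSB first."""
    return [
        (multiplicand << i) if (multiplier >> i) & 1 else 0
        for i in range(multiplier_bits)
    ]

def column_heights(multiplicand_bits, multiplier_bits):
    """Number of partial-product bits stacked in each result column, LSB first."""
    heights = [0] * (multiplicand_bits + multiplier_bits - 1)
    for i in range(multiplier_bits):
        for j in range(multiplicand_bits):
            heights[i + j] += 1
    return heights

pps = partial_products(0b101010, 0b1011, 4)
result = sum(pps)
print([format(pp, "b") for pp in pps], format(result, "b"))
print(column_heights(6, 6))
```

The column heights are what the adders or compressors along each column have to reduce, so the tallest column sets how much compression is needed.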

MEM Relay Multiplier: 6-bit Example
[Figure: multiplier and multiplicand feed a partial product generation matrix, which is reduced column by column to the 12-bit multiplication result (LSB marked); legend: mech. propagation, elec. propagation, mech. delay, FA, HA, (N:3) compressor; compressor labels 4, 5, 6, 7, 6, 5]
• Approach 1: Full- and half-adders
  - 3 mech. delay
  - 430 relays
• Approach 2: Large compressors
  - 2 mech. delay
  - 595 relays

(7:3) Compressor: CMOS Pass-gate Style Design
Truth table: $Y_2 Y_1 Y_0 = \sum_{i=0}^{N-1} A_i$
[Figure: (7:3) compressor, full Y2 sub-circuit in CMOS style (1), a pass-gate network on inputs A0 to A6]
• Relay implementation of CMOS pass-gate compressor:
  - No propagation path accumulating mech. delays in stages
  - 19 mech. delay for a 32 bit multiplier
(1) Song et al., JSSC 91

Optimized Relay Compressor Design: Y2 Sub-circuit
Truth table: $Y_2 Y_1 Y_0 = \sum_{i=0}^{N-1} A_i$
[Figure, built up in steps: (5:3) compressor propagation paths for Y2; full Y2 of the (5:3) compressor with kill and generate paths; full Y2 of the (7:3) compressor on inputs A0 to A6]
• Optimized relay design: 0 mech. delay from A0 to Y2
• 5 mech. delay for a 32 bit multiplier → 27% of CMOS style design

Optimized Relay Compressor Design: Y1 Sub-circuit
Truth table: $Y_2 Y_1 Y_0 = \sum_{i=0}^{N-1} A_i$
[Figure, built up in steps: (5:3) compressor propagation paths for Y1; full Y1 of the (5:3) compressor with kill and generate paths; full Y1 of the (7:3) compressor on inputs A0 to A6]

Optimized Relay Compressor Design: Y0 Sub-circuit
Truth table: $Y_2 Y_1 Y_0 = \sum_{i=0}^{N-1} A_i$
[Figure, left: two-input XOR/XNOR gate computing A0 ⊕ A1]
[Figure, right: full Y0 of the (7:3) compressor on inputs A0 to A6]
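
The truth table used across the compressor slides says that the three outputs, read as a binary number, equal the count of inputs that are 1. The sketch below is a behavioral check only and says nothing about the relay topology. The function names are hypothetical. It also confirms that Y0, the least significant output, is the XOR of all inputs.

```python
from itertools import product

def compressor(bits):
    """(N:3) compressor for N <= 7: return (Y2, Y1, Y0) encoding the number of 1s."""
    if len(bits) > 7:
        raise ValueError("three outputs can count at most 7 inputs")
    total = sum(bits)
    return (total >> 2) & 1, (total >> 1) & 1, total & 1

def xor_chain(bits):
    """Parity of the inputs, computed as a chain of two-input XORs."""
    y0 = 0
    for bit in bits:
        y0 ^= bit
    return y0

# Exhaustive check over all 2**7 input codes of a (7:3) compressor.
for bits in product((0, 1), repeat=7):
    y2, y1, y0 = compressor(bits)
    assert 4 * y2 + 2 * y1 + y0 == sum(bits)
    assert y0 == xor_chain(bits)
print("all 128 input codes agree")
```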

MEM Relay Multiplier: Booth Encoding
• Radix-4 Booth encoding
  - Halves the number of PPs
  - Adds one mech. delay to PP generation
  - Reduces area and complexity for larger multipliers
Booth Encoding Table
Multiplier bits block (N2i+1 N2i N2i-1)    Partial product
000                                         0
001                                         1*Multiplicand
010                                         1*Multiplicand
011                                         2*Multiplicand
100                                         -2*Multiplicand
101                                         -1*Multiplicand
110                                         -1*Multiplicand
111                                         0
[Figure, left: simple AND network generating PPij from Ni and Mj]
[Figure, right: radix-4 Booth encoded PP generation from N2i+1, N2i, N2i-1, Mj, and Mj-1]
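
The encoding table can be exercised directly. The sketch below is a behavioral illustration with hypothetical function names. It assumes the multiplier is a two's-complement number of even width, with an implicit 0 to the right of the least significant bit, which is the usual radix-4 Booth convention. An unsigned multiplier needs leading zero bits, so the 4-bit value 1011 is passed with a width of 6.

```python
# (N2i+1, N2i, N2i-1) -> multiple of the multiplicand, as in the table above.
BOOTH_DIGIT = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}

def booth_digits(multiplier, width):
    """Radix-4 Booth digits, least significant first, for a signed multiplier."""
    if width % 2:
        raise ValueError("width must be even")
    padded = (multiplier & ((1 << width) - 1)) << 1   # append the implicit 0 bit
    digits = []
    for i in range(width // 2):
        block = (padded >> (2 * i)) & 0b111           # overlapping 3-bit block
        key = ((block >> 2) & 1, (block >> 1) & 1, block & 1)
        digits.append(BOOTH_DIGIT[key])
    return digits

def booth_multiply(multiplicand, multiplier, width):
    """Sum the Booth partial products; digit i carries weight 4**i."""
    return sum(d * multiplicand * 4 ** i
               for i, d in enumerate(booth_digits(multiplier, width)))

digits = booth_digits(0b1011, 6)
product_value = booth_multiply(0b101010, 0b1011, 6)
print(digits, product_value)
```

Six multiplier bits yield three Booth digits, so the same product is formed from half as many partial products as the AND-network approach.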

MEM-Relay Multiplier Delay/Area Tradeoffs
Delay and relay count comparison
Multiplier                                   8 bit       16 bit      32 bit      64 bit
                                             A     B     A     B     A     B     A     B
Half- and full-adders           Mech. delay  4     5     5     6     6     7     7     8
                                Relay count  1.1k  1.4k  3.5k  3.4k  12k   9k    29k   19k
(7:3) compressor (and smaller)  Mech. delay  4     4     4     5     5     5     5     6
                                Relay count  1.4k  1.5k  5.5k  4.4k  26k   15k   57k   30k
(A) AND network PP generation (B) Booth-enabled PP generation
• Smaller compressors offer area efficiency but make the multiplier slower
• Larger compressors offer higher performance, especially for big multipliers
• Booth encoding reduces the area for larger multipliers

(7:3) Compressor in MEM-Relay Test Chip
[Figure: test chip photo and relay schematics of the Y2, Y1, and Y0 sub-circuits on inputs A0 to A6; relay terminals labeled D, G, B, S; dimension labels 150µm and 9mm]

(7:3) Relay Compressor Test Results
[Figure, top: (7:3) compressor sub-circuits output; input code and Y0 (V), Y1 (V), Y2 (V) vs. time (s)]
[Figure, bottom: (7:3) compressor input vs. output code; output code vs. input code; series: correct result, experiment]
• Largest working MEM-Relay circuit reported to date (98 relays)

16 bit Multiplier Delay/Energy Tradeoffs
[Figure: energy/op (fJ) vs. delay (ns) on log axes; series: scaled MEM relay, CMOS OTCT (90nm), CMOS Dadda/HC (45nm); annotations include "~5-10x", "16X Parallel", and VDD values 0.5V and 1V]
• Relay simulations: Predictive model of 90nm-equivalent relay
• CMOS simulations: Voltage scaling in 0.7-1.3V range
  - Dadda/HC: placed-and-routed using Nangate 45nm Standard Cell Library
  - OTCT: Intel's Optimally Tiled Compressor Tree (Hsu et al., JSSC 2006)
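
The "16X Parallel" annotation reflects the usual throughput arithmetic: N units working in parallel can each be N times slower and still deliver the same operations per second. The sketch below is a back-of-the-envelope helper with a hypothetical name. It uses the ~10 MOPS range cited in the talk's conclusions as an example target and ignores any overhead for distributing work across units.

```python
def per_unit_delay_budget_ns(target_mops, parallel_units):
    """Longest per-operation delay (ns) each unit may take and still meet the target.

    target_mops is the required throughput in millions of operations per second.
    """
    if target_mops <= 0 or parallel_units < 1:
        raise ValueError("need a positive throughput and at least one unit")
    # One unit at 1 MOPS has a 1000 ns budget; the budget scales with unit count.
    return parallel_units * 1000.0 / target_mops

single = per_unit_delay_budget_ns(10, 1)
sixteen = per_unit_delay_budget_ns(10, 16)
print(f"1 unit: {single:.0f} ns per op, 16 units: {sixteen:.0f} ns per op")
```

Since relays have no leakage energy in the talk's model, slowing each unit down this way does not add energy per operation, only area.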

Conclusions
• MEM relays offer zero leakage and the potential for a lower minimum E/op than CMOS
• With circuit optimization techniques, relay energy benefits extend to most complex arithmetic blocks
  - Relay multiplier topologies trade off area and delay
  - Demonstrated largest functional relay circuit to date
• Next steps: scaling and improving device design
  - Realize the predicted ~5-10x energy-efficiency benefits over CMOS in ~10 MOPS performance range

Acknowledgements
• Relay team:
  - Matthew Spencer, Elad Alon, Hei Kam, Vincent Pott, Chencheng Wang, Kevin Dwan and Dejan Marković
• DARPA NEMS program
• BWRC Sponsors
• MIT CICS
• Focus Center Research Program (C2S2, MSD)
• NSF Infrastructure Grant #0403427