Building Modern Integrated Systems:
A Cross-cut Approach
(The Electrical, The Optical and The Mechanical)
Vladimir Stojanović
Integrated Systems Group
Massachusetts Institute of Technology
Chip design is going through a change


Already have more devices than we can use at once
Limited by power density and bandwidth
Intel Network Processor (IXP2800): 1 GPP core, 16 ASPs (128 threads)
Sun Niagara: 8 GPP cores (32 threads)
IBM Cell: 1 GPP (2 threads), 8 ASPs
Picochip DSP: 1 GPP core, 248 ASPs
Cisco CSR-1: 188 Tensilica GPPs
Intel 4004 (1971): 4-bit processor, 2312 transistors, ~100 KIPS, 10 micron PMOS, 11 mm2 chip
1000s of processor cores and accelerators per die
“The Processor is the new Transistor” [Rowen]
[Figure: block diagrams of the chips listed above, including the IXP2800 with its XScale core, 16 MEv2 engines, RDRAM and QDR SRAM interfaces; credit: Asanovic]
Subthreshold leakage: Game over for CMOS
[Figure: Energy/op vs. Vdd: normalized energy/cycle against Vdd (V), with curves Etotal, Edynamic and Eleak]
[Figure: Energy/op vs. 1/throughput: normalized energy/op against 1/throughput (ps/op) when scaling Vdd & VT]
CMOS circuits have well-defined minimum energy
- Caused by leakage and finite sub-threshold swing
- Need to balance leakage and active energy
- Limits energy-efficiency, regardless of how slow the circuit runs
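The minimum-energy behaviour can be illustrated with a small numerical sketch. It assumes a textbook subthreshold model that is not taken from the slides: dynamic energy scales as Vdd^2, and leakage energy per operation scales as Vdd^2 * K * exp(-Vdd / (n * vT)), because delay grows exponentially as Vdd drops while the off-current stays roughly constant. The constants `k_leak`, `n` and `v_thermal` are hypothetical values chosen only to make the trade-off visible.

```python
import math


def energy_per_op(vdd, k_leak=1000.0, n=1.5, v_thermal=0.026):
    """Return (dynamic, leakage, total) normalized energy per operation.

    Assumed model (not from the slides):
      E_dynamic ~ Vdd^2
      E_leak    ~ Vdd^2 * k_leak * exp(-Vdd / (n * v_thermal))
    k_leak lumps logic depth and the ratio of idle to switching gates.
    """
    e_dynamic = vdd ** 2
    e_leak = vdd ** 2 * k_leak * math.exp(-vdd / (n * v_thermal))
    return e_dynamic, e_leak, e_dynamic + e_leak


def minimum_energy_point(v_lo=0.05, v_hi=1.0, steps=951):
    """Grid-search the supply voltage that minimizes total energy per op."""
    best_v, best_e = None, float("inf")
    for i in range(steps):
        v = v_lo + (v_hi - v_lo) * i / (steps - 1)
        e_total = energy_per_op(v)[2]
        if e_total < best_e:
            best_v, best_e = v, e_total
    return best_v, best_e


v_min, e_min = minimum_energy_point()
print(f"minimum-energy Vdd ~ {v_min:.3f} V, normalized energy {e_min:.3f}")
```

Below the minimum, the exponentially growing delay lets leakage dominate; above it, the quadratic dynamic term dominates. Running slower than the minimum-energy point therefore costs energy rather than saving it.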
Wire and I/O scaling
[Figure: on-chip wires: copper resistivity]
[Figure: I/O, best electrical links: energy-cost [pJ/b] vs. data-rate [Gb/s] for Chip2Chip and Backplane channels; annotations: Loss ~20-25dB, Loss ~10dB]
- Increased wire resistivity makes wire caps scale very slowly
- Can’t get both energy-efficiency and high data-rate in I/O
Opportunity for integrated system design
Energy-efficient computation and communication
CMOS – need cross-cut approach to keep scaling performance
[Figure: cross-cut design layers, with an inset plot of energy/bit (pJ/bit) vs. data rate density (Gbps/um)]
- Network & µArchitecture
- Design Optimization
- Communications (Eq., Mod, Coding)
- Circuit modeling, Characterization
- Circuits & Logic: Tx, Rx, Ctrl, Meas
- Interconnect and switch technology: Cu, MOSFET
Manycore SOC roadmap fuels
bandwidth demand
64-tile system (64-256 cores)
- 4-way SIMD FMACs @ 2.5 – 5 GHz
- 5-10 TFlops on one chip
- Need 5-10 TB/s of off-chip I/O
- Even higher on-chip bandwidth
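As a rough check of these numbers, the sketch below recomputes peak throughput and the matching off-chip bandwidth. It assumes the upper end of the core count (256 cores), one 4-way SIMD FMAC unit per core, two floating-point operations per FMAC, and a balance of 1 byte of off-chip traffic per flop; the function names are illustrative only.

```python
def peak_tflops(cores, simd_width, clock_ghz, flops_per_fmac=2):
    """Peak TFlop/s for `cores` cores, each issuing one SIMD FMAC per cycle."""
    return cores * simd_width * flops_per_fmac * clock_ghz / 1000.0


def offchip_tbytes_per_s(tflops, bytes_per_flop=1.0):
    """Off-chip bandwidth in TB/s for an assumed bytes-per-flop balance."""
    return tflops * bytes_per_flop


low = peak_tflops(cores=256, simd_width=4, clock_ghz=2.5)
high = peak_tflops(cores=256, simd_width=4, clock_ghz=5.0)
print(f"{low:.2f} to {high:.2f} TFlop/s")
print(f"{offchip_tbytes_per_s(low):.2f} to {offchip_tbytes_per_s(high):.2f} TB/s off-chip")
```

Under these assumptions the result is about 5 to 10 TFlop/s and, at 1 byte/flop, about 5 to 10 TB/s of off-chip I/O, consistent with the figures on the slide.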
[Figure: 2 cm x 2 cm die; Intel 48 core -Xeon]
Cross-layer design approach
Build modeling tools for design-space exploration and vertical integration
[Figure: design stack and the models that connect its layers]
- Apps, OS, ISA
- Manycore hardware
- NoC topologies and NoC metrics (power vs. offered BW)
- Routers, NoC
- Link design parameters and link metrics (energy/bit in pJ/bit vs. data rate density in Gbps/um; equalized with 30 mV, 50 mV and 90 mV eye vs. repeated; wire width and space in um)
- Channel technologies (throughput density in Gbps/um)
[Figure: link block diagram: mux, registers, pre-driver, mod-driver, receiver front-end, samplers and monitoring, demux, PLL or optical clock with phase adjust]
Physical modeling – Equalized interconnects
Capturing the wire + circuit interactions
Huge design-space
Inputs
- Technology information: transistor (spice model); wire (metal conductance, dielectric constant, etc.)
- Circuit type: LCM | Inverter; WLCM, Vs, Vp
- Wire geometry: Wth, Sp; target wire length: l
- Link architecture: FFE, DFE tap numbers
- Target data rate density
Wire Model
- 2D field solver -> 2D RLGC matrices database -> RLGC parameters
Circuit Model
- Linearized RC switch extraction -> normalized R (Ohm-um), C (fF/um) switch model database -> R, C model for LCM & Inverter
Channel model
- Transfer function: T(f), Tc(f)
Link performance model
- Data rate density, latency, eye opening, sampling delay (Td)
- Equalization coefficient: w, y1
Link power model
- Energy-per-bit (Eb)
Kim and Stojanovic, ICCAD07, D&T 2008
Sredojevic and Stojanovic, ICCAD08
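To make the link performance model more concrete, here is a minimal sketch of how a feed-forward equalizer (FFE) coefficient affects eye opening. It is not the authors' tool: the sampled pulse response `channel`, the two-tap structure, and the peak-constrained tap normalization (|w0| + |w1| = 1) are all assumptions made for illustration.

```python
def convolve(a, b):
    """Discrete convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def worst_case_eye(pulse, cursor=0):
    """Worst-case vertical eye margin: main cursor minus the sum of |ISI|."""
    isi = sum(abs(s) for k, s in enumerate(pulse) if k != cursor)
    return pulse[cursor] - isi


def best_two_tap_ffe(channel, steps=100):
    """Sweep the post-cursor tap of a 2-tap FFE, w = [1 - alpha, -alpha]."""
    best_alpha, best_eye = 0.0, worst_case_eye(channel)
    for i in range(1, steps):
        alpha = 0.5 * i / steps  # keeps 0 < alpha < 0.5
        taps = [1.0 - alpha, -alpha]
        eye = worst_case_eye(convolve(channel, taps))
        if eye > best_eye:
            best_alpha, best_eye = alpha, eye
    return best_alpha, best_eye


# Hypothetical pulse response of an RC-dominated wire, sampled once per bit.
channel = [0.5, 0.3, 0.15, 0.05]
alpha, eye = best_two_tap_ffe(channel)
print(f"unequalized eye: {worst_case_eye(channel):.3f}")
print(f"best alpha: {alpha:.3f}, equalized eye: {eye:.3f}")
```

The unequalized eye of this example channel is essentially closed, while a single post-cursor tap reopens it at the cost of a smaller main cursor. A full model would add the DFE taps, the sampling delay Td and the energy cost of the drivers.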
Optimized on-chip links
[Figure: energy/bit (pJ/bit) vs. data rate density (Gbps/um) for a 10mm wire: equalized with 30mV, 50mV and 90mV eye vs. repeated]
[Figure: transmitter schematic: decoding block, weak driver and strong driver with amplitude control (A<4:0>, A<19:0>), transition signals P1_P, P2_P, N1_P, N2_P, bias currents, effective receiver admittance]
[Figure: voltage swing vs. distance, showing channel attenuation]
Energy-efficient digital pre-emphasis
- 90nm CMOS
- Nonlinear predistortion, mismatch robustness
Kim and Stojanovic, ISSCC09, JSSC June 2010
Optimized off-chip links
Oversampled discrete-time Rx equalizer
- No need for CDR (adaptive Eq only)
- Feedforward equalizer (FSE) and feedback equalizer (DFE)
- Sensors for adaptation
- FSE output eye opening @ 4Gbps
Digital Tx equalizer - energy-efficient
- Dynamic impedance modulation
- Driver linearization
- Sequence coding
- LUT 16 x 6b + 1b, thermometer code [63:0], 4-bit pattern-dependent code
[Figure: Tx output voltage (mV) vs. bit sequence, and static transfer curve]
[Figure: FSE output eye opening and single-tap output eye opening vs. delay between data and CLK (data cycle, UI), for several I1/I2 settings]
[Figure: four-way interleaved receiver (Way0 to Way3) with V2T/T2V stages, sampling phases ΦS1, ΦS2, ΦEVA, 7-bit reference DACs, CONF0 to CONF3 registers and scan chains with snapshot; equalized and unequalized eyes]
<1pJ/b @ 4Gb/s, 90nm CMOS
Sredojevic and Stojanovic, CICC10, JSSC Aug 2011
5b linear, 3pJ/b @ 6Gb/s, 90nm CMOS
Song and Stojanovic, VLSI09, JSSC May 2011
Bandwidth, pin count and power scaling
[Figure: package pin count over time, from 2,4 cores to 256 cores]
1 Byte/Flop
2 TFlop/s signal pins @ 20 Gb/s/link
Need 16k pins in 2017 for HPC*
*> half pins for power supply
Emerging devices can help
Energy-efficient computation and communication
CMOS – need cross-cut approach to keep scaling performance
Post-CMOS – need cross-cut approach to guide new devices/systems
[Figure: cross-cut design layers, with an inset plot of energy/bit (pJ/bit) vs. data rate density (Gbps/um)]
- Network & µArchitecture
- Design Optimization
- Communications (Eq., Mod, Coding)
- Circuit modeling, Characterization
- Circuits & Logic: Tx, Rx, Ctrl, Meas
- Interconnect and switch technology: Cu, MOSFET, Si-Photonics
CMOS photonics density and energy advantage

Metric                                              Energy (pJ/b)   Bandwidth density (Gb/s/μm)
Global on-chip photonic link                        0.1-0.25        160-320
Global on-chip optimally repeated electrical link   1               5
Off-chip photonic link (100 μm coupler pitch)       0.1-0.25        6-12
Off-chip electrical SERDES (100 μm pitch)           5               0.1

Assuming 128 x 10Gb/s wavelengths on each waveguide, and 20Gb/s electrical I/O
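The advantage implied by the table can be spelled out as simple ratios. The sketch below only re-divides the table entries; the dictionary layout and function name are illustrative, and ranges are treated as (low, high) pairs.

```python
# Table entries as (low, high) ranges; single values are repeated.
LINKS = {
    "on_chip_photonic": {"energy_pj_per_b": (0.1, 0.25), "density": (160.0, 320.0)},
    "on_chip_electrical": {"energy_pj_per_b": (1.0, 1.0), "density": (5.0, 5.0)},
    "off_chip_photonic": {"energy_pj_per_b": (0.1, 0.25), "density": (6.0, 12.0)},
    "off_chip_electrical": {"energy_pj_per_b": (5.0, 5.0), "density": (0.1, 0.1)},
}


def advantage(photonic, electrical):
    """Return (energy gain range, density gain range) of photonic over electrical."""
    p, e = LINKS[photonic], LINKS[electrical]
    energy_gain = (
        e["energy_pj_per_b"][0] / p["energy_pj_per_b"][1],  # worst case
        e["energy_pj_per_b"][1] / p["energy_pj_per_b"][0],  # best case
    )
    density_gain = (
        p["density"][0] / e["density"][1],
        p["density"][1] / e["density"][0],
    )
    return energy_gain, density_gain


on_chip = advantage("on_chip_photonic", "on_chip_electrical")
off_chip = advantage("off_chip_photonic", "off_chip_electrical")
print("on-chip  (energy gain, density gain):", on_chip)
print("off-chip (energy gain, density gain):", off_chip)
```

With the table values this gives roughly 4x to 10x lower energy and 32x to 64x higher density on chip, and roughly 20x to 50x lower energy and 60x to 120x higher density off chip.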
Monolithic Si-Photonics for core-to-core and
core-to-DRAM networks
Supercomputers
Si-photonics in advanced
CMOS and DRAM process
NO costly process changes
Embedded apps
Bandwidth density – need dense WDM
Energy-efficiency – need monolithic integration
Many architectural studies show promise
[Shacham’07]
[Petracca’08]
[Koka’08-10]
[Joshi’09]
[Pan’09]
[Batten’08]
[Vantrease’08]
[Psota’07]
[Kirman’06]
[Beamer’10]
[Figure: WDM photonic link: mux and registers, pre-driver, mod-driver, receiver front-end, samplers and monitoring, demux, PLL or optical clock with phase adjust]
Dense WDM – 128 wavelengths/waveguide - >1Tb/s per waveguide
Need 1000’s of transceivers on die with < 100fJ/bit cost at > 10Gb/s!
- Optimized modulator circuits/devices
- Optimized receiver circuits/photo-detector
- Optimized thermal tuning
Big Challenge: Efficient integration with
circuits in advanced CMOS process
Need to optimize carefully
[Figure: link energy vs. data rate, assuming 32nm CMOS and 512 Gb/s aggregate throughput]
Laser energy increases with data-rate
  - Limited Rx sensitivity
  - Modulation more expensive -> extinction ratio / insertion loss trade-off
Tuning costs decrease with data-rate
Moderate data rates most energy-efficient
Georgas CICC 2011
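The reason a moderate per-wavelength data rate wins can be sketched with a toy energy model. Everything numeric here is a hypothetical assumption, not data from the cited work: laser energy per bit is taken to grow linearly with data rate, ring tuning is a fixed power amortized over the bits sent, and the remaining circuit energy is constant.

```python
import math


def link_energy_fj_per_bit(rate_gbps, laser_slope=5.0, fixed_fj=20.0, tuning_uw=125.0):
    """Toy per-bit energy of one wavelength channel, in fJ/bit.

    laser_slope : assumed laser fJ/bit per Gb/s (laser cost rises with rate)
    fixed_fj    : assumed rate-independent circuit energy
    tuning_uw   : assumed constant tuning power; uW / (Gb/s) gives fJ/bit
    """
    laser = laser_slope * rate_gbps
    tuning = tuning_uw / rate_gbps
    return laser + fixed_fj + tuning


def optimal_rate_gbps(laser_slope=5.0, tuning_uw=125.0):
    """Rate minimizing a*r + b + c/r, which is sqrt(c / a)."""
    return math.sqrt(tuning_uw / laser_slope)


best = optimal_rate_gbps()
for rate in (1.0, best, 20.0):
    print(f"{rate:5.1f} Gb/s -> {link_energy_fj_per_bit(rate):6.1f} fJ/bit")
```

At low rates the fixed tuning power dominates each bit; at high rates the laser term dominates. The minimum sits where the two are equal, which is why the most efficient point is neither the slowest nor the fastest link.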
DWDM link efficiency optimization
[Figure: bandwidth density comparison: electrical bump-pitch limited to <1Tb/s/mm2; package pin limit 0.05 Tb/s/mm2; photonic link >10x]
Optimize for min energy-cost
Bandwidth density dominated by circuit and photonics area (not coupler pitch)
  - 10x better than electrical bump limited
  - 200x better than electrical package pin limit
Photonic DRAM Network Organization
[Figure: CPU with laser input and memory controllers MC 1 to MC K (MC 16), connected through cmd, Dwr and Drd channels to Super DIMMs 1 to K, each holding DRAM cubes 1 to 4; die-die switch and mem scheduler between (cube 1, die 1) and (cube 1, die 8); legend: modulator bank, receiver/PD bank, tunable filterbank, through silicon via, through silicon via hole, processor die]
Important Concepts
- Power/message switching (only to active DRAM chip in DRAM cube/super DIMM)
- Vertical die-to-die coupling (minimizes cabling - 8 dies per DRAM cube)
- Command distributed electrically (broadcast)
- Data photonic (single writer multiple readers)
Beamer ISCA 2010
Optimizing DRAM with photonics
[Figure: DRAM floorplans, P1 and P4]
Beamer ISCA 2010
Cross-layer modeling identifies key device requirements
Feedback to device designers
[Figure: optical laser power and die area overhead as functions of waveguide loss (dB/cm) and through loss (dB/ring)]
Waveguide loss and through loss limits for 2 W optical laser power
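The kind of feedback this gives to device designers can be sketched as a simple optical power budget. Only the 2 W laser budget comes from the slide; the path length, ring count, fixed losses, receiver power and wavelength count below are hypothetical placeholders.

```python
def path_loss_db(waveguide_db_per_cm, length_cm, through_db_per_ring,
                 rings_passed, fixed_loss_db=0.0):
    """Total optical loss along one path, in dB."""
    return (waveguide_db_per_cm * length_cm
            + through_db_per_ring * rings_passed
            + fixed_loss_db)


def laser_power_w(rx_power_uw, loss_db, n_wavelengths):
    """Laser power needed so each wavelength still delivers rx_power_uw."""
    return n_wavelengths * rx_power_uw * 1e-6 * 10 ** (loss_db / 10.0)


def within_budget(waveguide_db_per_cm, through_db_per_ring, budget_w=2.0,
                  length_cm=4.0, rings_passed=128, fixed_loss_db=5.0,
                  rx_power_uw=10.0, n_wavelengths=1024):
    """Check one (waveguide loss, through loss) pair against the laser budget."""
    loss = path_loss_db(waveguide_db_per_cm, length_cm, through_db_per_ring,
                        rings_passed, fixed_loss_db)
    return laser_power_w(rx_power_uw, loss, n_wavelengths) <= budget_w


for wg_loss, ring_loss in [(1.0, 0.01), (3.0, 0.01), (1.0, 0.1), (4.0, 0.1)]:
    ok = within_budget(wg_loss, ring_loss)
    print(f"waveguide {wg_loss} dB/cm, through {ring_loss} dB/ring -> within 2 W: {ok}")
```

Because loss enters exponentially, small changes in per-ring through loss or per-centimeter waveguide loss move a design across the budget line quickly, which is what makes these two parameters useful targets for device work.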
Significant integration activity,
but hybrid and older processes …
130nm
thick BOX SOI
130nm
thick BOX SOI
[IBM]
[Luxtera/Oracle/Kotura]
[Many schools]
Bulk CMOS
Backend
monolithic
[Intel]
[HP]
[Watts/Sandia/MIT]
[Lipson/Cornell]
[Kimerling/MIT]
Monolithic CMOS photonic integration
[Figure: optical mode; photo credit: Intel]
Polysilicon - transistor gates, local interconnect and resistors
  - Use for photonic components instead of, or together with, the silicon body in SOI
Sub-100nm lithography has 1-5 nm design grid
  - Enables edge roughness necessary for photonic devices
EOS Platform for Monolithic CMOS photonic integration
Joint work with Ram and Popovic
[Figure: platform chips, 2007 to 2011: 90 nm bulk CMOS (IBM cmos9sf), 65 nm bulk CMOS (Texas Instruments), 32 nm bulk CMOS (Texas Instruments), 45 nm SOI CMOS (IBM 12SOIs0); inset: transmission (dB) vs. frequency (GHz)]
Create integration platform to accelerate technology development and adoption
A 32nm bulk CMOS photonic platform

Monolithic CMOS photonic platform integrated with CMOS circuits
  - 32nm process – fabrication support from Texas Instruments
  - Robust post-processing steps at MIT
Second-order resonator filterbank shows process precision
  - Great on-die matching (rings track within 40GHz)
  - Record thermal heating efficiency 25uW/K
Orcutt et al – CLEO 2008, Optics Express 2011
EOS: A 45nm SOI Monolithic Photonic Platform
Polysilicon and Silicon Photonics on Thin BOX IBM SOI
6 rows of electronic-photonic WDM links with body and polysilicon photonic devices
54 Transmit-receive testsites, ~3M transistors and hundreds of photonic devices
[Figure: die photo with annotated rows]
- Electrical and photonic integration – test row
- Body and polysilicon photonic devices
- Filterbanks, waveguide paperclips, rings, standalone modulators and photodetectors
Integration of photonics into VLSI tools
Design flow:
- Photonic device p-cell -> layout of photonics -> abstract -> LEF
- Layout of circuit blocks -> abstract -> LEF
- LEF of standard cells, I/O pads (provided by ARM)
- Chip-level verilog (instantiation of .LEF macros and connectivity)
- Floorplan (macro placement, power grid, routing constraints)
- Technology files
- SOC Encounter place and route -> placed and routed layout
- Custom photonics-friendly auto-fill

modulator.LEF

VERSION 5.6 ;
BUSBITCHARS "[]" ;
DIVIDERCHAR "/" ;
MACRO block_electronic_etch_row_1
  CLASS BLOCK ;
  ORIGIN -208 -1794 ;
  FOREIGN block_electronic_etch_row_1 208 1794 ;
  SIZE 2488 BY 165 ;
  SYMMETRY X Y R90 ;
  PIN heater_a_1
    DIRECTION INOUT ;
    USE SIGNAL ;
    PORT
      LAYER ua ;
      RECT 431 1870.5 436.5 1882 ;
    END
  END heater_a_1
  ...
  OBS
    LAYER m1 ;
    RECT 208 1794 2696 1959 ;
    ...
  END
END block_electronic_etch_row_1
END LIBRARY
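When photonic blocks enter place and route as LEF macros, a quick sanity check of each abstract is useful. The following is a minimal sketch, not a full LEF parser and not part of the original flow: it only reads MACRO names, SIZE and PIN names from a fragment shaped like the one on the slide (with the elided `...` parts left out).

```python
LEF_TEXT = """\
MACRO block_electronic_etch_row_1
  CLASS BLOCK ;
  ORIGIN -208 -1794 ;
  SIZE 2488 BY 165 ;
  PIN heater_a_1
    DIRECTION INOUT ;
    USE SIGNAL ;
    PORT
      LAYER ua ;
      RECT 431 1870.5 436.5 1882 ;
    END
  END heater_a_1
END block_electronic_etch_row_1
"""


def parse_macros(lef_text):
    """Collect {macro_name: {"size": (w, h), "pins": [...]}} from LEF text.

    Handles only the MACRO, SIZE and PIN statements; everything else is ignored.
    """
    macros = {}
    current = None
    for raw in lef_text.splitlines():
        tokens = raw.replace(";", " ").split()
        if not tokens:
            continue
        if tokens[0] == "MACRO":
            current = {"size": None, "pins": []}
            macros[tokens[1]] = current
        elif current is not None and tokens[0] == "SIZE":
            # Statement form: SIZE <width> BY <height>
            current["size"] = (float(tokens[1]), float(tokens[3]))
        elif current is not None and tokens[0] == "PIN":
            current["pins"].append(tokens[1])
    return macros


macros = parse_macros(LEF_TEXT)
for name, info in macros.items():
    print(name, info["size"], info["pins"])
```

In the slide's fragment the macro SIZE (2488 BY 165) matches the extent of the OBS rectangle (208 1794 2696 1959), so the whole block is marked as a routing obstruction on m1; a check like this can confirm that relationship automatically for every photonic macro.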
Platform Organization
28
A full electro-optical test setup
[Figure: test setup: microscope, two fiber positioners, DUT on chip board, HS clocks, control board with FPGA, USB to laptop]
Extremely good dimensional tolerances in 45nm SOI
Good body waveguide loss
  - 3.7dB/cm at ~1220nm
Integrated Delta-Sigma Heat Control
~10mW required to retune all 8 rings
Thermal tuning BW lower than 500kHz
Tuning control overhead negligible
Tuning efficiency 2.6mW/nm (32.4mW/2π)
  - On fully substrate removed die
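The two tuning-efficiency figures can be related with a short calculation. The sketch assumes that "2π" refers to tuning a ring across one full resonance period (its free spectral range) and that the ~10 mW budget is shared evenly by the 8 rings; both readings are assumptions, not statements from the slides.

```python
TUNING_MW_PER_NM = 2.6    # from the slide
TUNING_MW_PER_2PI = 32.4  # from the slide
RETUNE_BUDGET_MW = 10.0   # "~10mW required to retune all 8 rings"
N_RINGS = 8


def implied_period_nm(mw_per_2pi=TUNING_MW_PER_2PI, mw_per_nm=TUNING_MW_PER_NM):
    """Wavelength span covered by a full 2*pi shift, in nm."""
    return mw_per_2pi / mw_per_nm


def tuning_power_mw(shift_nm, mw_per_nm=TUNING_MW_PER_NM):
    """Heater power needed to shift one ring by shift_nm."""
    return mw_per_nm * shift_nm


def average_shift_nm(budget_mw=RETUNE_BUDGET_MW, n_rings=N_RINGS,
                     mw_per_nm=TUNING_MW_PER_NM):
    """Average per-ring shift that the total budget pays for."""
    return budget_mw / n_rings / mw_per_nm


print(f"implied resonance period: {implied_period_nm():.2f} nm")
print(f"average correction per ring: {average_shift_nm():.2f} nm")
```

Under these assumptions a full period is about 12.5 nm, and the 10 mW budget corresponds to an average correction of roughly 0.5 nm per ring.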
Current-sensing optical data receiver
Receiver detects photo current
50fJ/b, uA sensitivities, 3-5Gb/s
Georgas ESSCIRC 2011
Receiver sensitivity
[Figure: receiver front-end schematic with inputs in-, in+ and photocurrent IPHOTO, clocked by Φ]
Exponential dependence on wire capacitance
Linear dependence on photo-detector capacitance
Modulator test site
Silicon carrier injection modulator monolithically integrated with transistors
• Extinction ratio 19dB
• 45GHz 3dB bandwidth
• Carrier lifetime ~2-3ns
• Requires flexible drive circuits
  • Sub-bit pre-emphasis
  • Split-supplies
First ever dynamic electro-optic test in 45nm SOI
[Figure: modulator and modulator driver]
Modulation data-rate up to 1Gb/s
  - 5-10 Gb/s achievable with device and biasing optimization
  - Lots of room to improve circuit/device designs
Transistors and Photonics can be built together in advanced CMOS!
Improving computation efficiency
Energy-efficient computation and communication
CMOS – need cross-cut approach to keep scaling performance
Post-CMOS – need cross-cut approach to guide new devices/systems
[Figure: cross-cut design layers, with an inset plot of energy/bit (pJ/bit) vs. data rate density (Gbps/um)]
- Network & µArchitecture
- Design Optimization
- Communications (Eq., Mod, Coding)
- Circuit modeling, Characterization
- Circuits & Logic: Tx, Rx, Ctrl, Meas
- Interconnect and switch technology: Cu, MOSFET, Si-Photonics, NEMS relay
Nano-electro-mechanical (NEM) relays
Joint work with T-J. King Liu, E. Alon and D. Markovic (UCB, UCLA)
[Figure: relay schematic and cross-section A-A': gate, gate oxide, body, source, drain, channel; dimension labels as extracted: 30mm, 7.5mm, 90nm]
Nearly ideal switching characteristics:
  - Low on-state resistance (Ron <1kΩ)
  - Infinite off-state resistance -> Zero off-state leakage
Why not use relays to compute?
- Need to compare at block level
[Figure: same function in CMOS (4 gate delays) and in NEMS (12 relays, 1 mechanical delay)]
Delay Comparison vs. CMOS
  - Single mechanical delay vs. several electrical gate delays
  - For reasonable load, NEMS delay unaffected by fan-out/fan-in
Area Comparison vs. CMOS
  - Larger individual devices
  - But often need fewer devices to implement same function
F. Chen et al., “Integrated Circuit Design with NEM Relays,” ICCAD 2008
Scaled NEMS vs. CMOS adders
[Figure: energy/op vs. delay/op across Vdd; annotations: 9x, 10x]
Compare vs. Sklansky CMOS adder*
  - 90nm technology
30x less capacitance
2.4x lower Vdd
- Lower device Cg, Cd
- Fewer devices
- No leakage energy
For similar area: >9x lower E/op, >10x greater delay
Scaled relays limited by contact surface energy
- 2aJ for 90nm litho – 50x better than 90nm CMOS
*D. Patil et al., “Robust Energy-Efficient Adder Topologies,” in Proc. 18th IEEE Symp. on Computer Arithmetic (ARITH'07).
Contact resistance
- Feedback from system level
[Figure: energy/op vs. delay/op across Vdd & CL]
Low contact R not critical
Good news for reliability…
Can build test platforms that work
CLICKR technology development platform:
NEM relay-based circuits
ISSCC 2010 – TD Award
F. Chen et al, ISSCC2010
M. Spencer et al, JSSC Jan’11
Towards more complex designs
[Figure: relay circuit schematics (a) to (d) with inputs A0 to A6, outputs Y0 to Y2 and generate/kill signals; input code; die photos with scale labels 8mm and 700μm]
[Figure: 16-bit multipliers, energy/op (fJ) vs. delay (ns): OTCT (90nm), Dadda/HC (45nm), scaled MEM relay, 16X parallel]
Multiplier building block: 7:3 compressor
98 relays – largest working relay circuit to date
Energy-benefit preserved even in more complex functions
Fariborzi ASSCC 2011
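For readers unfamiliar with the building block, a 7:3 compressor counts how many of its seven input bits are 1 and reports the count as a 3-bit number. The behavioural sketch below shows only that logic function; it says nothing about how the 98-relay circuit is wired, and the output names Y2, Y1, Y0 are borrowed from the figure labels.

```python
from itertools import product


def compressor_7_3(bits):
    """Return (Y2, Y1, Y0), the binary count of ones among 7 input bits."""
    if len(bits) != 7 or any(b not in (0, 1) for b in bits):
        raise ValueError("expected exactly seven bits, each 0 or 1")
    count = sum(bits)
    return (count >> 2) & 1, (count >> 1) & 1, count & 1


# Exhaustive check over all 128 input combinations.
for bits in product((0, 1), repeat=7):
    y2, y1, y0 = compressor_7_3(bits)
    assert 4 * y2 + 2 * y1 + y0 == sum(bits)

print(compressor_7_3((1, 0, 1, 1, 0, 0, 1)))  # four ones -> (1, 0, 0)
```

In a multiplier, compressors like this reduce columns of partial-product bits before the final carry-propagate addition.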
NEM Relay VLSI design infrastructure
[Figure: design flow: device -> Verilog-A model -> schematic and Spectre simulation; p-cell -> layout; Verilog -> logic synthesis -> place & route -> LVS, DRC]
- Verilog-A model and Logic Synthesis created for NEMS technology
- The flow supports multiple device designs and foundries
Toward full systems - NEM Relay scaling
1um litho: relay size 120um x 150um
0.25um litho: scaled relay size 20um x 20um
Sematech
Microcontroller Test-Chip
[Figure: floorplan: 64x8b scratchpad, 64x18b program memory, 32x10b program stack, instruction decode, control logic, register file + ALU, 2 x 72 I/O pads]
- 12k relays
- 9mm x 6mm (using 85um x 53um devices)
Summary

Cross-layer modeling and design key to continued system performance scaling
  - Fast design-space exploration
  - Feedback to all layers of design hierarchy
Building early technology development platforms
  - Feedback to device and circuit designers
  - Accelerated adoption
EOS Platform designed for multi-project wafer runs
  - 50 fJ/b receivers with uA sensitivities
  - Record-high tuning efficiency with undercut ~ 25uW/K
  - First modulation demonstrated in 45nm process
CLICKR Platform designed for multiple foundries and devices
  - Energy-gains preserved for larger blocks
  - Designs moving toward scaled devices and full VLSI systems
Acknowledgments

Devices: Tsu-Jae King Liu, Rajeev Ram, Miloš Popović, Henry Smith
Architecture: Krste Asanović, Christopher Batten, Ajay Joshi
Circuits: Elad Alon, Dejan Marković
Students:
  Devices - Jason Orcutt, Anatoly Khilo, Jie Sun, Cheryl Sorace, Eugen Zgraggen, Jaeseok Jeon, Rhesa Nathanael, Hei Kam
  Circuits - Michael Georgas, Jonathan Leu, Ben Moss, Chen Sun, Fred Chen, Byungsub Kim, Hossein Fariborzi, Matthew Spencer, Chengcheng Wang, Kevin Dwan
  Architecture - Yong-Jin Kwon, Scott Beamer, Chen Sun, Imran Shamim
DARPA MTO
Texas Instruments – Dennis Buss and Tom Bonifield
IBM and Trusted Foundry
Intel Corporation – Ian Young and Alex Kern