Building Modern Integrated Systems:
A Cross-cut Approach
Vladimir Stojanović
Integrated Systems Group
Massachusetts Institute of Technology
Acknowledgments

Devices: Rajeev Ram, Miloš Popović, Tsu-Jae King Liu

Architecture: Krste Asanović, Christopher Batten, Ajay Joshi

Circuits: Elad Alon, Dejan Marković

Post-docs and students:

Devices - Jason Orcutt, Jeffrey Shainline, Anatoly Khilo, Jie Sun, Cheryl Sorace, Mark
Wade, Eugen Zgraggen, Jaeseok Jeon, Rhesa Nathanael, Hei Kam, Stevan Urosevic,
Josh Wang, Ekaterina Kononov

Circuits – Michael Georgas, Jonathan Leu, Ben Moss, Chen Sun, Yu-Hsin Chen, Fred
Chen, Byungsub Kim, Hossein Fariborzi, Matthew Spencer, Chengcheng Wang, Kevin
Dwan

Architecture – Yunsup Lee, Yong-Jin Kwon, Scott Beamer, Chen Sun,
Yu-Hsin Chen, Imran Shamim

Micron Technology: Roy Meade, Gurtej Sandhu, Reha Bafrali, Emanuele Baracchi, Miri
Baruch, Erez Conforti, Paul Daley, Eyal Friedman, Harel Frish, Dana Haran, Lilach Makrabi,
Maxim Rabinovitch, Matt Ross, Yoel Shetrit, Zvi Sternberg, Ofer Tehar-Zahav

DARPA MTO
IBM Trusted Foundry
Texas Instruments – Dennis Buss and Tom Bonifield
Intel Corporation – Ian Young and Alex Kern



Chip design needs radical changes


We already have more devices than we can use at once
Limited by power density and bandwidth
Intel 4004 (1971): 4-bit processor, 2312 transistors, ~100 KIPS, 10 micron PMOS, 11 mm2 chip
Intel Knights Corner: 50 cores, 200 threads
Oracle T5: 16 cores, 128 threads
Nvidia Fermi: 540 CUDA cores
IBM Power 7: 8 cores, 32 threads
1000s of processor cores and accelerators per die
"The Processor is the new Transistor" [Rowen]
Subthreshold leakage: Game over for CMOS
[Figure, left: normalized energy/cycle vs. Vdd (V), showing Etotal, Edynamic and Eleak]
[Figure, right: normalized energy/op vs. 1/throughput (ps/op) when scaling Vdd & VT]
CMOS circuits have well-defined minimum energy



Caused by leakage and finite sub-threshold slope
Need to balance leakage and active energy
Limits energy-efficiency, regardless of how slowly the circuit runs
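The balance between leakage and active energy can be illustrated numerically. The sketch below is not the model behind the plots on this slide; it assumes a textbook subthreshold approximation in which dynamic energy scales as C·Vdd² and leakage energy per operation scales as C·Vdd²·exp(-Vdd/(n·vT)), because the circuit gets exponentially slower as Vdd drops. The function names and all parameter values (`leak_factor`, `n`, `v_thermal`) are hypothetical and chosen only to make the minimum visible.

```python
import math


def energy_per_op(vdd, c_eff=1.0, leak_factor=2000.0, n=1.5, v_thermal=0.026):
    """Return (dynamic, leakage, total) energy per operation in normalized units.

    Assumed model: delay is proportional to exp(-vdd / (n * v_thermal)) in
    subthreshold, so leakage integrated over one operation scales the same way.
    All parameter values are illustrative, not fitted to any process.
    """
    e_dyn = c_eff * vdd ** 2
    e_leak = leak_factor * c_eff * vdd ** 2 * math.exp(-vdd / (n * v_thermal))
    return e_dyn, e_leak, e_dyn + e_leak


def min_energy_vdd(v_lo=0.1, v_hi=0.5, steps=401):
    """Grid-search the supply voltage that minimizes total energy per operation."""
    grid = [v_lo + (v_hi - v_lo) * i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda v: energy_per_op(v)[2])


if __name__ == "__main__":
    v_opt = min_energy_vdd()
    e_dyn, e_leak, e_tot = energy_per_op(v_opt)
    print(f"minimum-energy Vdd ~ {v_opt:.3f} V")
    print(f"dynamic={e_dyn:.4f} leakage={e_leak:.4f} total={e_tot:.4f}")
```

Below the minimum, the leakage term dominates because each operation takes exponentially longer; above it, the Vdd² term dominates. Running slower therefore stops helping once the supply drops past that point.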
Wire and I/O scaling problems
[Figure, I/O: energy-cost (pJ/b) vs. data-rate (Gb/s) for the best electrical links, chip-to-chip and backplane, annotated with channel losses of ~10dB and ~20-25dB]
[Figure, on-chip wires: copper resistivity]


Increased wire resistivity makes wire caps scale very slowly
Can’t get both energy-efficiency and high-data rate in I/O
Bandwidth, pin count and power scaling
[Figure: package pin count, from 2- and 4-core chips to 256-core chips]
Need 16k pins in 2017 for HPC*
2 TFlop/s, 1 Byte/Flop, signal pins @ 20 Gb/s/link
* More than half of the pins are for power supply
Emerging devices can help
Energy-efficient computation and communication
CMOS: need cross-cut approach to keep scaling performance
Post-CMOS: need cross-cut approach to guide new devices/systems
[Figure: cross-cut design stack, top to bottom: Network & µArchitecture; Design Optimization; Communications (Eq., Mod, Coding); Circuit modeling, Characterization; Circuits & Logic (Tx, Rx, Ctrl, Meas); Interconnect and switch technology (Cu, MOSFET, Si-Photonics)]
[Inset plot: energy/bit (pJ/bit) vs. data rate density (Gbps/um) for repeated links and for equalized links with 30mV, 50mV and 90mV eyes]
Monolithic Si-Photonics for core-to-core and
core-to-DRAM networks
Supercomputers
Si-photonics in advanced
CMOS and DRAM process
NO costly process changes
Embedded apps
Bandwidth density – need dense WDM
Energy-efficiency – need monolithic integration
Monolithic CMOS photonic integration
<150 nm SiO2
Thin BOX SOI CMOS Electronics
Bulk CMOS Electronics
Si and polySi waveguide formation
Integrated photonic interconnects
• Each λ carries one bit of data
Bandwidth Density achieved
through DWDM
Energy-efficiency achieved
through low-loss optical
components and
tight integration
Single channel link tradeoffs
[Figure: link energy tradeoffs for two parameter sweeps]
Loss: 10 dB, 15 dB
Rx cap: 5 fF, 25 fF
Resonance sensitivity
[Figure: direct thermal tuning; wafer-level ring variation, data from our Micron designs]


Process and temperature shift resonances
Direct thermal tuning cost prohibitive
Georgas CICC 2011, Sun NOCS 2012
Smarter wavelength tuning
Georgas CICC 2011,
Sun NOCS 2012
Nearest channel
tuning + reshuffling


Utilize systematic global mismatch and
temperature shifts
Electrical backend enables dense WDM

Helps reduce tuning costs
by more than 10x
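A toy model can show why nearest-channel tuning with reshuffling is cheaper than direct tuning. The sketch below is an assumption-laden simplification, not the scheme from the cited papers: it treats the channel grid as cyclic over one free spectral range, lets heaters shift a resonance in one direction only, and applies one common channel rotation to all rings, with the electrical backend assumed to reorder the bits afterwards. Every function name and number here is hypothetical.

```python
import random


def heat_only_shift(actual, target, fsr):
    """Shift needed to move a resonance up to the target, wrapping over one FSR."""
    return (target - actual) % fsr


def direct_tuning_cost(resonances, spacing=1.0):
    """Total shift when ring i must be tuned to its own nominal channel i."""
    n = len(resonances)
    fsr = n * spacing
    return sum(heat_only_shift(r, i * spacing, fsr) for i, r in enumerate(resonances))


def reshuffled_tuning_cost(resonances, spacing=1.0):
    """Total shift when all rings may lock to a rotated channel assignment.

    Ring i locks to channel (i + k) mod n; the rotation k with the lowest total
    shift is chosen, and the backend is assumed to undo the rotation.
    """
    n = len(resonances)
    fsr = n * spacing
    costs = []
    for k in range(n):
        cost = sum(
            heat_only_shift(r, ((i + k) % n) * spacing, fsr)
            for i, r in enumerate(resonances)
        )
        costs.append((cost, k))
    return min(costs)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical die: a systematic offset of 2.3 channel spacings plus small local mismatch.
    rings = [i + 2.3 + rng.uniform(-0.1, 0.1) for i in range(8)]
    direct = direct_tuning_cost(rings)
    reshuffled, rotation = reshuffled_tuning_cost(rings)
    print(f"direct: {direct:.2f} spacings, reshuffled: {reshuffled:.2f} (rotation {rotation})")
```

Because the systematic offset is shared by all rings, a single rotation absorbs most of it and only the residual local mismatch has to be tuned out. The size of the saving in this toy depends entirely on the made-up offset and should not be read as reproducing the reported figure.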
Need to optimize carefully
512 Gb/s aggregate throughput, assuming 32nm CMOS
Laser energy increases with data-rate
  - Limited Rx sensitivity
  - Modulation more expensive -> lower extinction ratio
Tuning costs decrease with data-rate
Georgas CICC 2011
Moderate data rates most energy-efficient
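The two opposing trends can be captured in a small back-of-the-envelope model. The sketch below assumes a per-bit energy with three terms: a fixed part, a part that grows linearly with the per-wavelength data rate (standing in for laser and modulation cost), and a tuning part equal to a fixed per-channel tuning power divided by the data rate. Only the 512 Gb/s aggregate comes from the slide; the coefficients and function names are hypothetical.

```python
AGGREGATE_GBPS = 512  # aggregate throughput from the slide


def energy_per_bit_fj(rate_gbps, fixed_fj=20.0, slope_fj_per_gbps=3.125,
                      tuning_uw_per_channel=200.0):
    """Illustrative energy per bit (fJ/b) for one wavelength channel.

    Tuning power is paid per channel regardless of speed, so its per-bit cost
    falls as 1/rate: 1 uW at 1 Gb/s is exactly 1 fJ/b.
    """
    rate_dependent = slope_fj_per_gbps * rate_gbps
    tuning = tuning_uw_per_channel / rate_gbps
    return fixed_fj + rate_dependent + tuning


def channels_for(rate_gbps):
    """Number of wavelength channels needed to reach the aggregate throughput."""
    return AGGREGATE_GBPS // rate_gbps


def best_rate(candidates=(1, 2, 4, 8, 16, 32)):
    """Pick the candidate per-channel rate with the lowest energy per bit."""
    return min(candidates, key=energy_per_bit_fj)


if __name__ == "__main__":
    for rate in (1, 2, 4, 8, 16, 32):
        print(f"{rate:2d} Gb/s x {channels_for(rate):3d} channels: "
              f"{energy_per_bit_fj(rate):6.1f} fJ/b")
    print("lowest energy at", best_rate(), "Gb/s per channel")
```

With these made-up coefficients the optimum lands at a moderate per-channel rate, where the rising and falling terms are equal. Different coefficients move the optimum, but the shape of the tradeoff stays the same.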
DWDM link efficiency optimization


Optimize for min energy-cost
Bandwidth density dominated by circuit and photonics
area (not coupler pitch)


10x better than electrical bump limited
200x better than electrical package pin limit
Many architectural studies show promise
[Shacham’07]
[Petracca’08]
[Koka’08-10]
[Joshi’09]
[Pan’09]
[Batten’08]
[Vantrease’08]
[Psota’07]
[Kirman’06]
[Beamer’10]
Photonic memory interface – leveraging
optical bandwidth density
Important Concepts
- Power/message switching (only to active DRAM chip in DRAM cube/super DIMM)
- Vertical die-to-die coupling (minimizes cabling - 8 dies per DRAM cube)
- Command distributed electrically (broadcast)
- Data photonic (single writer multiple readers)
Enables energy-efficient throughput and capacity scaling per memory channel
Beamer ISCA 2010
[Figure: processor die (CPU, laser in) with memory controllers MC 1 ... MC K ... MC 16, mem scheduler, modulator bank, receiver/PD bank and tunable filterbank, connected over cmd, Dwr and Drd links to Super DIMM K holding DRAM cubes 1 to 4; each cube stacks dies 1 to 8 with a die-die switch and through silicon vias]
Laser power guiding effectiveness

Enables capacity scaling per channel and significant savings in laser energy
Beamer ISCA 2010
Optimizing DRAM with photonics
[Figure: DRAM floorplan options, labeled P1 and P4]
Beamer ISCA 2010
Design Space Exploration of Networks Tool
DSENT – A Tool Connecting Emerging Photonics with Electronics for OptoElectronic Networks Modeling
Chen NOCS 2012
[Figure: DSENT framework]
Model parameters: Nin, Nout, fclock, ...
Technology parameters: Process, VDD, Wmin, T, ...
User-defined models: Arbiter, Router, Mesh Network, Multiplexer, Crossbar, Repeated Link, Electrical Clos, Decoder, Buffers, Optical Link, Photonic Clos
Support models: Standard Cells, Optical Link Components, Technology Characterization
Tools: Timing Optimization, Expected Transitions, Optical Link Optimization
Reported quantities: Area, Non-Data-Dependent Power, Data-Dependent Energy, Delay
Available for download at:
https://sites.google.com/site/mitdsent/
Kurian IPDPS 2012
Significant integration activity,
but hybrid and older processes …
130nm
thick BOX SOI
130nm
thick BOX SOI
[IBM]
[Luxtera/Oracle/Kotura]
[Many schools]
Bulk CMOS
Backend
monolithic
[Intel]
[HP]
[Watts/Sandia/MIT]
[Lipson/Cornell]
[Kimerling/MIT]
EOS Platform for Monolithic CMOS photonic integration
Joint work with Ram and Popovic
[Figure: test chips from 2007 to 2011 in 65 nm bulk CMOS (Texas Instruments), 90 nm bulk CMOS (IBM cmos9sf), 32 nm bulk CMOS (Texas Instruments) and 45 nm SOI CMOS (IBM 12SOI); inset plot of transmission (dB) vs. frequency (GHz)]
Create integration platform to accelerate
technology development and adoption
EOS Platform: Fabricated in IBM 45nm SOI
Orcutt et al, Optics Express, 2012
3 x 3 mm die
45nm Thin Box SOI Technology (used for Power 7 and Cell processors)
3M Transistors
400 Pads
ARM Standard Cells and custom link circuits
EOS performance summary
Fiber-to-chip grating couplers with 3.5 dB insertion loss
Waveguides under 4dB/cm propagation loss
10 dB extinction optical modulators
8 channel wavelength division multiplexing filter bank with <-20 dB cross talk
SiGe photodetectors
All integrated with electronic circuits
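Component figures like these are typically combined into a link loss budget. The sketch below is a rough illustration, not a measured result: the coupler loss and the waveguide loss bound come from the summary above, while the number of couplers, the waveguide length, the modulator insertion loss, the photodetector responsivity and the required photocurrent are all hypothetical values picked for the example.

```python
def link_loss_db(coupler_db=3.5, n_couplers=2, waveguide_db_per_cm=4.0,
                 waveguide_cm=0.5, modulator_insertion_db=3.0):
    """Sum the optical losses along one path, in dB.

    coupler_db and waveguide_db_per_cm follow the platform summary; the other
    defaults are assumptions made for illustration only.
    """
    return (n_couplers * coupler_db
            + waveguide_db_per_cm * waveguide_cm
            + modulator_insertion_db)


def required_laser_power_w(loss_db, photocurrent_a=10e-6, responsivity_a_per_w=0.5):
    """Laser power needed so the detector still sees the required photocurrent."""
    power_at_detector_w = photocurrent_a / responsivity_a_per_w
    return power_at_detector_w * 10 ** (loss_db / 10.0)


if __name__ == "__main__":
    loss = link_loss_db()
    laser = required_laser_power_w(loss)
    print(f"path loss: {loss:.1f} dB")
    print(f"laser power per wavelength: {laser * 1e6:.0f} uW")
```

Every extra dB of loss multiplies the laser power by about 1.26, which is why low-loss passives and sensitive receivers matter so much for link energy.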
Integration of photonics into VLSI tools
[Figure: design flow. The layout of photonics (photonic device p-cells) and the layout of circuit blocks are each abstracted to LEF. Together with the LEF of standard cells and I/O pads (provided by ARM), the chip-level verilog (instantiation of .LEF macros and connectivity), the floorplan (macro placement, power grid, routing constraints) and the technology files, they go into SOC Encounter place and route, which produces the placed and routed layout, followed by custom photonics-friendly auto-fill]

modulator.LEF
  VERSION 5.6 ;
  BUSBITCHARS "[]" ;
  DIVIDERCHAR "/" ;
  MACRO block_electronic_etch_row_1
    CLASS BLOCK ;
    ORIGIN -208 -1794 ;
    FOREIGN block_electronic_etch_row_1 208 1794 ;
    SIZE 2488 BY 165 ;
    SYMMETRY X Y R90 ;
    PIN heater_a_1
      DIRECTION INOUT ;
      USE SIGNAL ;
      PORT
        LAYER ua ;
        RECT 431 1870.5 436.5 1882 ;
      END
    END heater_a_1
    ...
    OBS
      LAYER m1 ;
      RECT 208 1794 2696 1959 ;
      ...
    END
  END block_electronic_etch_row_1
  END LIBRARY
Platform organization
A full electro-optical test setup
[Figure: test setup with microscope, two fiber positioners and the DUT on a chip board; HS clocks; control board with FPGA; USB to laptop]
Best waveguide losses ever reported in a
sub-100nm production CMOS line

Body-Si waveguides


470nm width
Poly waveguides


3-4dB/cm loss
50dB/cm loss
700nm width
700nm width
Body-Si ring Q factor


227k @ 1280nm
112k @ 1550nm
Exceptional dimensional control in 45nm node
[Figure: 8-channel filterbank from input to through port, with drop ports drop1 to drop8; 250 GHz spacing, 30 GHz bandwidth, > 20 dB isolation]

8-wavelength filterbank results
- Filter channels fabricated in order
- Less than 1nm variation
Excellent channel isolation (>20dB at 250GHz spacing)
Integrated delta-sigma heater control
~10mW required to retune all 8 rings
Tuning efficiency
- 130µW/K on fully substrate removed die (45nm SOI)
- 100µW/K on Deep-trench bulk die (Micron)
Thermal tuning BW ~1MHz
Tuning control overhead negligible
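Two small calculations help put these numbers in context. The sketch below first divides the quoted ~10mW over 8 rings and converts it to a temperature shift using the 130µW/K figure, then shows a generic first-order delta-sigma modulator that turns a fractional heater setting into a 1-bit drive stream. The modulator is a textbook accumulator form, offered as an assumption about how such a controller could work and not as the circuit on the chip; all names are hypothetical.

```python
def per_ring_shift_kelvin(total_power_w=10e-3, n_rings=8, efficiency_w_per_k=130e-6):
    """Temperature shift per ring if the total heater power is split evenly."""
    return (total_power_w / n_rings) / efficiency_w_per_k


def delta_sigma_bits(duty, n_samples):
    """First-order delta-sigma modulator in accumulator form.

    duty is the desired average heater drive in [0, 1). The accumulator
    overflows with a rate equal to duty, so the running mean of the 1-bit
    output tracks the requested value.
    """
    acc = 0.0
    bits = []
    for _ in range(n_samples):
        acc += duty
        if acc >= 1.0:
            acc -= 1.0
            bits.append(1)
        else:
            bits.append(0)
    return bits


if __name__ == "__main__":
    print(f"shift per ring: {per_ring_shift_kelvin():.1f} K")
    stream = delta_sigma_bits(0.3, 1000)
    print(f"mean heater drive: {sum(stream) / len(stream):.3f}")
```

A 1-bit stream like this only needs a switch as the heater driver, and the thermal time constant of the ring averages the pulses, which fits the observation that the tuning control overhead is negligible.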
Current-sensing optical data receiver
Receiver detects photo current
50fJ/b, µA sensitivities, 3-5Gb/s
Georgas ESSCIRC 2011, JSSC 2012
Optical modulator design
Shainline, Popovic
Carrier-injection device
at 1550nm
• Extinction ratio 19dB
• 45GHz 3dB optical bw
at 1280nm
• Extinction ratio 9dB
• 60GHz 3dB optical bw
Optical modulator – electrical tests

Carrier-lifetime 2-3ns

200MHz electrical bandwidth
Diffusion time constant affected by
- Recombination time
- Drift conditions
Modulator driver sub-bit pre-emphasis

Partial forward bias at 0-bit key to fast operation
Modulator driver heads

Split-supply used for sub-bit pre-emphasis

Use core and I/O voltage – no regulators
Fastest modulation in 45nm process




2.5Gb/s modulation
1.2pJ/bit
3dB insertion loss
3dB extinction ratio
Moss ISSCC 2013
Power and pins required for 10 TFlop/s
80 Tb/s sustained bandwidth, assuming 1 B/Flop
[Figure: total memory channel power (W) vs. number of socket pins required for memory channels (log scale), comparing DDR3-1333 4GB, DDR4-2667 8GB, GDDR5, Mobile LPDDR2-1066, Mobile LPDDRX-1666, Mobile LPDDRX 2017, HMC-Gen1, HMC-Gen2, PIM, and POEM Phase 1, Phase 2 and Post-phase 2]
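The 80 Tb/s figure follows directly from the compute rate and the 1 B/Flop assumption, and the same arithmetic gives a feel for the pin problem. In the sketch below, the 20 Gb/s per-link rate is the one quoted earlier in the talk, and the two-pins-per-link (differential signaling) factor is an assumption added for illustration; it does not account for power pins or for any specific memory standard in the figure.

```python
import math


def required_bandwidth_bps(flops_per_s, bytes_per_flop=1):
    """Sustained memory bandwidth in bits per second."""
    return flops_per_s * bytes_per_flop * 8


def links_needed(bandwidth_bps, link_rate_bps):
    """Number of links at a given per-link data rate, rounded up."""
    return math.ceil(bandwidth_bps / link_rate_bps)


if __name__ == "__main__":
    bandwidth = required_bandwidth_bps(10 * 10**12)   # 10 TFlop/s at 1 B/Flop
    links = links_needed(bandwidth, 20 * 10**9)       # 20 Gb/s per link
    pins_per_link = 2                                 # assumed differential signaling
    print(f"bandwidth: {bandwidth / 1e12:.0f} Tb/s")
    print(f"links: {links}, signal pins: {links * pins_per_link}")
```

Thousands of electrical signal pins for memory alone is what makes the pin-count axis of the figure the binding constraint.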
Improving computation efficiency
Energy-efficient computation and communication
CMOS: need cross-cut approach to keep scaling performance
Post-CMOS: need cross-cut approach to guide new devices/systems
[Figure: the cross-cut design stack shown earlier, now with NEMS relay added to the interconnect and switch technology layer alongside Cu, MOSFET and Si-Photonics]
Nano-electro-mechanical (NEM) relays
Joint work with T-J. King Liu, E. Alon and D. Markovic (UCB, UCLA)
[Figure: relay schematic and A-A' cross-section, labeled Gate, Gate Oxide, Body, Source, Drain and Channel; dimension labels 30 µm, 27.5 µm and 90 nm]

Nearly ideal switching characteristics:
- Low on-state resistance (Ron <10kΩ)
- Infinite off-state resistance -> zero off-state leakage
But, relatively slow actuation
Why not use relays to compute?
- Need to compare at block level
[Figure: example block. CMOS: 4 gate delays. NEMS: 12 relays, 1 mechanical delay]

Delay Comparison vs. CMOS
- Single mechanical delay vs. several electrical gate delays
- For reasonable load, NEMS delay unaffected by fan-out/fan-in
Area Comparison vs. CMOS
- Larger individual devices
- But often need fewer devices to implement same function
F. Chen et al., “Integrated Circuit Design with NEM Relays,” ICCAD 2008
Scaled NEMS vs. CMOS adders
[Figure: energy/op vs. delay/op across Vdd, annotated 9x (energy) and 10x (delay)]

Compare vs. Sklansky CMOS adder*
- 90nm technology
- 30x less capacitance
  - Lower device Cg, Cd
  - Fewer devices
- 2.4x lower Vdd
- No leakage energy
For similar area: >9x lower E/op, >10x greater delay
Scaled relays limited by contact surface energy
- 2aJ for 90nm litho, 50x better than 90nm CMOS
*D. Patil et al., “Robust Energy-Efficient Adder Topologies,” in Proc. 18th IEEE Symp. on Computer Arithmetic (ARITH'07).
Contact resistance
- Feedback from system level
[Figure: energy/op vs. delay/op across Vdd & CL]

Very low contact R not critical
Good news for reliability…
Can build test platforms that work
CLICKR technology development platform:
NEM relay-based circuits
ISSCC 2010 – TD Award
F. Chen et al, ISSCC2010
M. Spencer et al, JSSC Jan’11
Towards more complex designs
[Figure: relay circuit schematics (a) to (d) with Kill and Generate signals, inputs A0 to A6 and outputs Y0 to Y2, with die photos]
[Figure: energy/op (fJ) vs. delay (ns) for 16-bit multipliers: OTCT (90nm), Dadda/HC (45nm), scaled MEM relay, and 16X parallel]
Energy-benefit preserved even in more complex functions
Multiplier building block: 7:3 compressor
98 relays: largest working relay circuit to date
Fariborzi ASSCC 2011
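For readers unfamiliar with the block, a 7:3 compressor takes seven input bits of equal weight and outputs their count as a 3-bit number (Y2 Y1 Y0). The sketch below gives a behavioral version and a conventional four-full-adder decomposition, then checks them against each other over all 128 inputs. The full-adder structure is a standard CMOS-style construction used here only to define the function; it is not the 98-relay topology from the cited work, and the function names are hypothetical.

```python
from itertools import product


def compressor_7_3(bits):
    """Behavioral 7:3 compressor: return (Y2, Y1, Y0), the count of ones in 7 bits."""
    if len(bits) != 7:
        raise ValueError("a 7:3 compressor takes exactly 7 input bits")
    count = sum(bits)
    return (count >> 2) & 1, (count >> 1) & 1, count & 1


def full_adder(a, b, c):
    """Return (sum, carry) of three 1-bit inputs."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def compressor_7_3_adders(bits):
    """Same function built from four full adders (illustrative decomposition)."""
    a0, a1, a2, a3, a4, a5, a6 = bits
    s1, c1 = full_adder(a0, a1, a2)      # weight-1 sum, weight-2 carry
    s2, c2 = full_adder(a3, a4, a5)
    y0, c3 = full_adder(s1, s2, a6)      # final weight-1 bit
    y1, y2 = full_adder(c1, c2, c3)      # weight-2 bit and weight-4 carry
    return y2, y1, y0


if __name__ == "__main__":
    for bits in product((0, 1), repeat=7):
        assert compressor_7_3(bits) == compressor_7_3_adders(bits)
    print(compressor_7_3((1, 0, 1, 1, 0, 1, 1)))
```

Compressors like this are the column-reduction elements of a multiplier's partial-product tree, which is why the block serves as a multiplier building block.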
NEM Relay VLSI design infrastructure
[Figure: design flow. Device: P-cell and Spectre Verilog-A model. Custom path: schematic and layout. Digital path: Verilog, logic synthesis, place & route. Both paths end in LVS and DRC]


Verilog-A model and Logic Synthesis created for NEMS technology
The flow supports multiple device designs and foundries
Toward full systems - NEM Relay scaling
UC Berkeley: 1 µm litho
SEMATECH: 0.25 µm litho
[Figure: 9mm x 9mm die with test devices, 8-bit adders, 4-bit and 2-bit adders, 2-bit accumulator, SRAM, flip-flops, DRAMs, 7:3 compressor, 4-bit DAC, 4-bit ADCs and oscillators]
1st prototype: 120 µm x 150 µm
Scaled relay: 15 µm x 15 µm
Toward full systems - NEM Relay scaling
[Figure: relay circuit with 46 6T relays, inputs A0 to A6 and outputs Y0 to Y2, built from scaled relays (15 µm x 15 µm) in the SEMATECH 0.25 µm litho process]
[Figure: measured Y0, Y1 and Y2 outputs (V) vs. time (s) for each input code; output code vs. input code, test result compared with correct result]
World’s smallest relay multiplier circuit
Summary

Cross-layer modeling and design key to continued system performance scaling
- Fast design-space exploration
- Feedback to all layers of design hierarchy

Building early technology development platforms
- Feedback to device and circuit designers
- Spearhead adoption

EOS Platform designed for multi-project wafer runs
- Best end-of-line passives in sub-100nm process (3-4dB/cm loss)
- 50 fJ/b receivers with µA sensitivities
- Record high tuning efficiency with undercut ~ 25µW/K
- First modulation demonstrated in 45nm process
- Ported to bulk Micron process

CLICKR Platform designed for multiple foundries and devices
- Energy-gains preserved for larger blocks
- Largest working relay circuits demonstrated
- Designs moving toward scaled devices and full VLSI systems