A 9GHz Injection Locked Loop Optical Clock
Receiver in 32-nm CMOS
by
Jonathan Chung Leu
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2010
© Massachusetts Institute of Technology 2010. All rights reserved.
Author
Department of Electrical Engineering and Computer Science
August 16, 2010
Certified by
Vladimir Stojanović
Associate Professor
Thesis Supervisor
Accepted by
Terry P. Orlando
Chairman, Department Committee on Graduate Theses
A 9GHz Injection Locked Loop Optical Clock Receiver in
32-nm CMOS
by
Jonathan Chung Leu
Submitted to the Department of Electrical Engineering and Computer Science
on August 16, 2010, in partial fulfillment of the
requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
Abstract
The bottleneck of multi-core processor performance will be the I/O, for both on-chip core-to-core I/O and off-chip core-to-memory I/O. Integrated silicon photonics can
potentially provide high-bandwidth, low-power signal and clock distribution for multi-core processors by exploiting wavelength-division multiplexing. This thesis presents
the technology environment of the monolithic optical/electrical chip, and then focuses
on what an optical method would look like for both a source-synchronous link and
on-chip global clock distribution. The injection-locked loop clock receiver that
suits this architecture breaks the bandwidth/sensitivity tradeoff, and a self-adjusting
mechanism is added to increase robustness. The simulated receiver sensitivity is -14dBm at 9GHz, consuming 77.14μW and generating jitter within 0.15ps when locked
onto a mode-locked laser clock source. The chip infrastructure and testing procedures
are then presented, and lastly a truly integrated optical-electrical design flow is shown
as well.
Thesis Supervisor: Vladimir Stojanović
Title: Associate Professor
Acknowledgments
thanks to MIT, prof, lab mates, CBCGB/COM, family and friends
Contents
1 Introduction
2 Photonic Technology
2.1 Laser Source, Vertical Couplers and Optical Waveguides
2.2 Optical Modulators
2.3 Photodiodes
2.4 Optical Pulse Source
2.5 Summary
3 Clocking
3.1 Electrical Clocking for Links
3.2 Optical Clocking for Links
3.3 Electrical Clocking for Synchronous Logic
3.4 Optical Clocking for Synchronous Logic
3.5 Goals of an Optical Clock Receiver
3.6 Summary
4 Optical Clock Receiver Circuit Design
4.1 Injection Locked Loop
4.1.1 Simple Integrating Receiver
4.1.2 Integrating Receiver with Tunable Reset
4.1.3 Integrating Receiver with Precharge - Injection-Locked
4.2 Auto-locking design
4.3 Summary
5 EOS2 32nm Test Chip
5.1 EOS2 Chip Overview
5.2 EOS2 Cell
5.3 EOS2 Testing
5.3.1 Clock Receiver Testing
5.3.2 Chip Testing
5.4 Summary
6 Integrated CMOS/Photonic Chip Design
6.1 Parametrized photonic layout generation and optimization
6.2 Design Automation of Electric and Photonic Integrated Circuits
6.3 Simple design check using LVS and DRC
6.4 Summary
7 Conclusion and Future Work
7.1 Conclusion
7.2 Future Work
List of Figures
1-1 An intra-chip integrated photonic link implementing WDM. An inter-chip link may be implemented similarly. [1]
2-1 Cross-section and SEM image of the waveguides and air gap. Note that the air gap does not reach the transistors. [1, 2]
2-2 Illustration of ring modulator. Note that diodes were only implemented on the straight sections of the ring due to fabrication constraints. [3]
2-3 SEM image of a modulator. On the sides are vertical couplers for the input/output laser light, and the dark rectangles close to the ring are both PIN diodes used for modulating the ring. [1]
3-1 Conventional source-synchronous unidirectional and differential point-to-point parallel link, where the clock is sent along with data for easy receiver timing recovery. [4]
3-2 Inter-signal skew reduces timing margins at receiver. As inter-signal skew increases, samples are taken farther away from widest opening of eye, lowering the SNR, resulting in a higher BER. [4]
3-3 A simple example of a point-to-point optical WDM link. The rings on the transmitter side are modulators driven by the driver circuits, and the rings on the right couple the selected wavelength into photodiodes that are connected to individual receivers.
3-4 Simple example of a synchronous pipeline datapath. [5]
3-5 Diagram of purely electrical clock distribution network.
3-6 Diagram of purely optical clock distribution network.
3-7 Diagram of hybrid optical and electrical clock distribution network.
4-1 Schematic of a simple Trans-Impedance Amplifier.
4-2 Schematic of an amplifier of finite gain A with feedback. R is the feedback resistor, C is the input capacitance, I is the input current.
4-3 Transient plots of input voltage for when R is finite and when R is infinite.
4-4 Schematic of an amplifier of finite gain A without feedback. C is the input capacitance, I is the input current.
4-5 Diagram explaining the effect of voltage noise on jitter. A steep slope has less jitter for the same input noise.
4-6 Schematic of simple clock receiver with reset NMOS.
4-7 Transient waveforms at node in (top left) and out (top right) with respect to the switching threshold of the first inverter Vm and optical pulses.
4-8 Schematic of simple clock receiver with a tunable reset NMOS stack.
4-9 Transient waveforms at node in (top left) and out (top right) with respect to the switching threshold of the first inverter Vm and optical pulses.
4-10 Schematic of tunable reset NMOS stack.
4-11 Schematic of manually tuned clock receiver circuit with PMOS precharge branch.
4-12 Transient waveforms at node in (top left) and out (top right) with respect to the switching threshold of the first inverter Vm and optical pulses.
4-13 Schematic of auto-locking clock receiver circuit. The node 'Vauto' connects the capacitor to the gates of the tuning transistors, adjusting their strength.
4-14 Simulated transient waveform of in, output (out4), and Vauto (vm-prime). Before 60ns, the reset NMOS is too weak, so the input node is constantly above Vm, hence out4 is stuck at Vdd. Vauto slowly rises until lock is achieved at around 60ns and stabilizes at about 65ns.
4-15 Schematic of auto-locking clock receiver circuit with dark current compensation and independently tunable current sources for duty cycle correction.
5-1 Layout view of EOS2 test chip. Sandwiched between loss measurement structures and individually testable optical devices are the circuit/photonic WDM rows. The chip dimensions are 2mm by 2mm.
5-2 Layout view of circuit cell of EOS2 with photonic devices. Note the relative size of the vertical couplers, rings and spacing between photonics and circuits. Also note the row of etch vias below the photonics. The circuit cell dimensions are 220μm by 40μm.
5-3 Block diagram of a circuit cell of EOS2. Analog blocks are connected to photonic devices below the cell, and the high speed support and low speed configuration registers are above.
5-4 Block diagram of the high speed testing infrastructure for the clock receiver circuit. The number of bits in each counter are actually 20. (modified to 40 bits in later chip for better accuracy)
5-5 Here we write the value "New" to Cell 2, while preserving the values of the other cells. This is done by feeding back the last bit of the scanchain into the input, and only feeding in new data at the right time.
6-1 The waveguide on the left is drawn with polygons with many vertices; the waveguide on the right is drawn with just rectangles.
6-2 Integrated electric and photonic circuit design flow.
6-3 Layout view of example ring modulator.
6-4 Abstract view of example ring modulator. Note that lower level metal features such as metal fill have been blocked out, since port connections are made at higher levels.
6-5 Abbreviated macro description in LEF format of example ring modulator.
6-6 Layout of the same circuit as in Figure ??.
6-7 Schematic of circuit with both photonic and electrical components.
List of Tables
3.1 Power Efficiency and Mismatch relative to Optical H-Tree Depth
Chapter 1
Introduction
The emergence of multi-core processors [29] [30] [31] [32] creates an increasingly complex routing environment for long interconnects. With traditional copper interconnects, area is expensive, as each wire is only able to carry one data signal at maximum
bandwidth. Although the I/O circuits are improving in bandwidth as technology
scales, the channels are not. Since each core must be able to communicate with a
set of other cores and off-chip memory, the total bandwidth required is harder to
reach under the overall die power constraint, set by power delivery and cooling requirements. If copper wires are not replaced, we will quickly reach a point where it
is simply infeasible to add more cores.
To address this problem, this work investigates the shift to an optical paradigm,
where by exploiting wavelength-division multiplexing (WDM), multiple channels of
data can be transmitted along a single optical waveguide without interference. This
allows a much greater bandwidth density for both on-chip and off-chip channels. Also,
with low-loss waveguides, which are theoretically possible, the total power consumption
for interconnects can be greatly reduced. WDM also permits architectures with
fewer routers and hence lower latency.
A cartoon of an integrated photonic intra-chip WDM link is shown in Figure 1-1 [1].
The laser light from an external laser source is coupled into chip A, each channel is
modulated separately, and then sent to another chip to receive the data. Besides data
transmission, it is possible to have an energy efficient and small area clock receiver
with a single wavelength modulated clock signal, or a low power, low skew/jitter clock
distribution network by using a mode-locked laser or self-pulsed laser to send short
optical pulses into an optical clock distribution network. This alleviates the need for
PLLs or DLLs, saving area and power. The focus of this thesis will be on optical
clocking, and comparing it to a traditional clocking scheme.
[Figure 1-1 diagram labels: External Laser Source, Chip A, Chip B, Coupler, Waveguide, Ring Modulator, Ring Filter, Photodetector, Fiber]
Figure 1-1: An intra-chip integrated photonic link implementing WDM. An inter-chip
link may be implemented similarly. [1]
An important feature of this project is that all our photonic structures will be made
using standard CMOS processes. Since the photonic devices and their control circuits
are only tens of microns apart, this monolithic integration reduces the area and energy
costs of interfacing electrical and optical components. Also, monolithic integration
is practical in terms of cost and manufacturability for processor and memory
manufacturers, but at the same time constrains the materials and dimensions that
we are able to use. We believe that this is a tradeoff that must be made for silicon
photonics to be implemented in mainstream processor design. A first integrated
photonics chip was designed and manufactured in Texas Instruments' 65nm bulk
CMOS [1], and the designs in this thesis were made in a 32nm bulk CMOS process.
In the following sections, we will go over the key points of our photonic technology
and describe the characteristics of the photonic devices used in this project. Next we
will look at how clocking is done traditionally, then point out the reasons why optical
clocking makes sense. The optical clock receiver circuit will be explained in detail,
which is the main focus of this thesis. Finally, a quick summary of the infrastructure
for on chip testing and automatic layout will be given.
Chapter 2
Photonic Technology
For simplicity of integration, and to be compatible with the majority of CMOS processes, we chose to design our chip with a monolithic CMOS process, without flip-chip
bonding a III-V wafer or adding additional layers [1]. Having this constraint narrows
the design space, but as mentioned in the introduction, this allows the photonics to
be imported into a microprocessor design flow without an overhaul of the existing
technology.
2.1 Laser Source, Vertical Couplers and Optical Waveguides
Due to the indirect bandgap of silicon, there are no known high-efficiency lasers in
silicon, so all the silicon-photonic technologies proposed here use off-chip laser sources.
To couple the light into the waveguides from an optical fiber, we use vertical couplers,
which are more feasible from a packaging standpoint than edge coupling.
Vertical couplers have a Bragg-grating-like structure in a roughly 10μm by 10μm
area, which then tapers into a single-mode waveguide with a width of approximately
half a micron.
Previously, on-chip photonic waveguides have been made using the silicon body as
a core with custom thick buried oxide (BOX) as cladding in a silicon-on-insulator
(SOI) process [6], or by depositing silicon-nitride [7] on top of the interconnect stack.
These methods either need significant process changes to a standard CMOS flow,
or have poor heat dissipation properties, which may cause problems as high-power
density is common in manycore computation. Given that compatibility with mainstream
processes is a priority, we have chosen to use bulk CMOS, with the polysilicon layer as
the waveguide. Since the shallow-trench oxide beneath the polysilicon is too thin to
form an effective cladding and shield the core from optical mode leakage losses into the
silicon substrate, we have developed a novel self-aligned post-processing procedure,
as shown in Figure 2-1, to etch away the silicon substrate underneath the waveguide,
forming an air gap [2]. With adequate layout spacing and measured etching, this
post-processing should not affect the transistors.
[Figure 2-1 diagram labels: Backend Dielectric, Silicon Substrate, Polysilicon Transistor Gates, Shallow Trench Isolation, Polysilicon Waveguides, Etch Hole, Air Gap]
Figure 2-1: Cross-section and SEM image of the waveguides and air gap. Note that
the air gap does not reach the transistors. [1, 2]
2.2 Optical Modulators
Data is transmitted by modulating the continuous wave laser light, and this is done by
changing the resonance frequency of a ring modulator that is near the waveguide. Due
to their smaller size, ring modulators have a smaller capacitance, which results in less
power consumption compared to Mach-Zehnder modulators. As shown in Figure 2-2,
the intrinsic poly-silicon ring is sandwiched by N-type and P-type poly-silicon at each
straight section, forming lateral PIN diodes, so we can inject free carriers into the
ring. A scanning electron microscope (SEM) photo of the ring modulator is shown
in Figure 2-3. By changing the amount of free carriers in the ring, we can change
its effective index of refraction, which in turn changes the resonance frequency. The
continuous wave laser light that passes by on a waveguide will get coupled into the ring
and absorbed if the resonance frequency matches the frequency of the light. Multiple
transmitters can modulate light at different frequencies, so we can transmit multiple
channels of data simultaneously on the same waveguide. The temperature of the ring
also affects the index of refraction, so in-plane polysilicon heaters are placed to control
the temperature. Because a large current range is needed to guarantee wavelength isolation,
the heaters are directly connected to pads and driven by current DACs on the test
board. To control the temperature, the resistance of the heater is monitored
by measuring the voltage drop across it with an off-chip ADC, and in conjunction
with the temperature coefficient of polysilicon, we can derive the temperature of the
heater.
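As a rough illustration of the temperature read-back described above, the sketch below assumes a linear resistance model R(T) = R0 (1 + α (T − T0)). The function name, the reference resistance, the reference temperature and the coefficient α are placeholders chosen for illustration; they are not measured values from this chip.

```python
def heater_temperature(v_drop, i_drive, r_ref, t_ref, alpha):
    """Estimate heater temperature from the measured voltage drop.

    Assumes a linear model R(T) = r_ref * (1 + alpha * (T - t_ref)),
    where alpha is the temperature coefficient of the polysilicon heater.
    """
    if i_drive <= 0:
        raise ValueError("drive current must be positive")
    resistance = v_drop / i_drive  # Ohm's law: ADC voltage over DAC current
    return t_ref + (resistance / r_ref - 1.0) / alpha


# Hypothetical inputs: 1 kOhm at 25 C, alpha = 1e-3 per kelvin, 2 mA drive.
t_est = heater_temperature(v_drop=2.1, i_drive=2e-3,
                           r_ref=1000.0, t_ref=25.0, alpha=1e-3)
print(f"estimated heater temperature: {t_est:.1f} C")
```

With these made-up inputs the heater resistance is 5% above its reference value, which the linear model maps to a 50 K rise.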
[Figure 2-2 diagram labels: p+ doped poly, n+ doped poly, undoped poly waveguide, silicided poly heater resistor, Drop, Through]
Figure 2-2: Illustration of ring modulator. Note that diodes were only implemented
on the straight sections of the ring due to fabrication constraints. [3]
Figure 2-3: SEM image of a modulator. On the sides are vertical couplers for the
input/output laser light, and the dark rectangles close to the ring are both PIN diodes
used for modulating the ring. [1]
2.3 Photodiodes
At the receiving end, a ring is tuned to the channel it wants to receive, and the
light at that frequency is coupled into the photodiode of the receiver. Photodiodes
for the receivers are made with SiGe, which is used in modern CMOS processes
to increase mobility in PMOS transistors by introducing strain. The percentage of
Ge determines the bandgap of the material, so with more Ge the photodiode will be
able to absorb longer wavelengths. In this process we are assuming the mole-fraction
of Ge in the SiGe alloy is approximately 30%, which corresponds to a wavelength
of about 1200nm. When the light is absorbed, an electron-hole pair is created; the carriers
are swept out by the electric field of the P-N junction and picked up by the
receiver circuits. Since these small photodiodes are integrated on chip, the wires
between the diodes and the receiver circuits can be very short, on the order of tens of
micrometers. Compared to connecting to an off-chip photodiode through a bond pad,
this reduces the parasitic capacitance of the connection by two orders of magnitude.
Another benefit of integrated photonics is that we are able to have the photodiode
itself be a ring resonator coupled to the waveguide, so theoretically we can achieve
a reasonable responsivity with a relatively small photodiode, which corresponds to a
smaller parasitic capacitance for the diode itself.
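To make the link between bandgap and absorption wavelength concrete, the sketch below applies the standard photon-energy relation E = hc/λ. It only converts between wavelength and energy; it does not model how the Ge mole fraction sets the SiGe bandgap, and the function names are illustrative.

```python
PLANCK_EV_S = 4.135667696e-15    # Planck constant in eV*s
LIGHT_SPEED_M_S = 299_792_458.0  # speed of light in m/s


def photon_energy_ev(wavelength_nm):
    """Photon energy in eV for a given free-space wavelength in nm."""
    return PLANCK_EV_S * LIGHT_SPEED_M_S / (wavelength_nm * 1e-9)


def cutoff_wavelength_nm(bandgap_ev):
    """Longest wavelength (nm) a material with this bandgap can absorb."""
    return PLANCK_EV_S * LIGHT_SPEED_M_S / bandgap_ev * 1e9


# The 1200nm figure quoted in the text corresponds to roughly 1.03 eV.
print(f"{photon_energy_ev(1200.0):.3f} eV")
```

A smaller bandgap gives a longer cutoff wavelength, which is why a higher Ge fraction extends absorption to longer wavelengths.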
2.4 Optical Pulse Source
So far, clock transmission or clock recovery has been achieved by several methods based
on self-pulsating lasers [17] [18] [19] [25], mode-locked lasers [21] [22], or optoelectronic
phase-locked loop (PLL) technologies [23] [24]. The complexity and area overhead
of having all-optical phase comparators on a multicore processor is too great, so
optoelectronic PLLs are better suited for intra-system links with data rates of 100Gb/s.
This is also the case with using self-pulsating lasers to recover the clock from a data
stream. We wish to leave all the complexity at the laser source, and have the photonics
on chip be as small as possible. A mode locked laser produces a train of short pulses of
light by inducing a fixed phase relationship between the modes of the laser's resonant
cavity. A laser may have a number of modes of equal frequency separation, depending
on the gain medium bandwidth, and the cavity length. By fixing the relative phase
between each mode, the output amplitudes will periodically all constructively interfere
with one another, producing a pulse of light with high intensity. These pulses will be
separated in time by the round trip time T = 2L/c, where L is the length of the laser
cavity. Semiconductor mode-locked lasers are attractive candidates for generating
picosecond pulses at high repetition rates (10GHz) due to their small size, inherent
robustness and possibility of integration with other semiconductor elements [21]. It
has been shown that a monolithic semiconductor laser can achieve a 10GHz repetition
rate with a 6.6 picosecond pulse width and 50 to 570 femtosecond jitter, depending
on the existence of an RF electrical reference. Although many of the lasers discussed
are in the 1550nm wavelength range, which is unsuitable for our purposes, there
have been more recent efforts toward making high repetition mode-locked lasers with
vertical cavity surface emitting lasers (VCSELs) [26], or quantum-dot GaAs/InAs
lasers [27] that are more suitable for our wavelength ranges.
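The pulse formation described in this section can be checked numerically. The sketch below is a toy model, not a laser simulation: it assumes equal-amplitude modes, zero relative phase, and a refractive index of 1 so that T = 2L/c applies directly. The mode count of 20 is arbitrary.

```python
import cmath
import math

LIGHT_SPEED_M_S = 299_792_458.0


def round_trip_time(cavity_length_m):
    """T = 2L/c, as in the text (refractive index taken as 1)."""
    return 2.0 * cavity_length_m / LIGHT_SPEED_M_S


def intensity(t, n_modes, mode_spacing_hz):
    """Squared magnitude of the sum of phase-locked, equal-amplitude modes.

    Only the envelope is modelled: mode k sits k * mode_spacing_hz above
    the lowest mode, and all modes share the same (zero) phase.
    """
    field = sum(cmath.exp(2j * math.pi * k * mode_spacing_hz * t)
                for k in range(n_modes))
    return abs(field) ** 2


cavity_length = LIGHT_SPEED_M_S / (2 * 10e9)  # about 15 mm for 10 GHz
period = round_trip_time(cavity_length)       # 100 ps between pulses
peak = intensity(period, n_modes=20, mode_spacing_hz=1 / period)
between = intensity(period / 2, n_modes=20, mode_spacing_hz=1 / period)
print(f"period = {period * 1e12:.1f} ps, peak = {peak:.1f}, midway = {between:.2e}")
```

At multiples of the round trip time all 20 modes add in phase, so the intensity is 20² times that of a single mode, while halfway between pulses the modes cancel.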
2.5 Summary
In this chapter, we cover the photonic components needed to make a WDM link,
and the benefits or difficulties of designing each component in a monolithic CMOS
process. Different devices are suitable for different architectures, so we try to explore
the trade-offs between design parameters for each device. Lastly, a few
candidates for the optical pulse source are presented and discussed.
Chapter 3
Clocking
All synchronous electrical systems require a clock signal, whether it is for synchronizing communication channels in high speed links or for synchronizing functional
blocks in a microprocessor. The important figures of merit of a clock distribution scheme are
skew, jitter, rise/fall time, and power consumption. Skew occurs when the same clock
signal arrives at different nodes at different times, although it was designed to
arrive simultaneously. Jitter is the random fluctuation of the clock edge. Both
limit the maximum operating frequency of the chip. We will discuss how clocking is
done electrically in both links and synchronous logic, and how using optics will be
beneficial.
3.1 Electrical Clocking for Links
High speed links usually transmit a high speed clock synchronously with a parallel
data bus, as shown in Figure 3-1. The clock receiver uses a phase locked loop (PLL) or
delay locked loop (DLL) to capture the edge information, and generate a sampling
clock edge aligned with the eye opening of the input data stream. Having the data
receivers sample at the phase where the eye opening is widest provides the largest
signal to noise ratio (SNR), resulting in the lowest bit error rate (BER). In the case
of using a PLL, a lower frequency clock signal can be used, saving the dynamic power
consumed on that wire.
[Figure 3-1 diagram labels: refClk, data[0], data[n], TxClk, RxClk, timing margins]
Figure 3-1: Conventional source-synchronous unidirectional and differential point-to-point parallel link, where the clock is sent along with data for easy receiver timing
recovery. [4]
To see how skew and jitter affect this link, a simple example is shown in Figure
3-2. Due to interconnect mismatches and possibly circuit mismatches, the timing for
each channel might shift relative to the reference clock, so samples are taken away
from the widest opening of the eye, lowering the SNR and resulting in a higher BER.
Fortunately, this phase error is static, so there are circuit techniques to compensate
for it. One way is to have an adjustable delay chain between the clock receiver output
and the clock input of the data receiver, and to compensate for the skew after a start-up calibration sequence. The adjustable delay chain may be realized by adjusting
the delay for each stage, activating a different number of stages [33] [34],
or by using phase interpolation and selecting different combinations of phases [35].
Jitter is a random phase variation in a signal, caused by power supply noise and
thermal noise in the transmitter and receiver, and by interference coupled into the channels.
This random cycle to cycle variation in clock edges shrinks the eye opening, lowering
the SNR, resulting in a higher BER.
The performance of the link depends on the receiver timing recovery, and as mentioned in the introduction, the channels do not improve with technology. Also, the
large area and power consumption of low jitter PLLs and DLLs add to the difficulty
of increasing the number of endpoints in a communication network and hence the
number of cores.
[Figure 3-2 diagram labels: Transmitter Timing, Receiver Timing, refClk, data[0], data[n], TxClk, RxClk, timing margins]
Figure 3-2: Inter-signal skew reduces timing margins at receiver. As inter-signal skew
increases, samples are taken farther away from widest opening of eye, lowering the
SNR, resulting in a higher BER. [4]
3.2 Optical Clocking for Links
[Figure 3-3 diagram labels: Transmitter, Receiver, laser, refClk, RxClk, data[0], data[n]]
Figure 3-3: A simple example of a point-to-point optical WDM link. The rings on
the transmitter side are modulators driven by the driver circuits, and the rings on the
right couple the selected wavelength into photodiodes that are connected to individual
receivers.
A simple example of a point-to-point optical WDM source-forwarded link is shown
in Figure 3-3. The photonic structure is the same as the WDM link shown in Figure
1-1, with driver and receiver circuits added. Here we use one of the wavelengths
for clock transmission, and the received clock signal triggers the samplers of the
other data receivers. Using optics provides several benefits. First, each waveguide can carry
multiple wavelengths without crosstalk or power rail injected noise. Second, with
low-loss waveguides, the signals will not require re-buffering, which reduces the jitter
injected. This means that the receiver does not need a PLL or DLL to provide
a stable clock, greatly reducing the power and area overhead. The design of a simple,
low power clock receiver will be the focus of the next chapter.
3.3 Electrical Clocking for Synchronous Logic
[Figure 3-4 diagram labels: Data Register R1, Combinational Logic, Data Register R2, D, Q, CLK]
Figure 3-4: Simple example of a synchronous pipeline datapath. [5]
We have already discussed how skew and jitter affect a high speed link. To
see how they affect a synchronous system, a simple example is shown in Figure 3-4.
At the initial rising clock edge, the value of Q is copied from D and held at that
value after a delay of $t_{C-Q}$ for both data registers. This new value of Q causes the
combinational logic to change its output state, and it settles to a final value after a
propagation delay of $t_{logic}$. Then, after the output value of the combinational
logic has been held for at least the setup time $t_{setup}$ of data register R2, the value of D will be
transferred to Q at the next rising clock edge.
So to ensure that data register R2 stores the result correctly, the duration between
the two rising clock edges must be at least the sum of these three t values, or:
$$T_{clock} = \frac{1}{f_{clock}} \geq t_{C-Q} + t_{logic} + t_{setup}$$
In the case of skew, we see that if the clock edge of R1 arrives earlier than that of R2, then
the logic has more time to settle, so it seems fine. However, in general there are data
paths in both directions between clock grids, so in the other direction the skew eats
into the propagation time. Thus the clock frequency must be slowed down enough to
accommodate the worst case skew. In the case of jitter, we can see that having a clock
edge arriving early immediately after a clock edge arriving late squeezes the propagation time, so the clock frequency must be slowed down enough to accommodate the
worst case jitter as well.
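The timing budget above can be turned into a small calculation. The sketch below assumes the simplest possible model, in which worst-case skew and jitter are added directly to the required clock period; the delay values are hypothetical and chosen only to give round numbers.

```python
def min_clock_period(t_cq, t_logic, t_setup, skew=0.0, jitter=0.0):
    """Smallest clock period (seconds) that satisfies the setup constraint.

    Assumption: worst-case skew and jitter simply add to the
    t_cq + t_logic + t_setup budget.
    """
    return t_cq + t_logic + t_setup + skew + jitter


def max_clock_frequency(t_cq, t_logic, t_setup, skew=0.0, jitter=0.0):
    """Highest clock frequency (Hz) allowed by the same budget."""
    return 1.0 / min_clock_period(t_cq, t_logic, t_setup, skew, jitter)


# Hypothetical delays: 20 ps clock-to-Q, 70 ps logic, 10 ps setup.
f_ideal = max_clock_frequency(20e-12, 70e-12, 10e-12)
f_real = max_clock_frequency(20e-12, 70e-12, 10e-12,
                             skew=15e-12, jitter=10e-12)
print(f"ideal: {f_ideal / 1e9:.2f} GHz, with skew and jitter: {f_real / 1e9:.2f} GHz")
```

In this example 25 ps of combined skew and jitter lowers the achievable clock rate from 10 GHz to 8 GHz, which is why both must be kept small at high clock frequencies.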
Traditionally, electrical clocks are generated by PLLs, synchronized by DLLs, and
distributed on the chip with trees and/or grids of metal wires. Skew can be caused
by transistor mismatch in the routing, and jitter is mainly caused by power supply
noise. A simple illustration is shown in Figure 3-5. The jitter of clock networks in
processors has decreased over the years, and in a recent example [8], the 9-sigma jitter
is down to 100ps, with a skew of 15ps and 3W power consumption for the entire clock
network. Higher speed clocks require more repeaters in order to maintain reasonably
sharp rise/fall edges, which cause more skew, jitter and power consumption.
[Figure 3-5 diagram label: Lower frequency electrical clock]
Figure 3-5: Diagram of purely electrical clock distribution network.
Note that due to the large area and power consumption of the PLLs, they are
implemented at a high level. As an example, 8 PLLs are used on a 503mm² die in [8].
3.4 Optical Clocking for Synchronous Logic
Besides the benefits of optical communication mentioned above, light may also be
used for clocking synchronous logic. The idea of optical clocking was first published
by J. W. Goodman in 1984 [9], and others [10, 11, 12, 13, 14] demonstrated different
optical clock distribution schemes. The most recent of these is receiverless clocking
[14], and although the two clock distribution schemes are different, some comparisons
are made between them, just to put this work in context.
One main motivation for optical clocking is the mode locked laser, which can have
picosecond pulses, 10GHz or higher repetition rates, and sub-picosecond jitter, as
demonstrated in [15]. Monolithically integrated mode-locked laser diodes with sub-picosecond jitter have been demonstrated as well, which makes having a computing
system with mode locked lasers possible [16].
optical receivers   power efficiency   power mismatch
1                   1.00               0%
2                   0.97               ±2%
4                   0.94               ±4%
8                   0.91               ±6%
16                  0.89               ±8%
32                  0.86               ±10%
64                  0.83               ±13%
128                 0.81               ±15%
256                 0.78               ±17%
Table 3.1: Power Efficiency and Mismatch relative to Optical H-Tree Depth
Since the laser source can have a high peak power, on the order of 100mW, we can
also easily branch the clock signal. A splitting loss of less than 3% with a ratio of
49:51 has been demonstrated in a process similar to standard CMOS processes [5].
This enables us to have an optical h-tree with a large fanout, and still maintain a
good power efficiency and low mismatch at the receiver end. The relationship between
power efficiency and mismatch relative to optical h-tree depth is shown in Table 3.1.
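The entries of Table 3.1 can be reproduced with a short calculation. The sketch below is only illustrative: it assumes that every split loses exactly 3% of the power (the text states "less than 3%") and that a 49:51 split moves the worst-case branch 2% away from the ideal 50% at every level. The function name and default arguments are our own.

```python
def h_tree_stats(depth, split_loss=0.03, split_ratio=0.51):
    """Return (receivers, efficiency, mismatch) for an optical H-tree of a given depth."""
    receivers = 2 ** depth
    # Each level of splitting keeps (1 - split_loss) of the optical power.
    efficiency = (1.0 - split_loss) ** depth
    # Worst case: one branch takes the 51% arm at every level, relative to an ideal 50%.
    mismatch = (split_ratio / 0.5) ** depth - 1.0
    return receivers, efficiency, mismatch


for d in range(9):
    n, eff, mis = h_tree_stats(d)
    print(f"{n:4d} receivers: efficiency {eff:.2f}, mismatch +/-{mis * 100:.0f}%")
```

Under these assumptions the efficiency falls as 0.97^depth and the mismatch grows as 1.02^depth - 1, so even 256 receivers keep roughly 78% of the source power.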
In the case where there is zero area and power overhead for clock receivers, since waveguide losses are very small, the optimal condition is to have an optical RX at every clock
grid, directly driving the grid wire capacitance and the gate capacitance of the static and
dynamic latches we wish to clock. This configuration is shown in Figure 3-6.
Figure 3-6: Diagram of purely optical clock distribution network.
Since the receiver has a finite area and power consumption overhead, however, the
optimal point will be somewhere in between, with both optical h-tree
and electrical buffering, as shown in Figure 3-7.
With simple receivers that can be small in area and power consumption, we can
optically split the clean clock signal further down the distribution tree before we
transfer it to an electrical signal. This corresponds to fewer stages of repeaters the
clock signal goes through after the receiver, which means less added jitter.
Please note that a clock source using a single modulated wavelength as used in the
links may be used for global clock distribution instead of a mode locked laser. This
alleviates the need for a separate waveguide, at the cost of more jitter.
Figure 3-7: Diagram of hybrid optical and electrical clock distribution network.
3.5 Goals of an Optical Clock Receiver
The simplest possible receiver would be the receiverless clocking scheme [14], in which
two photodiodes are connected in series from supply to ground, and by shining light
pulses at them alternately, a clock signal can be generated. The only circuitry between
the photodiodes and the clocked node are buffers, so a minimal amount of jitter is
added. Also, fewer circuits mean less circuit power consumption. However, since our design
uses on-chip waveguides, we cannot afford the area to have two waveguides leading
into two photodiodes at many places. Lastly, in order to swing the capacitance
of two photodiodes rail-to-rail, the quantum efficiency of the photodiode must be
sufficiently high or else the optical power required will be large. (The average
optical power is Pavg = C * VDD/(TB * R), where VDD is the supply voltage, C is the
node capacitance, R is the detector responsivity and TB is the bit period.) For a 5GHz
clock and a reasonable photodiode, with VDD = 1V, C = 20fF, R = 0.5A/W and TB = 100ps,
the required average optical power is 0.4mW, or about -4.0dBm. Unfortunately, the
responsivity of the SiGe photodiodes created in the CMOS flow may not be able
to reach 0.5A/W, so the overall power including optical power may not be optimal.
For these reasons, a more sensitive but still simple receiver is desired.
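The optical power estimate for the receiverless scheme can be checked numerically. The following is a minimal sketch of the formula quoted above, using the example values from the text; the function names are ours and are not part of any tool flow.

```python
import math


def required_optical_power(c_node, vdd, bit_period, responsivity):
    """Average optical power (W) needed to swing c_node by vdd once per bit period."""
    return c_node * vdd / (bit_period * responsivity)


def watts_to_dbm(p_watts):
    """Convert a power in watts to dBm (decibels relative to 1 mW)."""
    return 10.0 * math.log10(p_watts / 1e-3)


# Values from the text: C = 20fF, VDD = 1V, TB = 100ps, R = 0.5A/W.
p = required_optical_power(c_node=20e-15, vdd=1.0, bit_period=100e-12, responsivity=0.5)
print(f"{p * 1e3:.2f} mW, {watts_to_dbm(p):.2f} dBm")
```

Because the required power scales inversely with responsivity, a photodiode that reaches only half of the assumed 0.5A/W would need twice the optical power (3dB more).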
3.6 Summary
In this chapter we discuss the methods currently used for clocking in high speed
links and microprocessor chips. We discover that optical clocks can have much better
skew and jitter performance, due to the superior signal integrity of optics compared
to electrical wires. This allows us to use a simpler circuit without PLLs or DLLs,
greatly reducing the power and area overhead for a multi-core processor.
Chapter 4
Optical Clock Receiver Circuit Design
Now that the reasons for having a small receiver circuit to amplify the photodiode
signal are presented, let us first investigate the commonly used Trans-Impedance
Amplifier (TIA). In its simplest form, a single stage TIA can be implemented with
two transistors and a resistor, as shown in Figure 4-1.
Figure 4-1: Schematic of a simple Trans-Impedance Amplifier.
For simpler analysis, let us simplify the circuit further into the one shown in Figure
4-2.
Figure 4-2: Schematic of an amplifier of finite gain A with feedback. R is the feedback
resistor, C is the input capacitance, I is the input current.
With an amplifier of finite gain A, the output voltage is determined by the differential input. Since the input node is the dominant pole due to the large input
capacitance provided by the photodiode, we can determine the sensitivity of the circuit by looking at the slope of the input node voltage Vin. In this circuit, we can see
that initially, when Iin = 0, Vout = Vin = 0. If Iin is changed to some finite
value, all the current initially goes to charge up the input capacitance C, so Vin begins to
rise with a slope of Iin/C. Because of the feedback path, some, and eventually
all, of the current goes through the resistor; hence the voltage at the input settles at
(R * Iin)/(1 + A), the output settles at (-R * Iin * A)/(1 + A), and the voltage across
the resistor is R * Iin as it should be. This transient voltage plot is shown in Figure
4-3.
In the case where R is infinite, it is equivalent to removing the feedback resistor
and having an open circuit between input and output, as shown in Figure 4-4. Now all of
Iin will be directed toward charging up the capacitance, so no current will be "stolen"
by the resistor. Since the input current and input capacitance are determined by the
photodiode, the maximum slope that can be achieved is Iin/C.
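The two cases can be compared with a simple first-order model. The sketch below assumes an ideal inverting amplifier with instantaneous gain -A, so that the input node obeys C dVin/dt = Iin - Vin (1 + A)/R; the component values are hypothetical and chosen only for illustration.

```python
import math


def vin_with_feedback(t, i_in, r, c, a):
    """Input-node voltage at time t after a current step, with feedback resistor r."""
    tau = r * c / (1.0 + a)          # closed-loop time constant of the input node
    v_final = r * i_in / (1.0 + a)   # settling value quoted in the text
    return v_final * (1.0 - math.exp(-t / tau))


def vin_open_loop(t, i_in, c):
    """Input-node voltage with the resistor removed (R infinite): pure integration."""
    return i_in * t / c


# Hypothetical values, chosen only for illustration.
I_IN, R, C, A = 10e-6, 10e3, 20e-15, 4.0

for t_ps in (1, 10, 40, 200):
    t = t_ps * 1e-12
    print(f"t = {t_ps:3d} ps: R finite {vin_with_feedback(t, I_IN, R, C, A) * 1e3:6.2f} mV, "
          f"R infinite {vin_open_loop(t, I_IN, C) * 1e3:6.2f} mV")
```

Both curves start with the same slope Iin/C, but with finite R the slope decays as the resistor takes over the current, which is the behavior sketched in Figure 4-3.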
Figure 4-3: Transient plots of input voltage for when R is finite and when R is infinite.
(Curve labels: R infinite, with slope Iin/C; R finite, settling at R * Iin/(1 + A); horizontal axis: time.)
The slope is important because it determines the effect of voltage
noise on jitter. Voltage noise shifts the voltage points up and down by a random
amount, so the time at which the voltage crosses the switching threshold shifts left
and right in the time domain. With a steep slope, the same amount of voltage noise
corresponds to less jitter, as shown on the left of Figure 4-5.
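To first order, this relationship is simply the voltage noise divided by the slope at the threshold crossing. The sketch below illustrates it with hypothetical noise and slope values; it is a linearized estimate and not a result from the simulations in this work.

```python
def jitter_rms(noise_rms_v, slope_v_per_s):
    """First-order timing jitter caused by voltage noise at a threshold crossing."""
    if slope_v_per_s <= 0:
        raise ValueError("slope must be positive")
    return noise_rms_v / slope_v_per_s


NOISE = 1e-3  # hypothetical 1 mV rms input-referred voltage noise
for label, slope in (("steep", 1.0e9), ("shallow", 0.25e9)):  # slopes in V/s
    print(f"{label:8s}: {jitter_rms(NOISE, slope) * 1e12:.1f} ps rms")
```

With the same noise, a slope that is four times shallower gives four times the jitter, which is why the integrating receiver tries to keep the full Iin/C slope.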
Similar to the circuit in Figure 4-2, in the TIA circuit in Figure 4-1 the gain
of a stage is limited by the resistance of the feedback resistor R, which in turn is
limited by the bandwidth at which the TIA is supposed to function. The dominant pole is
set by the large input capacitance, so a low impedance is required for the receiver to operate
at high bandwidth.
4.1 Injection Locked Loop
4.1.1 Simple Integrating Receiver
Figure 4-4: Schematic of an amplifier of finite gain A without feedback. C is the
input capacitance, I is the input current.
Figure 4-5: Diagram explaining the effect of voltage noise on jitter. A steep slope has
less jitter for the same input noise.
To increase the sensitivity, we could simply take away the resistor and let the
input current be integrated on the capacitive node, as in domino circuits. Since this is
a clock receiver, there is no external clock to reset this capacitive node, so there must
be some feature that resets the node after a clock edge is detected. As shown in Figure
4-6, we use an NMOS transistor to reset the input node. This essentially breaks the
trade-off between gain and bandwidth for a TIA. By having two states - a high gain
state and a reset state - we can achieve high sensitivity when we are expecting edge
information, and have a high bandwidth reset state for higher operation frequency.
Figure 4-6: Schematic of simple clock receiver with reset NMOS.
For simulation, the photodiode is modeled as a current source and a 10fF capacitance in parallel. Although the mode-locked laser can provide very short pulses in the
sub-picosecond range, there is still a finite transit time of the photodiode itself. In
the photodiode design implemented in this chip, we can expect the transit time to
be on the order of Wintrinsic/vhsat, the intrinsic region width divided by the
hole saturation velocity, which is about 5ps. For simulations, the current pulse is
modeled as a triangular pulse with a half-max width of 10ps, which is a conservative
estimate. The frequency is also slowed down to 5GHz so the waveforms are easily
readable. Please note that the same circuit can also be used if the optical input is
a single wavelength modulated clock pulse, as we are integrating the input current
onto a capacitor. The slope will be worse, so there will be more jitter, but still much
less than with electrical clock distribution.
The transient operation of the circuit is shown in Figure 4-7. When an optical
pulse hits the photodiode, the generated electron-hole pairs create a current, charging
up node in. The voltage change at node in is determined by
$$\Delta V = \frac{R \int_{t_0}^{t_{reset}} P(t)\,dt}{C_{in}}$$
where R is the responsivity of the photodiode (A/W), t0 to treset is the time
period between when the optical pulse arrives and when the reset is triggered, P(t)
is the instantaneous optical power that reaches the photodiode, and Cin is the total
capacitance at node in. In this case, we see that ΔV is VDD, or 1V, which causes
the inverters to change their output voltage. After two inverter delays, the voltage at
node reset goes high, turning on the reset NMOS and pulling node in back down to a low
voltage. In this circuit, skew is determined by the process mismatch of the inverter
chain, which affects the inverter delays. Jitter is mostly determined by the power
supply noise injected through the inverters, so more stages of inverters
correspond to more skew and jitter. Notice that since the pulse width is independent
of the pulse frequency, the duty cycle will have to be controlled by adjusting the
length of the inverter chain, or by using a divide-by-2 circuit at the output.
Figure 4-7: Transient waveforms at node in (top left) and out (top right) with respect
to the switching threshold of the first inverter Vm and optical pulses.
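The voltage step at node in can be estimated by integrating the optical pulse numerically. The sketch below uses the triangular pulse shape, 10ps half-max width and 10fF capacitance mentioned for the simulations, together with a responsivity of 0.5A/W; the 2mW peak power is a hypothetical value picked so that the step comes out at 1V.

```python
def triangular_pulse(t, t_peak, fwhm, p_peak):
    """Triangular optical power pulse (W); the base width is twice the half-max width."""
    return max(0.0, p_peak * (1.0 - abs(t - t_peak) / fwhm))


def delta_v(responsivity, c_in, power_fn, t0, t_reset, steps=10000):
    """Trapezoidal integration of R * P(t) / Cin between t0 and t_reset."""
    dt = (t_reset - t0) / steps
    energy = 0.0
    for k in range(steps):
        ta = t0 + k * dt
        energy += 0.5 * (power_fn(ta) + power_fn(ta + dt)) * dt
    return responsivity * energy / c_in


# Hypothetical 2 mW peak pulse centered at 20 ps, integrated over a 40 ps window.
pulse = lambda t: triangular_pulse(t, t_peak=20e-12, fwhm=10e-12, p_peak=2e-3)
dv = delta_v(responsivity=0.5, c_in=10e-15, power_fn=pulse, t0=0.0, t_reset=40e-12)
print(f"delta V = {dv:.3f} V")
```

For a triangular pulse the integral is just the peak power times the half-max width (20fJ here), so the closed form R * Ppeak * FWHM / Cin gives the same 1V.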
4.1.2 Integrating Receiver with Tunable Reset
Since the node in from Figure 4-6 swings from 0V to 1V, the power consumption is
the same as the receiverless design. To control how far below Vm node in reaches, we
stack a DAC of NMOS transistors with the reset NMOS, creating a tunable strength
reset NMOS stack, as shown in Figure 4-8.
Figure 4-8: Schematic of simple clock receiver with a tunable reset NMOS stack.
Figure 4-9: Transient waveforms at node in (top left) and out (top right) with respect
to the switching threshold of the first inverter Vm and optical pulses.
The transient operation of the circuit is shown in Figure 4-9. Note that since
we have a smaller voltage swing, the optical pulse power required is greatly reduced.
Also note that any voltage noise at node in will be translated into jitter in time by
the slope at which the voltage crosses Vm, which is determined by R * P/C.
Figure 4-10: Schematic of tunable reset NMOS stack. (Labels in the figure: switch
control bits b[0] to b[5], current source multipliers M = 1, 1, 2, 4, 8, 16, and gate bias Vbias.)
A schematic of the tunable NMOS stack is shown in Figure 4-10. The lower row of
NMOS transistors are binary weighted current sources. Due to the small sizes of these
transistors, they are vulnerable to mismatch, so the smallest transistor is duplicated to
improve the linearity after calibration. Programmable registers control which current
sources are connected to the reset NMOS through minimum size transistor switches.
During the integration step, the reset NMOS is off, so the input node is isolated from
the parasitic capacitance of the current sources. Hence we are able to have a large
tuning range without heavily increasing the capacitive load to the input node or the
reset node.
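The configuration code of the stack maps to a reset strength in a simple way. The sketch below assumes that bits b[0] to b[5] control the multipliers M = 1, 1, 2, 4, 8, 16 in the order they are labeled in Figure 4-10, and that the strength is proportional to the number of enabled unit current sources; mismatch and calibration are not modeled, and the function names are our own.

```python
WEIGHTS = (1, 1, 2, 4, 8, 16)  # assumed multipliers M for b[0]..b[5]


def stack_strength(bits):
    """Number of unit current sources enabled by a code (b[0] first)."""
    if len(bits) != len(WEIGHTS):
        raise ValueError("expected one bit per current source")
    return sum(w for w, b in zip(WEIGHTS, bits) if b)


def bits_for_strength(target):
    """Pick a code that enables exactly `target` unit sources (0 to 32)."""
    if not 0 <= target <= sum(WEIGHTS):
        raise ValueError("target out of range")
    bits = [0] * len(WEIGHTS)
    remaining = target
    # Greedy choice from the largest source down; the duplicated unit cell is used last.
    for idx in sorted(range(len(WEIGHTS)), key=lambda i: WEIGHTS[i], reverse=True):
        if WEIGHTS[idx] <= remaining:
            bits[idx] = 1
            remaining -= WEIGHTS[idx]
    return bits


print(bits_for_strength(21), stack_strength(bits_for_strength(21)))
```

The duplicated unit source extends the range to 32 units and gives two codes for many settings, which is the kind of redundancy a calibration routine can use to work around mismatch.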
4.1.3 Integrating Receiver with Precharge - Injection-Locked
To increase the operation frequency, we add a PMOS precharge branch that turns on
after the input node discharges past Vm, as shown in Figure 4-11. This allows us to
discharge the input node much faster with a stronger NMOS stack, resulting in a net
gain of speed, as shown in Figure 4-12.
By looking carefully at the waveform of node in, we see that at the moment before
the clock pulse comes in, it is always pre-charged to just below Vm, so the optical
charge is always just enough to make node in go past Vm. The slope at which
the voltage at node in crosses Vm is now (Iprecharge + R * P)/C. Another way
of understanding this circuit is that the first two inverters and the discharge/reset
stacks form a three inverter ring oscillator, with a self oscillation frequency set to
be slightly lower than the input clock frequency, which is exactly an injection-locked
loop. Depending on the settings of the discharge/reset stacks, the ring oscillator will
lock onto a certain range of (input current, frequency) values.
Figure 4-11: Schematic of manually tuned clock receiver circuit with PMOS precharge
branch.
Figure 4-12: Transient waveforms at node in (top left) and out (top right) with
respect to the switching threshold of the first inverter Vm and optical pulses.
4.2 Auto-locking design
In Monte Carlo simulations, we are always able to achieve a lock by adjusting the
configuration bits. However, in an actual chip, it would be much less desirable to
calibrate every receiver for every die. Also, the amount of optical power that reaches
each receiver may vary after the chip is packaged and optically connected to other
chips, so calibration may be required after a system is put together. Finally, and
perhaps most importantly, the properties of photodiodes, lasers, modulators
and even transistors change with temperature, and the temperature variation on a
microprocessor die can be 50 or more degrees [28]. This motivates
a circuit that can adapt to a fluctuating environment.
Figure 4-13: Schematic of auto-locking clock receiver circuit. The node 'Vauto' connects the capacitor to the gates of the tuning transistors, adjusting their strength.
The auto-locking version is shown in Figure 4-13. If the NMOS branch is too
strong, the receiver takes more time, maybe more than a cycle, to pull the input node
back up to Vm, so the duty cycle is less than 50%. Similarly, if the NMOS is too
weak, the duty cycle is greater than 50%. Using this relationship, we can integrate
this duty cycle difference onto a capacitor, so the strengths of the NMOS and PMOS
are gradually tuned until a lock is achieved, as shown in Figure 4-14. When locked,
the voltage at Vauto will stay constant if the duty cycles are inversely proportional to
the current source current values, so ideally, if the current sources are matched, then
the duty cycle will be 50%. The simulated receiver sensitivity is -14dBm at 9GHz,
consuming 77.14µW and generating jitter within 0.15ps. In comparison, to operate
at 9GHz, the receiverless design in [14] requires -1.4dBm optical power. A photodiode
responsivity of 0.5A/W is assumed in both cases.
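The lock condition on Vauto can be written as a charge balance. The sketch below is a simplified model under our own assumptions: one current source charges the capacitor while the output is high (a fraction D of the period) and the other discharges it while the output is low; the actual polarity in the circuit may be the opposite, and the names i_up and i_down are hypothetical.

```python
def net_charge_per_cycle(duty, i_up, i_down, period):
    """Charge added to the Vauto capacitor in one clock period for a given duty cycle."""
    return (duty * i_up - (1.0 - duty) * i_down) * period


def locked_duty_cycle(i_up, i_down):
    """Duty cycle at which Vauto stops moving: duty * i_up == (1 - duty) * i_down."""
    return i_down / (i_up + i_down)


PERIOD = 1.0 / 9e9  # 9GHz clock
print(locked_duty_cycle(1.0e-6, 1.0e-6))   # matched current sources
print(locked_duty_cycle(1.1e-6, 1.0e-6))   # 10% current mismatch
```

In this model matched sources settle at a 50% duty cycle, while a 10% mismatch moves the settling point to about 47.6%, which is consistent with the remark that the duty cycles are inversely proportional to the current values.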
Figure 4-14: Simulated transient waveform of in, output(out4), and Vauto(vm-prime).
Before 60ns, the reset NMOS is too weak, so the input node is constantly above Vm,
hence out4 is stuck at Vdd. Vauto slowly rises until lock is achieved at around 60ns
and stabilizes at about 65 ns.
In the extreme case of process variations and mismatches, the clock receiver may
not achieve a lock at the desired frequency and laser power. For instance, if the two
current sources are not matched, then the voltage on Vauto will continue to change
after a lock is achieved. This in turn changes the strengths of the reset transistors,
which changes the duty cycle to compensate for the ratio mismatch, and settles at
some non-50% duty cycle. But if the mismatch is too great, the duty cycle may be
forced to be skewed even further, which may cause the receiver to fail. This can
be avoided by increasing the size of the current mirror transistors, or having a few
bits of tuning capability for one of the current sources. Another feature that could be
made automatic is the dark current compensation. Since the photodiode will produce
some leakage current in the absence of light, a manually tunable current source to
compensate for this DC current was added. The final schematic would be something
like Figure 4-15.
Figure 4-15: Schematic of auto-locking clock receiver circuit with dark current compensation and independently tunable current sources for duty cycle correction.
For the purposes of testing this injection-locked auto-tuning concept, the receiver
circuit implemented in EOS2 was a single-ended circuit, which is small in area and
power consumption but more vulnerable to power rail noise. A fully differential version would not only provide power supply noise rejection, but could also reach a higher
operation frequency with analog style differential pairs, at the cost
of burning more power. An additional benefit is that with a differential connection to the
photodiode, the DC voltage across the diode is 0V, hence there is no dark current to worry
about.
4.3 Summary
Here we present the limitations of a traditional TIA receiver, and how to break the
bandwidth/sensitivity tradeoff by taking advantage of the fact that this is a clock
receiver. Starting with a simple self-reset design, we add on the injection-locked
feature, then the auto-tuning capability, resulting in a high speed, low power, low
jitter receiver that can adapt to a varying environment.
Chapter 5
EOS2 32nm Test Chip
5.1 EOS2 Chip Overview
We completed a test chip in the 32nm process with all the components mentioned
in the photonic technology section, shown in Figure 5-1. To characterize waveguide
loss, we have sets of waveguides of various lengths with identical bends. Thus by
measuring the losses of a set of waveguides, we can decouple the insertion losses of
the vertical couplers and bending losses, leaving us an accurate loss per centimeter
figure. There are also loss rings to measure the loss more accurately if it is very small.
Individual modulators and photodetectors are connected to high speed probe pads
(GSG pads), for a thorough optical-electrical characterization. Modulator rings are
accompanied with heaters, so the relation between frequency shift and temperature
can be measured. Because this chip is manufactured in a 32nm pilot
run, different designs and sizings are implemented for many of the photonic devices,
so that the variation of unknown design parameters such as doping concentration
is covered by at least one design. The hope is that if a design with one set of parameters is
tested to be functional, the next iteration of the chip can narrow down the types of
variations implemented. In the center of the chip are WDM photonic links connected
to transmitter and receiver circuits. We have waveguides that connect modulators and
receivers, with heaters for each ring, so multiple channels can operate simultaneously.
Figure 5-1: Layout view of EOS2 test chip. Sandwiched between loss measurement
structures and individually testable optical devices are the circuit/photonic WDM
rows. The chip dimensions are 2mm by 2mm.
5.2 EOS2 Cell
Figure 5-2: Layout view of circuit cell of EOS2 with photonic devices. Labeled blocks
are Clock Rx (x2), Modulator (x2) and PRBS, Data Rx (x2), and the electronic circuit cell. Note the
relative size of the vertical couplers, rings and spacing between photonics and circuits.
Also note the row of etch vias below the photonics. The circuit cell dimensions are
220µm by 40µm.
Since we plan to run the links at 5Gb/s, we have all our test structures on chip,
including a programmable PRBS to generate the data sequence for the modulator,
a data snapshot and counter for error checking, and many registers to store configuration bits for the modulator and both data and clock receivers. With most of the
digital infrastructure on the chip, no high speed outputs are required, and we are area
limited in how many devices we are able to implement and test, instead of being pad
limited. This digital infrastructure with analog blocks is denoted as a "cell". A block
diagram is shown in Figure 5-3, and a layout view with the adjacent photonics is
shown in Figure 5-2. For ease of layout and verification, one unique cell is created,
and then repeated in rows. The photonics, whether WDM links or individual test
structures, are then connected to the regularly spaced pin locations of the clock receivers, data receivers and modulators, making the hookup process simple and easy
to verify.
Figure 5-3: Block diagram of a circuit cell of EOS2. Analog blocks are connected to
photonic devices below the cell, and the high speed support and low speed configuration registers are above.
All scan signals and high speed clocks were buffered in each cell, and then passed
on to the next cell in each row. This is fine for low speed scan signals as long as
we avoid hold time violations, but unfortunately for high speed clocks, the jitter and
duty cycle deteriorate further down the chain. This problem has been fixed in
the latest chip.
5.3 EOS2 Testing
5.3.1 Clock Receiver Testing
Figure 5-4: Block diagram of the high speed testing infrastructure for the clock
receiver circuit. Each counter has 20 bits (increased to 40
bits in a later chip for better accuracy).
The testing infrastructure is set up to measure the output clock frequency of the
receiver circuit relative to an off-chip high speed clock source. The implementation
is shown in Figure 5-4. The two asynchronous counters are initially set to all zeros,
then the high speed inputs are enabled from the configuration registers. The counters
count the number of cycles until the MSB of the reference counter changes to 1.
This is the shutoff signal that quickly stops the input signals from coming in. By
reading out and comparing the two counter values, we can measure the frequency
ratio between the reference clock and the received clock. For instance, if both clocks
are matched, then both counters should read out the same value. There will be some
offset in the counter value, since the overflow propagates through every counter bit
and then shuts off the clock input, so a few extra cycles will be counted. This offset
is a fixed delay in time, and can be easily calibrated out.
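The frequency estimate from the two counter readouts can be expressed in a few lines. The sketch below is illustrative only: the offset arguments stand for the calibrated extra counts mentioned above, the 9GHz reference is an example value, and the function names are not part of the actual test software.

```python
def received_frequency(f_ref_hz, n_ref, n_rx, offset_ref=0, offset_rx=0):
    """Estimate the receiver output frequency from the two counter readouts."""
    return f_ref_hz * (n_rx - offset_rx) / (n_ref - offset_ref)


def ratio_resolution(counter_bits):
    """Smallest resolvable step in the frequency ratio: one count out of 2**(bits - 1)."""
    return 1.0 / 2 ** (counter_bits - 1)


n_ref = 2 ** 19  # a 20-bit reference counter stops when its MSB becomes 1
print(received_frequency(9e9, n_ref, n_ref))        # matched clocks read the same count
print(ratio_resolution(20), ratio_resolution(40))   # 20-bit versus 40-bit counters
```

With 20-bit counters a single count corresponds to about 1.9 parts per million of the reference frequency, which suggests why longer counters give better accuracy.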
Also, as a backup in case the photodetectors do not work as well as planned, and for
debugging purposes, an electrical pulsed current source can be enabled as the input
to the clock receiver circuit. This self test circuit is driven by a high speed clock, and
the pulse strength and width can be configured to mimic different photodiode current
profiles.
Measuring the relative frequencies allows us to know whether or not the receiver
is properly locked, but to do jitter measurements, a more sophisticated test circuit is
required. In a later chip, the receiver circuit output is also sampled with a reference
clock, so by adjusting the relative phases, we can derive a cumulative histogram of
received clock edge phases with respect to the reference clock.
5.3.2
Chip Testing
All scan chain control signals are connected to an FPGA board, with a state machine
that can interface with a Python command set. The functions include resetting the
chip, reading out the values of a target cell, and writing a user defined configuration
to a target cell. Although the scan chain is about 20000 registers long, the FPGA
can operate at a MHz frequency, so each operation only takes a fraction of a second.
When reading and writing a target cell, the values of the other cells are preserved,
which allows us to configure multiple cells easily. An example of a write operation is
shown in Figure 5-5.
To perform a purely electrical clock receiver test, we configure a target cell to have
the self test circuit enabled, clock receiver enabled, and high speed clocks enabled.
After the experiment is done, we read out the counter values of that target cell and
adjust the clock receiver configuration if necessary. Once we have confidence the
circuits work with the electrical self test, we can move the chip to our optical table
to perform full optical/electrical tests. For the clock receiver, this includes using a
mode-locked laser as the clock source, and a modulated single wavelength laser source.
            Cell 0    Cell 1    Cell 2    Cell 3
time = 0    A         B         C         D
time = 1    D         A         B         C
time = 2    New       D         A         B
time = 3    B         New       D         A
time = 4    A         B         New       D
Figure 5-5: Here we write the value "New" to Cell 2, while preserving the values of
the other cells. This is done by feeding back the last bit of the scanchain into the
input, and only feeding in new data at the right time.
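The write operation of Figure 5-5 can be modeled as one full rotation of the scan chain. The sketch below treats each cell as a single register for simplicity (a real cell holds many bits), and the function name is hypothetical rather than the actual command set.

```python
def write_cell(chain, target, new_value):
    """Rotate the chain once, replacing only the contents of cell `target`.

    chain[0] is the cell nearest the scan input; chain[-1] is fed back to the input.
    Returns the final state and the state after every shift.
    """
    n = len(chain)
    state = list(chain)
    history = [tuple(state)]
    for step in range(1, n + 1):
        fed_back = state[-1]
        # The value leaving the chain at this step originally lived in cell n - step.
        incoming = new_value if n - step == target else fed_back
        state = [incoming] + state[:-1]
        history.append(tuple(state))
    return state, history


final, history = write_cell(["A", "B", "C", "D"], target=2, new_value="New")
for t, row in enumerate(history):
    print(f"time = {t}: {row}")
```

Reading a cell works the same way without the substitution: after one full rotation every value is back in its original place, and the target cell's contents have passed the scan output once.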
5.4 Summary
In this chapter we give an overview of the features in the EOS2 chip, the electrical
cell architecture, and the procedures to test the chip. The key is to have most of the
digital infrastructure on the chip, so no high speed outputs are required, and we can
test as many devices as we can fit. When designing a chip with both photonics and
digital backend of this complexity, we see the need to fully integrate the photonics into
our CMOS design flow. This includes parametrized layout generation for photonic
components, photonic cell preparation for automatic place and route, and post route
design rule checking and connectivity checking. The issues will be addressed in the
next chapter.
Chapter 6
Integrated CMOS/Photonic Chip Design
For an electronic/photonic chip to be designed in a truly integrated fashion, there
must be a photonic equivalent to every step of the CMOS design flow. These include
parametrized layout generation or p-cells, macro preparation for automated place and
route, design rule checking (DRC) and layout vs. schematic checking (LVS).
6.1 Parametrized photonic layout generation and optimization
In this project, we have used p-cells to draw the photonic devices [14]. A p-cell lets
the designer input the design parameters of a device, and reflects the changes in
the layout immediately. For instance, for a waveguide bend, the user can specify the
waveguide width, bend radius, bend angle, etc. More parameters can be added later
on if more degrees of freedom are needed. In stream-out, Cadence merges rectangles
into many-vertex polygons (max 4000 vertices), which means we can draw polygons
with many vertices directly. One major benefit is that redundant vertices are minimized, saving
disk space. Also, Cadence displays simplified polygons when zoomed out, so the
screen refreshes faster in a complex design. This effect is shown in Figure 6-1.
Figure 6-1: The waveguide on the left is drawn with polygons with many vertices;
the waveguide on the right is drawn with just rectangles.
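The idea of a parametrized bend drawn as a single many-vertex polygon can be sketched outside of the layout tool. The code below is a standalone illustration and not the p-cell code used in this project: it generates the vertices of a circular waveguide bend from its width, centerline radius and angle, and checks the enclosed area with the shoelace formula.

```python
import math


def waveguide_bend(width, radius, angle_deg, points_per_edge=64):
    """Return polygon vertices (x, y) of a circular bend centered at the origin."""
    if width <= 0 or radius <= width / 2.0:
        raise ValueError("need width > 0 and radius > width / 2")
    r_in, r_out = radius - width / 2.0, radius + width / 2.0
    theta = math.radians(angle_deg)
    angles = [theta * k / points_per_edge for k in range(points_per_edge + 1)]
    # Walk along the outer edge, then back along the inner edge to close the shape.
    outer = [(r_out * math.cos(a), r_out * math.sin(a)) for a in angles]
    inner = [(r_in * math.cos(a), r_in * math.sin(a)) for a in reversed(angles)]
    return outer + inner


def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


bend = waveguide_bend(width=0.5, radius=10.0, angle_deg=90.0)
print(len(bend), "vertices, area", round(polygon_area(bend), 3))
```

With 64 segments per edge the polygon has 130 vertices, well below the 4000-vertex limit mentioned above, and its area is within about 0.01% of the ideal value radius * width * angle.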
6.2 Design Automation of Electric and Photonic Integrated Circuits
For most commercial digital chips, the physical layout is done automatically with
CAD (Computer-Aided Design) tools. The user starts with a behavioral description
of the chip in Verilog that describes the functionality of the chip, then proceeds to
synthesize the chip into a gate level design, where we have a netlist of logic gates.
The next procedure is floorplanning, where the designer places the power wires, pads,
and higher level module blocks of the chip. Finally a detailed place and route is
performed, where the standard cell layouts corresponding to each logic gate are placed
and wired up. Modern CAD tools let the user adjust the optimization parameters,
such as area, clock frequency, signal integrity etc., and warn the designer of bottlenecks
and violations, so that many iterations can be quickly made, saving the designer a
significant amount of time.
For photonics to be integrated in a complex chip, it also needs to be compatible
with this CAD tool flow. A straightforward solution is to treat photonic blocks the
same way as we treat logic gate standard cells. This integrated electric and photonic
circuit design flow is shown in Figure 6-2. The only additional step is to create a
macro description (using the industry standard LEF format) for each photonic device
from the layout created in the method described in the previous section. This LEF
file contains information about the device's size, metal layer geometries, and port
geometries for electrical connections. The CAD tool needs to know the size of the
device so that it can be tightly placed without overlapping with other devices. The
metal layer geometries and port geometries allow the CAD tool to route signals to
the correct ports as specified in the Verilog netlist.
Figure 6-2: Integrated electric and photonic circuit design flow (chip-level Verilog with
instantiation of LEF macros and connectivity, followed by SOC Encounter place and route).
As an example, a layout of a ring modulator is shown in Figure 6-3. After labeling the port names, an abstract can be generated, leaving only the essential layer
information, as shown in Figure 6-4. Then from this abstract, we can create a LEF
macro (abbreviated version shown in Figure 6-5), specifying the device name, size,
port locations and metal geometries. To plug this back into the CMOS design flow,
we simply reference these new libraries in the tool, add instantiations of photonic
macros, and make connections to other macros, all as described in the Verilog code.
Figure 6-3: Layout view of example ring modulator
Figure 6-4: Abstract view of example ring modulator. Note that lower level metal
features such as metal fill have been blocked out, since port connections are made at
higher levels.
VERSION 5.6 ;
BUSBITCHARS "[]" ;
DIVIDERCHAR "/" ;
MACRO block_electronic_etchrow_1
  CLASS BLOCK ;
  ORIGIN -208 -1794 ;
  FOREIGN block_electronic_etchrow_1 208 1794 ;
  SIZE 2488 BY 165 ;
  SYMMETRY X Y R90 ;
  PIN heater_a_1
    DIRECTION INOUT ;
    USE SIGNAL ;
    PORT
      LAYER u ;
      RECT 431 1870.5 436.5 1882 ;
    END
  END heater_a_1
  OBS
    LAYER m1 ;
    RECT 208 1794 2696 1959 ;
  END
END block_electronic_etchrow_1
END LIBRARY
Figure 6-5: Abbreviated macro description in LEF format of example ring modulator.
6.3 Simple design check using LVS and DRC
As we try to build truly integrated photonic-electrical circuits, LVS capability is
crucial, since humans can only be error-free up to a certain complexity level.
To
indicate optical connectivity, we can use a layer that is unused by the electrical
process, so the optical and electrical nets are independent of each other. We also
add additional layers to help us extract optical devices and their parameters. Thus
the extraction tool will be able to recognize photonic devices and compare the layout
(Figure 6-6) with the designed schematic (Figure 6-7).
Figure 6-6: Layout of the same circuit as in Figure 6-7.
In addition to checking connectivity, by making the optical connectivity layer wider
than the actual waveguide, LVS will detect a short if two waveguides are too close to
each other. Additional DRC checks may be made by adding rules to the DRC rules
file. For example, we can check if the waveguides have the proper silicide, metal fill
and dopant blocking layers overlapping them.
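The spacing check obtained from the widened connectivity layer can be illustrated with axis-aligned rectangles. The sketch below is a simplified stand-in for what the LVS tool does, under the assumption that each waveguide shape is widened by half of the minimum spacing on every side, so that two nets are reported as shorted when their widened shapes overlap; the net names and dimensions are hypothetical.

```python
from itertools import combinations


def widen(rect, margin):
    """Grow a rectangle (x1, y1, x2, y2) by `margin` on every side."""
    x1, y1, x2, y2 = rect
    return (x1 - margin, y1 - margin, x2 + margin, y2 + margin)


def overlaps(a, b):
    """True when two rectangles share a region of non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def spacing_violations(waveguides, min_spacing):
    """Return pairs of optical nets whose widened connectivity shapes overlap."""
    margin = min_spacing / 2.0
    shorts = []
    for (net_a, rects_a), (net_b, rects_b) in combinations(sorted(waveguides.items()), 2):
        if any(overlaps(widen(ra, margin), widen(rb, margin))
               for ra in rects_a for rb in rects_b):
            shorts.append((net_a, net_b))
    return shorts


# Two parallel 0.5-wide waveguides separated by a 1.0 gap (arbitrary layout units).
layout = {"wg_a": [(0, 0, 100, 0.5)], "wg_b": [(0, 1.5, 100, 2.0)]}
print(spacing_violations(layout, min_spacing=2.0))   # gap is too small
print(spacing_violations(layout, min_spacing=0.5))   # gap is sufficient
```

The advantage of doing this through the connectivity layer is that no separate spacing rule is needed: a spacing violation simply shows up as a short between two optical nets in the LVS report.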
Figure 6-7: Schematic of circuit with both photonic and electrical components.
6.4 Summary
This chapter focuses on the integration of photonics with circuits from a design perspective. Using existing tools, we develop methods to draw photonic devices, verify
the circuit, and include the photonic devices within a standard CMOS place-and-route
flow.
Chapter 7
Conclusion and Future Work
7.1 Conclusion
The bottleneck of multi-core processor performance will be the I/O, for both on-chip
core-to-core I/O and off-chip core-to-memory I/O. Integrated silicon photonics can
potentially provide high-bandwidth, low-power signal and clock distribution for multi-core
processors by exploiting wavelength-division multiplexing. This thesis presents
the technology environment of the monolithic optical/electrical chip, and then focuses
on what an optical approach would look like for both a source-synchronous link and
on-chip global clock distribution. The injection-locked loop clock receiver that
suits this architecture breaks the bandwidth/sensitivity tradeoff, and a self-adjusting
mechanism is added to increase robustness. The simulated receiver sensitivity is
-14 dBm at 9 GHz, with the receiver consuming 77.14 µW and generating jitter within
0.15 ps when locked onto a mode-locked laser clock source. We then present the chip
infrastructure and testing procedures, along with methods to make photonic device
design more compatible with a CMOS circuit design flow, putting us closer to true
integration.
7.2 Future Work
As already mentioned in Chapter 4, a fully differential receiver circuit would be less
vulnerable to power supply noise and could potentially operate at a higher frequency.
Another technique to explore is a frequency-dividing mechanism for the reset,
so that the electrical clock frequency can be an integer multiple of the input optical
clock frequency. Although the jitter will increase, since the phase noise is no longer
cleaned up every cycle, this would allow us to use mode-locked lasers with a lower
repetition rate, which are more easily accessible.
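The frequency relation behind this idea is simple arithmetic, sketched below. The function name and the list of divide ratios are hypothetical. The only assumption is the one stated above: with a divide ratio N on the reset, the electrical clock runs at N times the laser repetition rate, so the clock free-runs for N cycles between optical pulses.

```python
from fractions import Fraction


def required_repetition_rate_ghz(electrical_clock_ghz, divide_ratio):
    """Laser repetition rate needed when the reset fires once every `divide_ratio` clock cycles."""
    if divide_ratio < 1:
        raise ValueError("divide_ratio must be a positive integer")
    return Fraction(electrical_clock_ghz) / divide_ratio


# Illustrative sweep for a 9 GHz electrical clock.
table = {n: required_repetition_rate_ghz(9, n) for n in (1, 2, 3, 4)}

for n, rate in table.items():
    print(f"N={n}: laser repetition rate {float(rate):.2f} GHz, "
          f"{n} clock cycle(s) between optical pulses")
```

A larger N relaxes the laser requirement, but it also means more free-running cycles between phase corrections, which is the source of the added jitter.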
Another improvement, orthogonal to the two mentioned above, is to explore
using a phase comparator as the reset/precharge strength tuning
mechanism. This would help decouple the two features that we wish to tune, frequency
and duty cycle, and provide a larger range of operation. For instance, in the
current design, if the optical input frequency is much lower than the designed range,
the duty cycle will be much less than 50% when locked, so the auto-tuning will push the
receiver out of lock. With a phase comparison mechanism, however, the receiver will stay
locked once lock is achieved, and can then proceed to fix the duty cycle. The receiver
would still rely on the injected clock source to clean up the phase noise, so the jitter
would still be much better than that of a PLL.
Bibliography
[1] C. Batten, A. Joshi, J. Orcutt, A. Khilo, B. Moss, C. Holzwarth, M. Popovic,
H. Li, H. Smith, J. Hoyt, F. Kartner, R. Ram, V. Stojanovic, and K. Asanovic,
"Building manycore processor-to-dram networks with monolithic silicon photonics," in High PerformanceInterconnects, 2008. HOTI '08. 16th IEEE Symposium
on, pp. 21-30, Aug. 2008.
[2] C. W. Holzwarth, J. S. Orcutt, H. Li, M. A. Popovic, V. Stojanovic, J. L. Hoyt,
R. J. Ram, and H. I. Smith, "Localized substrate removal technique enabling
strong-confinement microphotonics in bulk si cmos processes, in Conference on
Lasers and Electro-Optics/Quantum Electronics and Laser Science Conference
and Photonic Applications Systems Technologies, p. CThKK5, Optical Society
of America, 2008.
[3] B. Moss, A 10 Gb/s Optical Modulator in 32 nm Bulk-CMOS. PhD thesis,
Massachusetts Institute of Technology, USA, 2009.
[4] E. Yeung and M. Horowitz, "A 2.4 gb/s/pin simultaneous bidirectional parallel
link with per-pin skew compensation," Solid-State Circuits, IEEE Journal of,
vol. 35, pp. 1619 -1628, nov 2000.
[5] D. Ahn, Intrachip Clock Signal Distribution via Si-based Optical Interconnect.
PhD thesis, Massachusetts Institute of Technology, USA, 2006.
[6] C. Gunn, "Cmos photonics for high-speed interconnects," Micro, IEEE, vol. 26,
pp. 58-66, March-April 2006.
[7] T. Barwicz, H. Byun, F. Gan, C. W. Holzwarth, M. A. Popovic, P. T. Rakich,
M. R. Watts, E. P. Ippen, F. X. Kartner, H. I. Smith, J. S. Orcutt, R. J. Ram,
V. Stojanovic, 0. 0. Olubuyide, J. L. Hoyt, S. Spector, M. Geis, M. Grein,
T. Lyszczarz, and J. U. Yoon, "Silicon photonics for compact. energy-efficient
interconnects," J. Opt. Netw., vol. 6, no. 1, pp. 63-73, 2007.
[8] R. Kuppuswamy, S. Sawant, S. Balasubramanian, P. Kaushik, N. Natarajan,
and J. Gilbert, "Over one million tpcc with a 45nm 6-core xeon cpu," in SolidState Circuits Conference - Digest of Technical Papers, 2009. ISSCC 2009. IEEE
International,pp. 70-71,71a, Feb. 2009.
[9] J. Goodman, F. Leonberger, S.-Y. Kung, and R. Athale, "Optical interconnections for vlsi systems," Proceedings of the IEEE, vol. 72, pp. 850-866, July 1984.
[10] P. Delfyett, D. Hartman, and S. Ahmad, "Optical clock distribution using a
mode-locked semiconductor laser diode system," Lightwave Technology, Journal
of, vol. 9, pp. 1646-1649, Dec 1991.
[11] Y. Li, T. Wang, J.-K. Rhee, and L. Wang, "Multigigabits per second board-level
clock distribution schemes using laminated end-tapered fiber bundles," Photonics
Technology Letters, IEEE, vol. 10, pp. 884-886, Jun 1998.
[12] R. Chen, L. Lin, C. Choi, Y. Liu, B. Bihari, L. Wu, S. Tang, R. Wickman,
B. Picor, M. Hibb-Brenner, J. Bristow, and Y. Liu, "Fully embedded boardlevel guided-wave optoelectronic interconnects," Proceedingsof the IEEE, vol. 88,
pp. 780-793, Jun 2000.
[13] S. J. Walker and J. Jahns, "Array generation with multilevel phase gratings," J.
Opt. Soc. Am. A, vol. 7, no. 8, pp. 1509-1513, 1990.
[14] C. Debaes, A. Bhatnagar, D. Agarwal, R. Chen, G. Keeler, N. Helman, H. Thienpont, and D. Miller, "Receiver-less optical clock injection for clock distribution
networks," Selected Topics in Quantum Electronics, IEEE Journal of, vol. 9,
pp. 400-409, March-April 2003.
[15] U. Keller, R. Paschotta, S. Lecomte, R. Haring, L. Krainer, G. Spuhler, and
K. Weingarten, "Novel high-performance pulse generating lasers from 10 ghz
to 160 ghz," in Lasers and Electro-Optics Society, 2002. LEOS 2002. The 15th
Annual Meeting of the IEEE, vol. 2, pp. 469-470 vol.2, Nov. 2002.
[16] H.-J. Lohe, R. Scollo, J. Holzman, F. Robin, D. Erni, W. Vogt, E. Gini, and
H. Jackel, "Comparison of monolithically integrated mode-locked laser diodes
with uni-traveling carrier and multi-quantum well saturable absorbers," in The
18th Annual Meeting of the IEEE Lasers and Electro-Optics Society, 2005.,
pp. 874-875, 2005.
[17] U. Feiste, D. As, and A. Ehrhardt, "18 ghz all-optical frequency locking and
clock recovery using a self-pulsating two-section dfb-laser," Photonics Technology
Letters, IEEE, vol. 6, pp. 106 -108, jan 1994.
[18] W. Mao, Y. Li, M. Al-Mumin, and G. Li, "40 gbit/s all-optical clock recovery
using two-section gain-coupled dfb laser and semiconductor optical amplifier,"
Electronics Letters, vol. 37, pp. 1302 -1303, 11 2001.
[19] M. Leiria, I. Kim, A. Cartaxo, and G. Li, "Experimental study of patternindependent phase noise accumulation in an all-optical clock recovery chain
based on two-section gain-coupled dfb lasers," Lightwave Technology, Journal
of, vol. 26, pp. 1661 -1670, june15, 2008.
[20] W. Mao, Y. Li, M. Al-Mumin, and G. Li, "40 gbit/s all-optical clock recovery
using two-section gain-coupled dfb laser and semiconductor optical amplifier,"
Electronics Letters, vol. 37, pp. 1302 -1303, 11 2001.
[211 K. Yvind, D. Larsson, L. Christiansen, J. Mork, J. Hvam, and J. Hanberg, "Highperformance 10 ghz all-active monolithic modelocked semiconductor lasers,"'
Electronics Letters, vol. 40, pp. 735 - 737, 10 2004.
[22 X. Wang, H. Yokoyama, and T. Shimizu, "Synchronized harmonic frequency
mode-locking with laser diodes through optical pulse train injection," Photonics
Technology Letters, IEEE, vol. 8, pp. 617 -619, may 1996.
[23] C. Boerner, C. Schubert, C. Schmidt, E. Hilliger, V. Marembert, J. Berger,
S. Ferber, E. Dietrich, R. Ludwig, B. Schmauss, and H. Weber, "160 gbit/s clock
recovery with electro-optical pll using bidirectionally operated electroabsorption
modulator as phase comparator," Electronics Letters, vol. 39, pp. 1071 - 1073,
10 2003.
[24] 0. Kamatani and S. Kawanishi, "Ultrahigh-speed clock recovery with phase lock
loop based on four-wave mixing in a traveling-wave laser diode amplifier," Lightwave Technology, Journal of, vol. 14, pp. 1757 -1767, aug 1996.
[25] Y. Li, C. Kim, G. Li, Y. Kaneko, R. Jungerman, and 0. Buccafusca, "Wavelength and polarization insensitive all-optical clock recovery from 96-gb/s data
by using a two-section gain-coupled dfb laser," Photonics Technology Letters,
IEEE, vol. 15, pp. 590 -592, april 2003.
[26] L. Mutter, V. Iakovlev, A. Caliman, A. Mereuta, A. Sirbu, and E. Kapon,
"Phase-locked 1.3 xOOb5;m vcsel arrays based on patterned tunnel junction,"
in Lasers and Electro-Optics 2009 and the European Quantum Electronics Conference. CLEO Europe - EQEC 2009. European Conference on, pp. 1 -1, 14-19
2009.
[27] F. Kefelian, S. O'Donoghue, M. Todaro, J. Mclnerney, and G. Huyet, "High repetition rate monolithic passively mode-locked semiconductor quantum-dot laser:
Investigation of the locking regimes and the rf linewidth," in Lasers and ElectroOptics, 2007. CLEO 2007. Conference on, pp. 1 -2, 6-11 2007.
[28] C. Poirier, R. McGowen, C. Bostak, and S. Naffziger, "Power and temperature control on a 90nm itanium reg;-family processor," in Solid-State Circuits
Conference, 2005. Digest of Technical Papers. ISSCC. 2005 IEEE International,
pp. 304 -305 Vol. 1, 10-10 2005.
[29] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay,
M. Reif, L. Bao, J. Brown, M. Mattina, C.-C. Miao, C. Ramey, D. Wentzlaff,
W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney,
and J. Zook, "Tile64 - processor: A 64-core soc with mesh interconnect," in
Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers.
IEEE International,pp. 88 -598, 3-7 2008.
[30] A. S. Leon, K. W. Tarn, J. L. Shin, D. Weisner, and F. Schumacher, "A powerefficient high-throughput 32-thread sparc processor," Solid-State Circuits, IEEE
Journal of, vol. 42, pp. 7 -16, jan. 2007.
[31] D. Pham, T. Aipperspach, D. Boerstler, M. Bolliger, R. Chaudhry, D. Cox,
P. Harvey, P. Harvey, H. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty,
Y. Masubuchi, M. Pham, J. Pille, S. Posluszny, M. Riley, D. Stasiak, M. Suzuoki,
0. Takahashi, J. Warnock, S. Weitzel, D. Wendel, and K. Yazawa, "Overview of
the architecture, circuit design, and physical implementation of a first-generation
cell processor," Solid-State Circuits, IEEE Journal of, vol. 41, pp. 179 - 196, jan.
2006.
[32] D. Wendel, R. Kalla, R. Cargoni, J. Clables, J. Friedrich, R. Frech, J. Kahle,
B. Sinharoy, W. Starke, S. Taylor, S. Weitzel, S. Chu, S. Islam, and V. Zyuban,
"The implementation of power7tm: A highly parallel and scalable multi-core
high-end server processor," in Solid-State Circuits Conference Digest of Technical
Papers (ISSCC), 2010 IEEE International,pp. 102 -103, 7-11 2010.
[33] D. Cecchi, M. Dina, and C. Preuss, "A 1gb/s sci link in 0.8 mu;n bicmos," in
Solid-State Circuits Conference, 1995. Digest of Technical Papers. 42nd ISSCC,
1995 IEEE International,pp. 326 -327, 15-17 1995.
[34] T. Sato, Y. Nishio, T. Sugano, and Y. Nakagome, "A 5-gbyte/s data-transfer
scheme with bit-to-bit skew control for synchronous dram," Solid-State Circuits,
IEEE Journal of, vol. 34, pp. 653 -660, may 1999.
[35] K. Gotoh, H. Tamura, H. Takauchi, T. S. Cheung, W. Gai, Y. Koyanagi,
R. Schober, R. Sastry, and F. Chen, "A 2b parallel 1.25 gb/s interconnect i/o
interface with self-configurable link and plesiochronous clocking," in Solid-State
Circuits Conference, 1999. Digest of Technical Papers. ISSCC. 1999 IEEE International,pp. 180 -181, 1999.
[36] J. Orcutt and R. Ram, "Photonic device layout within the foundry cmos design environment," Photonics Technology Letters, IEEE, vol. 22, pp. 544 -546,
april15, 2010.