* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download thesis
Electronic engineering wikipedia , lookup
Pulse-width modulation wikipedia , lookup
Telecommunications engineering wikipedia , lookup
Resistive opto-isolator wikipedia , lookup
Buck converter wikipedia , lookup
Mains electricity wikipedia , lookup
Switched-mode power supply wikipedia , lookup
Alternating current wikipedia , lookup
Flip-flop (electronics) wikipedia , lookup
Atomic clock wikipedia , lookup
Immunity-aware programming wikipedia , lookup
Optical rectenna wikipedia , lookup
Rectiverter wikipedia , lookup
Regenerative circuit wikipedia , lookup
A 9GHz Injection Locked Loop Optical Clock Receiver in 32-nm CMOS OFTi I by L CT J 5 2010 LIBRARIES Jonathan Chung Leu Submitted to the Department of Electrical Engineering and Computer Science ARCHIVES in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY September 2010 @ Massachusetts Institute of Technology 2010. All rights reserved. A uthor ........ . ............. ............. .................... Department of Electrical Engineering and Computer Science August 16, 2010 Certified by ........ 4~,4/4~ Y77 6~ ,-. Vladimir Stojanovid Associate Professor Thesis Supervisor ,-: 7/ Accepted by............ Terry P. Orlando Chairman, Department Committee on Graduate Theses E A 9GHz Injection Locked Loop Optical Clock Receiver in 32-nm CMOS by Jonathan Chung Leu Submitted to the Department of Electrical Engineering and Computer Science on August 16, 2010, in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering and Computer Science Abstract The bottleneck of multi-core processors performance will be the I/O, for both onchip core-to-core I/0, and off-chip core-to-memory. Integrated silicon photonics can potentially provide high-bandwidth low-power signal and clock distribution for multicore processors, by exploiting wavelength-division multiplexing. This thesis presents the technology environment of the monolithic optical/electrical chip, and then focuses on how an optical method would look like for both source-synchronous link and for on-chip global clock distribution. The injection-locked loop clock receiver that suits this architecture breaks the bandwidth/sensitivity tradeoff, and a self-adjusting mechanism is added to increase robustness. The simulated receiver sensitivity is 14dBm at 9GHz, consuming 77.14pW and generating jitter within 0. 15ps when locked onto a mode-locked laser clock source. The chip infrastructure and testing procedures are then presented, and lastly a truly integrated optical-electrical design flow is shown as well. Thesis Supervisor: Vladimir Stojanovid Title: Associate Professor Acknowledgments thanks to MIT, prof, lab mates, CBCGB/COM, family and friends fi Contents 1 Introduction 17 2 Photonic Technology 3 4 2.1 Laser Source, Vertical Couplers and Optical Waveguides . . . . . . . 17 2.2 Optical Modulators...... . . . . . . . . . . . . . .. . . . . . . . 18 2.3 Photodiodes . . . . . . . . . . . . . . . . . . . . . .. . . . . -. - -- 19 2.4 Optical Pulse Source . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.5 Sum m ary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 23 Clocking 3.1 Electrical Clocking for Links . . . . . . . . . . . . . . . . . . . . . . . 23 3.2 Optical Clocking for Links . . . . . . . . . . . . . . . . . . . . . . . . 25 3.3 Electrical Clocking for Synchronous Logic . . . . . . . . . . . . . . . 26 3.4 Optical Clocking for Synchronous Logic . . . . ...... . . . . . . . 28 3.5 Goals of an Optical Clock Receiver . . . . . . . . . . . . . . . . . . . 30 3.6 Sum m ary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 33 Optical Clock Receiver Circuit Design 4.1 4.2 Injection Locked Loop . . . . . . . . . . . . . . . . . . . . . . . . . 35 . . . . . . . . . . . . . . . 4.1.1 Simple Integrating Receiver 4.1.2 Integrating Receiver with Tunable Reset 4.1.3 Integrating Receiver with Precharge - Injection-Locked . . . . . . . . Auto-locking design . . . . . . . . . . . . . . . . . . . . . . . . 35 . . . 39 . . . 41 . . . 43 4.3 5 Sum m ary . . . . . . . . . . . . . . . . . . . . . . . . . . EOS2 32nm Test Chip 45 47 5.1 EOS2 Chip Overview 5.2 EOS2 Cell ........ ................................. 49 5.3 EO S2 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.3.1 Clock Receiver Testing . . . . . . . . . . . . . . . . . . . . . . 51 5.3.2 Chip Testing 52 5.4 . ..... ........................... . . . . . . . . . . . . . . . . . . . . . . . . . . . Summary......... . . . . . . . . . . . . . . . . . . . . . . . . 6 Integrated CMOS/Photonic Chip Design 47 53 55 6.1 Parametrized photonic layout generation and optimization . . . . . . 55 6.2 Design Automation of Electric and Photonic Integrated Circuits . . . 56 6.3 Simple design check using LVS and DRC . . . . . . . . . . . . . . . . 60 6.4 Sum m ary 61 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Conclusion and Future Work 7.1 Conclusion.......... . . . . 7.2 Future Work......... . . . . . . . . 63 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 63 List of Figures 1-1 An intra-chip integrated photonic link implementing WDM. An inter- chip link may be implemented similarly. [1] . . . . . . . . . . . . . . . 2-1 Cross-section and SEM image of the waveguides and air gap. Note that the air gap does not reach the transistors. [1, 2] 2-2 . . . . . . . . . 18 Illustration of ring modulator. Note that diodes were only implemented on the straight sections of the ring due to fabrication constraints. [3] 2-3 16 19 SEM image of a modulator. On the sides are vertical couplers for the input/output laser light, and the dark rectangles close to the ring are both PIN diodes used for modulating the ring. [1]... . . . 3-1 . ... 20 Conventional source-synchronous unidirectional and differential pointto-point parallel link, where the clock is sent along with data for easy receiver timing recovery. [4] 3-2 . . . . . . . . . . . . . . . . . . . . . . . 24 Inter-signal skew reduces timing margins at receiver. As inter-signal skew increases, samples are taken farther away from widest opening of eye, lowering the SNR, resulting in a higher BER. [4] . . . . . . . . . 3-3 25 A simple example of a point-to-point optical WDM link. The rings on the transmitter side are modulators driven by the driver circuits, and the rings on the right couple the selected wavelength into photodiodes . . . . . . . . . . . . . . . 25 3-4 Simple example of a synchronous pipeline datapath. [5] . . . . . . . . 26 3-5 Diagram of purely electrical clock distribution network. . . . . . . . . 27 3-6 Diagram of purely optical clock distribution network. . . . . . . . . . 29 that are connected to individual receivers. 3-7 Diagram of hybrid optical and electrical clock distribution network. . 30 4-1 Schematic of a simple Trans-Inpedance Amplifier.......... . 33 4-2 Schematic of an amplifier of finite gain A with feedback. . R is the feedback resistor, C is the input capacitance, I is the input current. 4-3 .. ... ... .. ... .. .... . . . . ... 35 Schematic of an amplifier of finite gain A without feedback. C is the input capacitance, I is the input current. . . . . . . . . . . . . . . . . 4-5 34 Transient plots of input voltage for when R is finite and when R is infinite......... 4-4 . 36 Diagram explaining the effect of voltage noise on jitter. A steep slope has less jitter for the same input noise. . . . . . . . . . . . . . . . . . 36 4-6 Schematic of simple clock receiver with reset NMOS. 37 4-7 Transient waveforms at node in (top left) and out (top right) with . . . . . . . . . respect to the switching threshold of the first inverter Vm and optical p u lses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4-8 Schematic of simple clock receiver with a tunable reset NMOS stack. 39 4-9 Transient waveforms at node in (top left) and out (top right) with respect to the switching threshold of the first inverter Vm and optical p u lses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-10 Schematic of tunable reset NMOS stack.. . . . . . . . . . . . . . 39 40 4-11 Schematic of manually tuned clock receiver circuit with PMOS precharge branch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4-12 Transient waveforms at node in (top left) and out (top right) with respect to the switching threshold of the first inverter Vm and optical p ulses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-13 Schematic of auto-locking clock receiver circuit. 42 The node 'Vauto' connects the capacitor to the gates of the tuning transistors, adjusting their strength... . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4-14 Simulated transient waveform of in, output(out4), and Vauto(vm-prime). Before 60ns, the reset NMOS is too weak, so the input node is constantly above Vin, hence out4 is stuck at Vdd. Vauto slowly rises until lock is achieved at around 60ns and stabilizes at about 65 ns. . . . . . 44 4-15 Schematic of auto-locking clock receiver circuit with dark current conpensation and independently tunable current sources for duty cycle correction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1 45 Layout view of EOS2 test chip. Sandwiched between loss measurement structures and individually testable optical devices are the circuit/photonic WDM rows. The chip dimensions are 2mm by 2mm. 5-2 . 48 Layout view of circuit cell of EOS2 with photonic devices. Note the relative size of the vertical couplers, rings and spacing between photonics and circuits. Also note the row of etch vias below the photonics. The circuit cell dimensions are 220im by 40pm. . . . . . . . . . . . . 5-3 49 Block diagram of a circuit cell of EOS2. Analog blocks are connected to photonic devices below the cell, and the high speed support and low speed configuration registers are above............ . . 5-4 . ... 50 Block diagram of the high speed testing infrastructure for the clock receiver circuit. The number of bits in each counter are actually 20. (modified to 40 bits in later chip for better accuracy) 5-5 . . . . . . . . . 51 Here we write the value "New" to Cell 2, while preserving the values of the other cells. This is done by feeding back the last bit of the scanchain into the input, and only feeding in new data at the right time. 53 6-1 The waveguide on the left is drawn with polygons with many vertices; the waveguide on the right is drawn with just rectangles. . . . . . . . 6-2 Integrated electric and photonic circuit design flow...... . ... 6-3 Layout view of example ring modulator . . . . . . . . . . . . . . . . . 56 57 58 6-4 Abstract view of example ring modulator. Note that lower level metal features such as metal fill have been blocked out, since port connections are made at higher levels. . . . . . . . . . . . . . . . . . . . . . . . . 58 6-5 Abbreviated macro description in LEF format of example ring modulator. 59 6-6 Layout of the same circuit as in Figure ??. . . . . . . . . . . . . . . . 60 6-7 Schematic of circuit with both photonic and electrical components. 61 . List of Tables 3.1 Power Efficiency and Mismatch relative to Optical H-Tree Depth . . . 28 14 Chapter 1 Introduction The emergence of multi-core processors [29] [30] [31] [32] creates an increasingly complex routing environment for long interconnects. With traditional copper interconnects, area is expensive, as each wire is only able to carry one data signal at maximum bandwidth. Although the I/O circuits are improving in bandwidth as technology scales, the channels are not. Since each core must be able to communicate with a set of other cores and off-chip memory, the total bandwidth required is harder to reach under the overall die power constraint, set by power delivery and cooling requirements. If copper wires are not replaced, we will quickly reach a point where it is simply infeasible to add more cores. To address this problem, this work investigates the shift to an optical paradigm, where by exploiting wavelength-division multiplexing (WDM), multiple channels of data can be transmitted along a single optical waveguide without interference. This allows a much greater bandwidth density for both on-chip and off-chip channels. Also, with low loss waveguides which are theoretically possible, the total power consumption for interconnects can be greatly reduced. With WDM, we can use architectures that have fewer routers, hence lower latency. A cartoon of an integrated photonic intra-chip WDM link is shown in Figure 1-1 [1]. The laser light from an external laser source is coupled into chip A, each channel is modulated separately, and then sent to another chip to receive the data. Besides data transmission, it is possible to have an energy efficient and small area clock receiver with a single wavelength modulated clock signal, or a low power, low skew/jitter clock distribution network by using a mode-locked laser or self-pulsed laser to send short optical pulses into an optical clock distribution network. This alleviates the need for PLLs or DLLs, saving area and power. The focus of this thesis will be on optical clocking, and comparing it to a traditional clocking scheme. Extemal Lase Chip A Waveguide Source I Coupler Soure Chip 8 7 ~Ring Modulator 0 Waegude Rng Potdtco 0 Ring Filter /\ Potodetector OSRiggFlte Fiber Figure 1-1: An intra-chip integrated photonic link implementing WDM. An inter-chip link may be implemented similarly. [1] An important feature of this project is that all our photonic structures will be made using standard CMOS processes. Since the photonic devices and their control circuits are only tens of microns apart, this monolithic integration reduces the area and energy costs of interfacing electrical and optical components. Also, monolithic integration is practical in the sense of cost and manufacturability for processor and memory manufactures, but at the same time constrains the materials and dimensions that we are able to use. We believe that this is a tradeoff that must be made for silicon photonics to be implemented in mainstream processor design. A first integrated photonics chip was designed and manufactured in Texas Instrument's 65nm bulk CMOS [1], and the designs in this thesis were made in a 32nm bulk CMOS process. In the following sections, we will go over the key points of our photonic technology and describe the characteristics of the photonic devices used in this project. Next we will look at how clocking is done traditionally, then point out the reasons why optical clocking makes sense. The optical clock receiver circuit will be explained in detail, which is the main focus of this thesis. Finally, a quick summary of the infrastructure for on chip testing and automatic layout will be given. Chapter 2 Photonic Technology For simplicity of integration, and to be compatible with the majority of CMOS processes, we chose to design our chip with a monolithic CMOS process, without flip-chip bonding a III-V wafer or adding additional layers [1). Having this constraint narrows the design space, but as mentioned in the introduction, this allows the photonics to be imported into a microprocessor design flow without an overhaul of the existing technology. 2.1 Laser Source, Vertical Couplers and Optical Waveguides Due to the indirect bandgap of silicon, there are no known high-efficiency lasers in silicon, so all the silicon-photonic technologies proposed here use off-chip laser sources. To couple the light into the waveguides from an optical fiber, we use vertical couplers, which are more feasible from a packaging stand point compared to edge coupling. Vertical couplers have a bragg-grating like structure in a roughly 10pmn by 10pm area, which then tapers into a single-mode waveguide with a width of approximately have a micron. Previously, on-chip photonic waveguides have been made using the silicon body as a core with custom thick buried oxide (BOX) as cladding in a silicon-on-insulator (SOI) process [6], or by depositing silicon-nitride [7] on top of the interconnect stack. These methods either need significant process changes to a standard CMOS flow, or have poor heat dissipation properties, which may cause problems as high-power density is common in manycore computation. Given that compatibility to mainstream processes is a priority, we have chosen to use bulk CMOS, and the polysilicon layer for the waveguide. Since the shallow-trench oxide beneath the polysilicon is too thin to form an effective cladding and shield the core from optical mode leakage losses into the silicon substrate, we have developed a novel self-aligned post-processing procedure, as shown in Figure 2-1, to etch away the silicon substrate underneath the waveguide forming an air gap [2]. With adequate layout spacing and measured etching, this post-processing should not affect the transistors. Backend Dielectric Silicon Substrate Polysilicon Transistor Gates Shallow Trench ISolation Hole Polysilicon Waveguides Polysilion W m Etch Ho4e Ar Gap Figure 2-1: Cross-section and SEM image of the waveguides and air gap. Note that the air gap does not reach the transistors. [1, 2] 2.2 Optical Modulators Data is transmitted by modulating the continuous wave laser light, and this is done by changing the resonance frequency of a ring modulator that is near the waveguide. Due to their smaller size, ring modulators have a smaller capacitance, which results in less power consumption compared to Mach-Zehnder modulators. As shown in Figure 2-2, the intrinsic poly-silicon ring is sandwiched by N-type and P-type poly-silicon at each straight section, forming lateral PIN diodes, so we can inject free carriers into the ring. A scanning electron microscope (SEM) photo of the ring modulator is shown in Figure 2-3. By changing the amount of free carriers in the ring, we can change its effective index of refraction, which in turn change the resonance frequency. The continuous wave laser light that passes by on a waveguide will get coupled into the ring and absorbed if the resonance frequency matches the frequency of the light. Multiple transmitters can modulate light at different frequencies, so we can transmit multiple channels of data simultaneously on the same waveguide. The temperature of the ring also affects the index of refraction, so in-plane polysilicon heaters are placed to control the temperature. Due to the large current range to guarantee wavelength isolation, the heaters are directly connected to pads, and driven by current DACs on the test board. In order to control the temperature, the resistance of the heater is monitored by measuring the voltage drop across it with an off-chip ADC, and in conjunction with the temperature coefficient of polysilicon, we can derive the temperature of the heater. p+doped poly n+doped poly Drop Undoped poly waveguide Silicided pol~y heater resistor Through p+ doped poly n+ doped poly Figure 2-2: Illustration of ring modulator. Note that diodes were only implemented on the straight sections of the ring due to fabrication constraints. [3] 2.3 Photodiodes At the receiving end, a ring is tuned to the channel it wants to receive, and the light at that frequency is coupled into the photodiode of the receiver. Photodiodes for the receivers are made with SiGe, which are used in modern CMOS processes Figure 2-3: SEM image of a modulator. On the sides are vertical couplers for the input/output laser light, and the dark rectangles close to the ring are both PIN diodes used for modulating the ring. [1] to increase mobility in PMOS transistors by introducing strain. The percentage of Ge determines the bandgap of the material, so with more Ge the photodiode will be able to absorb longer wavelengths. In this process we are assuming the mole-fraction of Ge in the SiGe alloy is approximately 30%, which corresponds to a wavelength of about 1200nm. When the light is absorbed, an electron-hole pair are created, which are swept out due to the electric field from the P-N junction, and picked up by receiver circuits. Since these small photodiodes are integrated on chip, the wires in between the diodes and the receiver circuits can be very short, on the order of tens of mircometers. Compared to connecting to a photodiode off chip through a bond pad, this reduces the parasitic capacitance of the connection by two orders of magnitude. Another benefit of integrated photonics is that we are able to have the photodiode itself be a ring resonator coupled to the waveguide, so theoretically we can achieve a reasonable responsivity with a relatively small photodiode, which corresponds to a smaller parasitic capacitance for the diode itself. 2.4 Optical Pulse Source So far clock transmission or clock recovery has been achieved by several methods based on self-pulsating lasers [17] [18] [19] [25], mode-locked lasers [21] [22], or optoelectronic phase-lock loop (PLL) technologies [23] [24]. The complexity and area overhead of having all-optical phase comparators on a multicore processor is too great, so optoelectronic PLLs are better suited for intra-system links of data rates of 100Gb/s. This is also the case with using self-pulsating lasers to recover the clock from a data stream. We wish to leave all the complexity at the laser source, and have the photonics on chip be as small as possible. A mode locked laser produces a train of short pulses of light by inducing a fixed phase relationship between the modes of the laser's resonant cavity. A laser may have a number of modes of equal frequency separation, depending on the gain medium bandwidth, and the cavity length. By fixing the relative phase between each mode, the output amplitudes will periodically all constructively interfere with one another, producing a pulse of light with high intensity. These pulses will be separated in time by the round trip time T = 2L/c, where L is the length of the laser cavity. Semiconductor mode-locked lasers are attractive candidates for generating picosecond pulses at high repetition rates ( 10GHz) due to their small size, inherent robustness and possibility of integration with other semiconductor elements [211. It has been shown that a monolithic semiconductor laser can achieve a 10GHz repetition rate with a 6.6 picosecond pulse width and 50 to 570 femtosecond jitter, depending on the existence of an RF electrical reference. Although many of the lasers discussed are in the 1550nm wavelength range, which is un-suitable for our purposes, there have been more recent efforts toward making high repetition mode-locked lasers with vertical cavity surface emitting lasers (VCSELs) [26], or quantum-dot GaAs/InAs lasers [27] that are more suitable for our wavelength ranges. 2.5 Summary In this chapter, we cover the photonics components needed to make a WDM link, and the benefits or difficulties of designing each component in a monolithic CMOS process. Different devices are suitable for different architectures, so we try to explore all the trade off relations between design parameters for each device. Lastly a few candidates for the optical pulse source are presented and discussed. 22 Chapter 3 Clocking All synchronous electrical systems require a clock signal, whether it is for synchronizing communication channels in high speed links or for synchronizing functional blocks in a microprocessor. The important merits of a clock distribution scheme are skew, jitter, rise/fall time, and power consumption. Skew is when the same clock signal arrives at different nodes at different times, although they were designed to arrive simultaneously. Jitter is the random fluctuation of the clock edge. These both limit the maximum operation frequency the chip. We will discuss how clocking is done electrically in both links and synchronous logic, and how using optics will be beneficial. 3.1 Electrical Clocking for Links High speed links usually transmit a high speed clock synchronously with a parallel data bus, as shown in 3-1. The clock receiver uses a phase locked loop (PLL) or delay locked loop (DLL) to capture the edge information, and generate a sampling clock edge aligned with the eye opening of the input data stream. Having the data receivers sample at the phase where the eye opening is widest provides the largest signal to noise ratio (SNR), resulting in the lowest bit error rate (BER). In the case of using a PLL, a lower frequency clock signal can be used, saving the dynamic power consumed on that wire. refClk refClk data[O] data[n] data[] data[n] TxClk RxCIk D0 timing margins Figure 3-1: Conventional source-synchronous unidirectional and differential point-topoint parallel link, where the clock is sent along with data for easy receiver timing recovery. [4] To see how skew and jitter affects this link, a simple example is shown in Figure 3-2. Due to interconnect mismatches and possibly circuit mismatches, the timing for each channel might shift relative to the reference clock, so samples are taken away from widest opening of the eye, lowering the SNR, and resulting in a higher BER. Fortunately, this phase error is static, so there are circuit techniques to compensate for it. One way is to have an adjustable delay chain between the clock receiver output and the clock input of the data receiver, and compensating for the skew after a startup calibration sequence. The adjustable delay chain may be realized by adjusting [341, [35]. the delay for each stage, activating a different number of stages [33] phase interpolation and selecting different combinations of phases or by using Jitter is a random phase variation in a signal, caused by power supply noise and thermal in the transmitter and receiver, and interference coupled in the channels. This random cycle to cycle variation in clock edges shrinks the eye opening, lowering the SNR, resulting in a higher BER. The performance of the link depends on the receiver timing recovery, and as mentioned in the introduction, the channels do not improve with technology. Also, the large area and power consumption of low jitter PLLs and DLLs adds to the difficultly Transmitter Timing refCik refCIk data[O] Receiver Timing DO D 02 D03 dataO] D29 DOi 2 Sdat data[n] TxClk Dn Dni Dn2 Dn3 data[n]J NO DiDd Dn3 timi marg ns RxCik Figure 3-2: Inter-signal skew reduces timing margins at receiver. As inter-signal skew increases, samples are taken farther away from widest opening of eye, lowering the SNR, resulting in a higher BER. [4] of increasing the number of endpoints in a communication network and hence the number of cores. 3.2 Optical Clocking for Links Transmitter Slaser Receiver refClk 0 RxClk data[0] data[n] 0 Figure 3-3: A simple example of a point-to-point optical WDM link. The rings on the transmitter side are modulators driven by the driver circuits, and the rings on the right couple the selected wavelength into photodiodes that are connected to individual receivers. A simple example of a point-to-point optical WDM source forwarded link is shown in Figure 3-3. The photonic structure is the same as the WDM link shown in Figure 1-1, with driver and receive circuits added. Here we use one of the wavelengths for clock transmission, and the received clock signal triggers the samplers of the other data receivers. Using optics provides many benefits; Each waveguide can carry multiple wavelengths without crosstalk or power rail injected noise. Secondly, with low loss waveguides. the signals will not require re-buffering, hence reducing the jitter injected. This means that the receiver does not need to have a PLL or DLL to provide a stable clock, greatly reducing the power and area overhead. The design of a simple, low power clock receiver will be the focus of the next chapter. 3.3 Electrical Clocking for Synchronous Logic Data Register Data Register R1 -+D R2 Q Combinational -+D LogicQ Q + CLK Figure 3-4: Simple example of a synchronous pipeline datapath. [5] We have already discussed how skew and jitter affect a high speed link, now to see how they affect a synchronous system, a simple example is shown in Figure 3-4. At the initial rising clock edge, the value of Q is copied from D and held at that value after a delay of tc-Q for both data registers. This new value of Q causes the combinational logic to change its output state, and it settles to a final value after a propagation delay of tiogic. Then after holding the output value of the combinational logic for at least the setup time tctp of data register R2, the value of D will be transferred to Q at the next rising clock edge. So to ensure that data register R2 stores the result correctly, the duration between the two rising clock edges must be greater than the sum of these three t values, or: 1|fe1ock Teick >= tC--Q + tiogic + tsetup In the case of skew, we see that if the clock edge of R1 arrives earlier than R2, then the logic has more time to settle, so it seems fine. However in general, there are data paths in both directions between clock grids, so in the other direction the skew eats into the propagation time. Thus the clock frequency must be slowed down enough to accommodate the worst case skew. In the case of jitter, we can see that having a clock edge arriving early immediately after a clock edge arriving late squeezes the propagation time, so the clock frequency must be slowed down enough to accommodate the worst case jitter as well. Traditionally, electrical clocks are generated by PLLs, synchronized by DLLs, and distributed on the chip with trees and/or grids of metal wires. Skew can be caused by transistor mismatch in the routing, and jitter is mainly caused by power supply noise. A simple illustration is shown in Figure 3-5 The jitter of clock networks in processors has lowered over the years, and in a recent example [8], the 9-sigma jitter is down to 100ps, with a skew of 15ps and 3W power consumption for the entire clock network. Higher speed clocks require more repeaters in order to maintain reasonably sharp rise/fall edges, which cause more skew, jitter and power consumption. Lower frequency electrical clock Figure 3-5: Diagram of purely electrical clock distribution network. Note that due to the large area and power consumption of the PLLs, they are implemented at a high level. As an example, 8PLLs are used on a 503mm 2 die in [8]. 3.4 Optical Clocking for Synchronous Logic Besides the benefits of optical communication mentioned above, light may also be used for clocking synchronous logic. The idea of optical clocking was first published by J. W. Goodman in 1984 [9], and others [10, 11, 12, 13, 14] demonstrated different optical clock distribution schemes. The most recent of these is receiverless clocking [14], and although the two clock distribution schemes are different, some comparisons are made between them, just to put this work in context. One main motivation for optical clocking is the mode locked laser, which can have picosecond pulses, 10GHz or higher repetition rates, and sub-picosecond jitter, as demonstrated in [15]. Monolithically integrated mode-locked laser diodes with subpicosecond jitter have been demonstrated as well, which makes having a computing system with mode locked lasers possible [16]. optical receivers power efficiency 1 1.00 2 0.97 4 0.94 0.91 8 0.89 16 0.86 32 64 0.83 128 0.81 256 0.78 power mismatch 0% t2% +4% t6% ±8% ±10% ±13% ±15% t17% Table 3.1: Power Efficiency and Mismatch relative to Optical H-Tree Depth Since the laser source can have a high peak power, on the order of 100mW, we can also easily branch the clock signal. A splitting loss of less than 3% with a ratio of 49:51 has been demonstrated in a process similar to standard CMOS processes [5]. This enables us to have an optical h-tree with a large fanout, and still maintain a good power efficiency and low mismatch at the receiver end. The relationship between power efficiency and mismatch relative to optical h-tree depth is shown in Table 3. 1. In the case where there is zero area power overhead for clock receivers, since waveguide losses are very small, the optimal condition is to have optical RX at every clock grid, basically directly driving the grid wire cap, and the gate cap of the static and dynamic latches we wish to clock. This configuration is shown in Figure 3-6. Figure 3-6: Diagram of purely optical clock distribution network. But of course since there is a finite area and power consumption overhead for the receiver, the optimal point will be somewhere in between, with both optical h-tree and electrical buffering, as shown in Figure 3-7. With simple receivers that can be small in area and power consumption, we can optically split the clean clock signal further down the distribution tree before we transfer it to an electrical signal. This corresponds to fewer stages of repeaters the clock signal goes through after the receiver, which means less added jitter. Please note that a clock source using a single modulated wavelength as used in the links may be used for global clock distribution instead of a mode locked laser. This alleviates the need for a separate waveguide, at the cost of more jitter. Figure 3-7: Diagram of hybrid optical and electrical clock distribution network. 3.5 Goals of an Optical Clock Receiver The simplest possible receiver would be the receiverless clocking scheme [14], in which two photodiodes are connected in series from supply to ground, and by shining light pulses at them alternately, a clock signal can be generated. The only circuitry between the photodiodes and the clocked node are buffers, so a minimal amount of jitter is added. Also, less circuits means less circuit power consumption. Since in our design we use on-chip waveguides, we cannot afford the area to have two waveguides leading into two photodiodes at many places. Lastly, in order to swing the capacitance of two photodiodes rail-to-rail, the quantum efficiency of the photodiode must be sufficiently high or else the optical power required will be large. (Since the average optical power P, = C * VDD/(TB * R), where VDD is the supply voltage, C is the node capacitance, R is detector responsivity and TB is the bit period.) For a 5GHz clock and a reasonable photodiode, VDD = 1V, C = 20fF, R = 0.5A/W, TB = 100ps, then the required average optical power is 0.4mW, or -3.9dBm. Unfortunately, the quantum efficiency of the SiGe photodiodes created in the CMOS flow may not be able to reach 0.5A/W. So the overall power including optical power may not be optimal. Due to these reasons, a more sensitive but still simple receiver is desired. 3.6 Summary In this chapter we discuss the methods currently used for clocking in high speed links and microprocessor chips. We discover that optical clocks can have much better skew and jitter performance, due to the superior signal integrity of optics compared to electrical wires. This allows us to use a simpler circuit without PLLs or DLLs, greatly reducing the power and area overhead for a multi-core processor. 32 Chapter 4 Optical Clock Receiver Circuit Design Now that the reasons for having a small receiver circuit to amplify the photodiode signal are presented, let us first investigate the commonly used Trans-Impedance Amplifier (TIA). In its simplest form, a single stage TIA can be implemented with two transistors and a resistor, as shown in Figure 4-1. Vin Vout Figure 4-1: Schematic of a simple Trans-Impedance Amplifier. For simpler analysis, let us simplify the circuit further into the one shown in Figure 4-2. Vin R Vout Figure 4-2: Schematic of an amplifier of finite gain A with feedback. R is the feedback resistor, C is the input capacitance, I is the input current. With an amplifier of finite gain A, the output voltage is determined by the differential input. Since the input node is the dominant pole due to the large input capacitance provided by the photodiode, we can determine the sensitivity of the circuit by looking at the slope of the input node voltage Vin. In this circuit, we can see that initially, when fin = 0, then Vout = Vin = 0. If Iin is changed to some finite value, all the current goes to charge up the input capacitance C, so Vin begins to rise with a slope of in/C. Now because of the feedback path, some, and eventually all of the current goes through the resistor, hence the voltage at the input settles at (R * Iin)/(1+ A), the output settles at (-R *Iin * A)/(1+ A), and the voltage across the resistor is R * Iin as it should be. This transient voltage plot is shown in Figure 4-3. In the case where R = infinite, it is equivalent to removing the feedback resistor and having an open between input and output, as shown in Figure 4-4? Now all of Iin will be directed toward charging up the capacitance, so no current will be "stolen" by the resistor. Since the input current and input capacitance are determined by the photodiode, the maximum slope that can be achieved is fin/C. Now, the reason why the slope is so important is it determines the effect of voltage noise on jitter. Voltage noise shifts the voltage points up and down by a random amount, so the timing when the voltage crosses the switching threshold shifts left Vin R=oo Slope = 1_in/C r- R*l in (1+A) R=finite time Figure 4-3: Transient plots of input voltage for when R is finite and when R is infinite. and right in the time domain. With a steep slope, the same amount of voltage noise corresponds to less jitter, as shown on the left of Figure 4-5. So similar to the circuit in Figure 4-2, in the TIA circuit in Figure 4-1, the gain of a stage is limited by the resistance of the feedback resistor R, and that in turn is limited by the bandwidth the TIA is supposed to function at. The dominant pole is the large input capacitance, so a low inpedance is required for the receiver to operate at high bandwidth. 4.1 4.1.1 Injection Locked Loop Simple Integrating Receiver To increase the sensitivity, we could just simply take away the resistor, and let the input current be integrated on the capacitive node, like domino circuits. Since this is a clock receiver, there is no external clock to reset this capacitive node, so there must be some feature that resets the node after a clock edge is detected. As shown in Figure 4-6, we use an NMOS transistor to reset the input node. This essentially breaks the Vin Vout Figure 4-4: Schematic of an amplifier of finite gain A without feedback. C is the input capacitance, I is the input current. Input-referred voltage noise jitter Input-referred voltage noise jitter Figure 4-5: Diagram explaining the effect of voltage noise on jitter. A steep slope has less jitter for the same input noise. trade off between gain and bandwidth for a TIA. By having two states - a high gain state and a reset state - we can achieve high sensitivity when we are expecting edge information, and have a high bandwidth reset state for higher operation frequency. For simulation, the photodiode is modeled as a current source and a 10fF capacitance in parallel. Although the mode lock laser can provide very short pulses in the sub-picosecond range, there is still a finite transit time of the photodiode itself. In the photodiode design implemented in this chip, we can expect the transit time to be on the order of Witrinsic/Vhsat, which is the intrinsic region width divided by the hole saturation velocity, which is about 5ps. For simulations, the current pulse is out Figure 4-6: Schematic of simple clock receiver with reset NMOS. modeled as a triangular pulse with a half-max width of 10ps, which is a conservative estimate. The frequency is also slowed down to 5GHz so the waveforms are easily readable. Please note that the same circuit can also be used if the optical input is a single wavelength modulated clock pulse, as we are integrating the input current onto a capacitor. The slope will be worse, so there will be more jitter, but still much better than electrical clock distribution. The transient operation of the circuit is shown in Figure 4-7. When an optical pulse hits the photodiode, the generated electron hole pairs create a current, charging up node in. The voltage change at node in is determined by AV = (R *f' 0 P(t)dt)/Cin Where R is the responsivity of the photodiode ( A/W ), to to treset is the time period between when the optical pulse arrives to when the reset is triggered. P(t) is the instantaneous optical power that reaches the photodiode, and Cin is the total capacitance at node in. In this case, we see that AV is VDD, or 1V, which causes the inverters to change their output voltage. After two inverter delays, the voltage at node reset goes high, turning on the reset NMOS, pulling node in back down to a low voltage. In this circuit, skew is determined by the process mismatch of the inverter chain, which affects the inverter delays. Jitter is mostly determined by the power supply noise injected through the inverters. So obviously more stages of inverters EyeDiagram 12 EyeDiagram -- 1---4 1 1. -.-------- . . .- ... -- - 08 ...... .... .. >04 >4 - 0.2 -- - ... ... 0 2 . ......... 0 62 ---.- --. ..... 00... - ------------ - --- ------- - - 0o .. . ....... . -- r _ 2 60 -0.2 4 5 05 -- -----.---.. ..... 0 1 2 1..-.----.--. .......... .... 3 0 EyeDiagram 10 .---.--- --.. .------ __ 6 Time (seconds) Eye Diagram 10 - ........ ri __ - - Time(seconds) 1 5. ... .. ... 4 5 6 0..... 5.- 0 1 2 3 4 5 6 Figure 4-7: Transient waveforms at node in (top left) and out (top right) with respect to the switching threshold of the first inverter Vm and optical pulses. correspond to more skew and jitter. Notice that since the pulse width is independent of the pulse frequency, so the duty cycle will have to be controlled by adjusting the length of the inverter chain, or by using a divide by 2 circuit at the output. Integrating Receiver with Tunable Reset 4.1.2 Since the node in from Figure 4-6 swings from OV to 1V, the power consumption is the same as the receiverless design. To control how far below Vm node in reaches, we stack a DAC of NMOS transistors with the reset NMOS, creating a tunable strength reset NMOS stack, as shown in Figure 4-8. Vbias Figure 4-8: Schematic of simple clock receiver with a tunable reset NMOS stack. 06 EyeDiagram EyeDiagram 0.65 - -. - 0.55 05 045 - 0 1 2 3 Time(seconds) 6 5 x1S Time(seconds) x10 EyeDiagram 410- - 0 4 1 2 3 Time(seconds) - --- 4 4 5 0 3 Time(seconds) 4 6 5 010 Figure 4-9: Transient waveforms at node in (top left) and out (top right) with respect to the switching threshold of the first inverter Vm and optical pulses. The transient operation of the circuit is shown in Figure 4-9. Note that since we have a smaller voltage swing, the optical pulse power required is greatly reduced. Also note that any voltage noise at node in will be translated into jitter in time by the slope at which the voltage crosses Vm, which is determined by R * P/C. tunable NMOS stack b[O] b[1] b[2 b[3] b[4] b[5] Vbias M=1 M=1 M=2 M=4 M=8 M=16 Figure 4-10: Schematic of tunable reset NMOS stack. A schematic of the tunable NMOS stack is shown in Figure 4-10. The lower row of NMOS transistors are binary weighted current sources. Due to the small sizes of these transistors, they are vulnerable to mismatch, so the smallest transistor is duplicated to improve the linearity after calibration. Programmable registers control which current sources are connected the the reset NMOS through minimum size transistor switches. During the integration step, the reset NMOS is off, so the input node is isolated from the parasitic capacitance of the current sources. Hence we are able to have a large tuning range without heavily increasing the capacitive load to the input node or the reset node. 4.1.3 Integrating Receiver with Precharge - Injection-Locked To increase the operation frequency, we add a PMOS precharge branch that turns on after the input node discharges past Vm, as shown in Figure 4-11. This allows us to discharge the input node much faster with a stronger NMOS stack, resulting in a net gain of speed, as shown in Figure 4-12. By looking carefully at the waveform of node in, we see that at the moment before the clock pulse comes in, it is always pre-charged to just below Vm, so the optical charge is always just enough to make node in go past Vm. The slope at which the voltage at node in crosses V,, is now (Iprecharge + R * P)/C.. Another way of understanding this circuit is that the first two inverters and the discharge/reset stacks form a three inverter ring oscillator, with a self oscillation frequency set to be slightly lower than the input clock frequency, which is exactly an injection-locked loop. Depending on the settings of the discharge/reset stacks, the ring oscillator will lock onto a certain range of (input current, frequency) values. reset 1 in 2 3 4 Vbiasn Figure 4-11: Schematic of manually tuned clock receiver circuit with PMOS precharge branch. EyeDiagram EyeDiagram 4- 0 EyeDiagram 0.5 1 1.5 2 2.5 3 Time(seconds) 35 4 4.5 5 0 a-1oa 0.5 1 1.5 2 2.5 3 Time(seconds) 35 4 45 5 alS-1 Figure 4-12: Transient waveforms at node in (top left) and out (top right) with respect to the switching threshold of the first inverter Vm and optical pulses. 42 4.2 Auto-locking design In monte carlo simulations, we are always able to achieve a lock by adjusting the configuration bits. However in an actual chip, it would be much less desirable to calibrate every receiver for every die. Also, the amount of optical power that reaches each receiver may vary after the chip is packaged and optically connected to other chips, so calibration may be required after a system is put together. Finally, and perhaps the biggest issue is that the properties of photodiodes, lasers, modulators and even transistors change with temperature, and the temperature variation on a microprocessor die can be 50 or more degrees [28]. This gives us a motivation to have a circuit that can adapt to a fluctuating environment. Vauto reset Vauto out in -0--- - -- Vauto Figure 4-13: Schematic of auto-locking clock receiver circuit. The node 'Vauto' coilnects the capacitor to the gates of the tuning transistors, adjusting their strength. The auto-locking version is shown in Figure 4-13. If the NMOS branch is too strong, the receiver takes more time, maybe more than a cycle, to pull the input node back up to Vm, so the duty cycle is less than 50%. Similarly, if the NMOS is too weak, the duty cycle is greater than 50%. Using this relationship, we can integrate this duty cycle difference onto a capacitor, so the strengths of the NMOS and PMOS are gradually tuned until a lock is achieved, as shown in Figure 4-14. When locked, the voltage at Vauto will stay constant if the duty cycles are inversely proportional to the current source current values, so ideally, if the current sources are matched, then the duty cycle will be 50%. The simulated receiver sensitivity is -14dBm at 9GHz, consuming 77.14iW and generating jitter within 0.15ps. In comparasion, to operate at 9GHz, the receiverless design in [14] requires -1.4dBm optical power. A photodiode responsivity of 0.5A/W is assumed in both cases. 1: /vrn..prime Figure 4-14: Simulated transient waveform of in, output(out4), and Vauto(vm-prime). Before 60ns, the reset NMOS is too weak, so the input node is constantly above Vm, hence out4 is stuck at Vdd. Vauto slowly rises until lock is achieved at around 60ns and stabilizes at about 65 ns. In the extreme case of process variations and mismatches, the clock receiver may riot achieve a lock at the desired frequency and laser power. For instance, if the two current sources are not matched, then the voltage on Vauto will continue to change after a lock is achieved. This in turn changes the strengths of the reset transistors, which changes the duty cycle to compensate for the ratio mismatch, and settles at some non-50% duty cycle. But if the mismatch is too great, the duty cycle may be forced to be skewed even further, which may cause the receiver to fail. This can be avoided by increasing the size of the current mirror transistors, or having a few bits of tuning capability for one of the current sources. Another feature that could be made automatic is the dark current compensation. Since the photodiode will produce some leakage current in the absence of light, a manually tunable current source to compensate for this DC current was added. The final schematic would be something like Figure 4-15. Tunable current source Vauto reset T Vbias - IF Vauto Tunable current source Figure 4-15: Schematic of auto-locking clock receiver circuit with dark current compensation and independently tunable current sources for duty cycle correction. For the purposes of testing this injection-locked auto-tuning concept, the receiver circuit implemented in EOS2 was a single ended circuit which is small in area and power consumption, but obviously more vulnerable to power rail noise. A fully differential version would not only provide power supply noise rejection, but also a higher operation frequency can be possible with analog style differential pairs, at the cost of burning more power. An additional benefit is with a differential connection to the photodiode, the DC voltage across the diode is OV, hence no dark current to worry about. 4.3 Summary Here we present the limitations of a traditional TIA receiver, arid how to break the bandwidth/sensitivity tradeoff by taking advantage of the fact that this is a clock receiver. Starting with a simple self-reset design, we add on the injection-locked feature, then the auto-tuning capability, resulting in a high speed, low power, low jitter receiver that can adapt to a varying environment. 46 Chapter 5 EOS2 32nm Test Chip 5.1 EOS2 Chip Overview We completed a test chip in the 32nm process with all the components mentioned in the photonic technology section, shown in Figure 5-1. To characterize waveguide loss, we have sets of waveguides of various lengths with identical bends. Thus by measuring the losses of a set of waveguides, we can decouple the insertion losses of the vertical couplers and bending losses, leaving us an accurate loss per centimeter figure. There are also loss rings to measure the loss more accurately if it is very small. Individual modulators and photodetectors are connected to high speed probe pads (GSG pads), for a thorough optical-electrical characterization. Modulator rings are accompanied with heaters, so the relation between frequency shift and temperature can be measured. Due to the fact that this chip is manufactured in a 32nmn pilot run, for many of the photonic devices, different designs and sizings are implemented, such that the variation of unknown design parameters such as doping concentration are covered by a design. The hope is that if a design of one set of parameters is tested to be functional, the next iteration of the chip can narrow down the types of variations implemented. In the center of the chip are WDM photonic links connected to transmitter and receiver circuits. We have waveguides that connect modulators and receivers, with heaters for each ring, so multiple channels can operate simultaneously. Figure 5-1: Layout view of EOS2 test chip. Sandwiched between loss measurement structures and individually testable optical devices are the circuit/photonic WDM rows. The chip dimensions are 2mm by 2mm. 48 5.2 EOS2 Cell Clock Rx (x2) Modulator (x2) and PRBS Data Rx (x2) electronic circuit cellI Figure 5-2: Layout view of circuit cell of EOS2 with photonic devices. Note the relative size of the vertical couplers, rings and spacing between photonics and circuits. Also note the row of etch vias below the photonics. The circuit cell dimensions are 220pm by 40pm. Since we plan to run the links at 5Gb/s, we have all our test structures on chip, including a programmable PRBS to generate the data sequence for the modulator, a data snapshot and counter for error checking, and many registers to store configuration bits for the modulator and both data and clock receivers. With most of the digital infrastructure on the chip, no high speed outputs are required, and we are area limited in how many devices we are able to implement and test, instead of being pad limited. This digital infrastructure with analog blocks is denoted as a "cell", a block diagram is shown in Figure 5-3, and a layout view with the adjacent photonics is shown in Figure 5-2. For ease of layout and verification, one unique cell is created, and then repeated in rows. The photonics, whether WDM links or individual test structures, are then connected to the regularly spaced pin locations of the clock receivers, data receivers and modulators, making the hookup process simple and easy to verify. Scanchain Buffers SPh R Counter High-speed clock buffer Rx at RxClock (x2) Mklodlator-1 Modulat Figure 5-3: Block diagram of a circuit cell of EOS2. Analog blocks are connected to photonic devices below the cell, and the high speed support and low speed configuration registers are above. All scan signals and high speed clocks were buffered in each cel, and then passed on to the next cell in each row. This is fine for low speed scan signals as long as we avoid hold time violations, but unfortunately for high speed clocks, the jitter and duty cycle will deteriorate further down the chain. This problem has been fixed in the latest chip. 5.3 5.3.1 EOS2 Testing Clock Receiver Testing high speed reference clock asynchronous reference counter I I I II I I II shutoff signal asynchronous circuit counter F77 I I I I I I I receiver circuit clock Figure 5-4: Block diagram of the high speed testing infrastructure for the clock receiver circuit. The number of bits in each counter are actually 20. (modified to 40 bits in later chip for better accuracy) The testing infrastructure is set up to measure the output clock frequency of the receiver circuit relative to an off chip high speed clock source. The implementation is shown in Figure 5-4. The two asynchronous counters are initially set to all zeros, then the high speed inputs are enabled from the configuration registers. The counter count the number of cycles, until the MSB of the reference counter changes to 1. This is the shutoff signal that quickly stops the input signals from coming in. By reading out and comparing the two counter values, we can measure the frequency ratio between the reference clock and the received clock. For instance, if both clocks are matched, then they should both read out the same value. There will be some offset in the counter value, since the overflow propagates through every counter bit and then shuts off the clock input, so a few extra cycles will be counted. This offset is a fixed delay in time, and can be easily calibrated out. Also, as a backup in case the photodetectors do not work as well as planned, and for debugging purposes, an electrical pulsed current source can be enabled as the input to the clock receiver circuit. This self test circuit is driven by a high speed clock, and the pulse strength and width can be configured to mimic different photodiode current profiles. Measuring the relative frequencies allows us to know whether or not the receiver is properly locked, but to do jitter measurements, a more sophisticated test circuit is required. In a later chip, the receiver circuit output is also sampled with a reference clock, so by adjusting the relative phases, we can derive a cummulative histogram of received clock edge phases with respect to the reference clock. 5.3.2 Chip Testing All scan chain control signals are connected to an FPGA board, with a state machine that can interface with a python command set. The functions include resetting the chip, reading out the cell values of a target cell, writing a user defined configuration to a target cell. Although the scan chain is about 20000 registers long, the FPGA can operate at a MHz frequency, so each operation only takes a fraction of a second. When reading and writing a target cell, the values of the other cells are preserved, this allows us to configure multiple cells easily. An example of a write operation is shown in Figure 5-5. To perform a purely electrical clock receiver test, we configure a target cell to have the self test circuit enabled, clock receiver enabled, and high speed clocks enabled. After the experiment is done, we read out the counter values of that target cell and adjust the clock receiver configuration if necessary. Once we have confidence the circuits work with the electrical self test, we can move the chip to our optical table to perform full optical/electrical tests. For the clock receiver, this includes using a mode-locked laser as the clock source, and a modulated single wavelength laser source. Cell 0 Cell 1 Cell 2 Cell 3 time = 0 A B C D time = 1 D A B C New D A B New D A New D time = 2 New A time = 3 time = 4 B4 L>-A B Figure 5-5: Here we write the value "New" to Cell 2, while preserving the values of the other cells. This is done by feeding back the last bit of the scanchain into the input, and only feeding in new data at the right time. 5.4 Summary In this chapter we give an overview of the features in the EOS2 chip, the electrical cell architecture, and the procedures to test the chip. The key is to have most of the digital infrastructure on the chip, so no high speed outputs are required, and we can test as many devices as we can fit. When designing a chip with both photonics and digital backend of this complexity, we see the need to fully integrate the photonics into our CMOS design flow. This includes parametrized layout generation for photonic components, photonic cell preparation for automatic place and route, and post route design rule checking and connectivity checking. The issues will be addressed in the next chapter. 54 Chapter 6 Integrated CMOS/Photonic Chip Design For an electronic/photonic chip to be designed in a truly integrated fashion, there must be a photonic equivalent to every step of the CMOS design flow. These include parametrized layout generation or p-cells, macro preparation for automated place and route, design rule checking (DRC) and layout vs. schematic checking (LVS). 6.1 Parametrized photonic layout generation and optimization In this project, we have used p-cells to draw the photonic devices [14]. A p-cell lets the designer to input the design parameters of a device, and refeclts the changes in the layout immediately. For instance, for a waveguide bend, the user and specify the waveguide width, bend radius, bend angle, etc. More parameters can be added later on if more degrees of freedom are needed. In stream-out, Cadence merges rectangles into many-vertex polygons (max 4000 vertices). So this means we can draw polygons with many vertices. One major benefit is redundant vertices are minimized, saving disk space. Also, Cadence displays simplified polygons when zoomed out, so the screen refreshes faster in a complex design. This effect is shown in Figure 6-1. Figure 6-1: The waveguide on the left is drawn with polygons with many vertices; the waveguide on the right is drawn with just rectangles. 6.2 Design Automation of Electric and Photonic Integrated Circuits For most commercial digital chips, the physical layout is done automatically with CAD ( Computer Assisted Design) tools. The user starts with a behavioral description of the chip in Verilog that describes the functionality of the chip, then proceeds to synthesize the chip into a gate level design, where we have a netlist of logic gates. The next procedure is floorplaning, where the designer places the power wires, pads, and higher level module blocks of the chip. Finally a detailed place and route is performed, where the standard cell layout corresponding to each logic gate are placed and wired up. Modern CAD tools let the user adjust the optimization parameters, such as area, clock frequency, signal integrity etc, and warn the designer of bottlenecks and violations, so that many iterations can be quickly made, saving the designer a significant amount of time. For photonics to be integrated in a complex chip, it also needs to be compatible with this CAD tool flow. A straight forward solution is to treat photonic blocks the same way as we treat logic gate standard cells. This integrated electric and photonic circuit design flow is shown in Figure 6-2. So the only additional step is to create a Chip-level verilog (instantiation of LEF macros and SOC Encounter Place and route connectivity) Figure 6-2: Integrated electric and photonic circuit design flow macro description (using the industry standard LEF format) for each photonic device from the layout created in the method described in the previous section. This LEF file contains information about the device's size, metal layer geometries, and port geometries for electrical connections. The CAD tool needs to know the size of the device so that it can be tightly placed without overlapping with other devices. The metal layer geometries and port geometries allow the CAD tool to route signals to the correct ports as specified in the Verilog netlist. As an example, a layout of a ring modulator is shown in Figure 6-3. After labeling the port names, an abstract can be generated, leaving only the essential layer information, as shown in Figure 6-4. Then from this abstract, we can create a LEF macro (abbreviated version shown in Figure 6-5), specifying the device name, size, port locations and metal geometries. To plug this back into the CMOS design flow, we simply reference these new libraries in hte tool, add instantiations of photonic macros, and make connections to other macros, all as described in the verilog code. pp :~ tt91 r 44 9 V :crJ Figure 6-3: Layout view of example ring modulator Figure 6-4: Abstract view of example ring modulator. Note that lower level metal features such as metal fill have been blocked out, since poit connections are made at higher levels. VERSION 5.6; BUSBITCHAR S "f]" DIVIDERCHAR "I'; MACRO block-electronic etchrow 1 CLASS BLOCK; ORIGIN -208 -1794; FOREIGN blockelectronicetchrow_1 208 1794 SIZE 2488 BY 165 ; SYMMETRY X Y R90; PIN heater a_1 DIRECTION INOUT; USE SIGNAL; PORT LAYER u; RECT 431 1870.5436.51882; END END heater a_1 OBS LAYER ml; RECT 2081794 2696 1959; END END blockelectronicetchrow_1 END LIBRARY Figure 6-5: Abbreviated macro description in LEF format of example ring modulator. 6.3 Simple design check using LVS and DRC As we try to build truly integrated photonic-electrical circuits, LVS capability is crucial, since humans can only be error-free up to a certain complexity level. To indicate optical connectivity, we can use a layer that is unused by the electrical process, so the optical and electrical nets are independent of each other. We also add additional layers to help us extract optical devices and their parameters. Thus the extraction tool will be able to recognize photonic devices and compare the layout (Figure 6-6) with the designed schematic (Figure 6-7). CMOS CIRCUIT Figure 6-6: Layout of the same circuit as in Figure 6-7. In addition to checking connectivity, by making the optical connectivity layer wider than the actual waveguide, LVS will detect a short if two waveguides are too close to each other. Additional DRC checks may be made by adding rules to the DRC rules file. For example, we can check if the waveguides have the proper silicide, metal fill and dopant blocking layers overlapping them. CMOS CIRCUIT T Figure 6-7: Schematic of circuit with both photonic and electrical components. 6.4 Summary This chapter focuses on the integration of photonics with circuits from a design perspective. Using existing tools, we develop methods to draw photonic devices, verify the circuit and include the photonic devices within a standard CMOS place and route flow. 62 Chapter 7 Conclusion and Future Work 7.1 Conclusion The bottleneck of multi-core processors performance will be the I/O, for both onchip core-to-core I/O, and off-chip core-to-memory. Integrated silicon photonics can potentially provide high-bandwidth low-power signal and clock distribution for multicore processors, by exploiting wavelength-division multiplexing. This thesis presents the technology environment of the monolithic optical/electrical chip, and then focuses on how an optical method would look like for both source-synchronous link and for on-chip global clock distribution. The injection-locked loop clock receiver that suits this architecture breaks the bandwidth/sensitivity tradeoff, and a self adjusting mechanism is added to increase robustness. The simulated receiver sensitivity is 14dBm at 9GHz, consuming 77.14pW and generating jitter within 0. 15ps when locked onto a mode-locked laser clock source. We then present the chip infrastructure and testing procedures, also methods to make the photonic device design more compatible with a CMOS circuit design flow, putting us closer to true integration. 7.2 Future Work As already mentioned in chapter 4, a fully differential receiver circuit would be less vulnerable to power supply noise, and can potentially operate at a higher frequency. Another technique to explore is to have a frequency dividing mechanism for the reset, so the electrical clock frequency can be an integer multiple of the input optical clock source. Although the jitter will increase since the phase noise is not cleaned up every cycle, this will allow us to use mode-locked lasers of a lower repetition rate, which are more easily accessible. Another improvement that is orthogonal to the two mentioned above, is to explore the possibility of using a phase comparator as the reset/precharge strength tuning mechanism. This will help us decouple the two features that we wish to tune, frequency and duty cycle, and provide a larger range of operation. For instance, in the current design, if the optical input frequency is much lower than the designed range, the duty cycle will be much less than 50% if locked, so the auto tune will push itself out of lock. However if we use a phase comparison mechanism, it will stay locked once a lock is achieved, and can proceed to fix the duty cycle. The receiver would still rely on the injected clock source to clean up the phase noise, so the jitter would still be much better than a PLL. Bibliography [1] C. Batten, A. Joshi, J. Orcutt, A. Khilo, B. Moss, C. Holzwarth, M. Popovic, H. Li, H. Smith, J. Hoyt, F. Kartner, R. Ram, V. Stojanovic, and K. Asanovic, "Building manycore processor-to-dram networks with monolithic silicon photonics," in High PerformanceInterconnects, 2008. HOTI '08. 16th IEEE Symposium on, pp. 21-30, Aug. 2008. [2] C. W. Holzwarth, J. S. Orcutt, H. Li, M. A. Popovic, V. Stojanovic, J. L. Hoyt, R. J. Ram, and H. I. Smith, "Localized substrate removal technique enabling strong-confinement microphotonics in bulk si cmos processes, in Conference on Lasers and Electro-Optics/Quantum Electronics and Laser Science Conference and Photonic Applications Systems Technologies, p. CThKK5, Optical Society of America, 2008. [3] B. Moss, A 10 Gb/s Optical Modulator in 32 nm Bulk-CMOS. PhD thesis, Massachusetts Institute of Technology, USA, 2009. [4] E. Yeung and M. Horowitz, "A 2.4 gb/s/pin simultaneous bidirectional parallel link with per-pin skew compensation," Solid-State Circuits, IEEE Journal of, vol. 35, pp. 1619 -1628, nov 2000. [5] D. Ahn, Intrachip Clock Signal Distribution via Si-based Optical Interconnect. PhD thesis, Massachusetts Institute of Technology, USA, 2006. [6] C. Gunn, "Cmos photonics for high-speed interconnects," Micro, IEEE, vol. 26, pp. 58-66, March-April 2006. [7] T. Barwicz, H. Byun, F. Gan, C. W. Holzwarth, M. A. Popovic, P. T. Rakich, M. R. Watts, E. P. Ippen, F. X. Kartner, H. I. Smith, J. S. Orcutt, R. J. Ram, V. Stojanovic, 0. 0. Olubuyide, J. L. Hoyt, S. Spector, M. Geis, M. Grein, T. Lyszczarz, and J. U. Yoon, "Silicon photonics for compact. energy-efficient interconnects," J. Opt. Netw., vol. 6, no. 1, pp. 63-73, 2007. [8] R. Kuppuswamy, S. Sawant, S. Balasubramanian, P. Kaushik, N. Natarajan, and J. Gilbert, "Over one million tpcc with a 45nm 6-core xeon cpu," in SolidState Circuits Conference - Digest of Technical Papers, 2009. ISSCC 2009. IEEE International,pp. 70-71,71a, Feb. 2009. [9] J. Goodman, F. Leonberger, S.-Y. Kung, and R. Athale, "Optical interconnections for vlsi systems," Proceedings of the IEEE, vol. 72, pp. 850-866, July 1984. [10] P. Delfyett, D. Hartman, and S. Ahmad, "Optical clock distribution using a mode-locked semiconductor laser diode system," Lightwave Technology, Journal of, vol. 9, pp. 1646-1649, Dec 1991. [11] Y. Li, T. Wang, J.-K. Rhee, and L. Wang, "Multigigabits per second board-level clock distribution schemes using laminated end-tapered fiber bundles," Photonics Technology Letters, IEEE, vol. 10, pp. 884-886, Jun 1998. [12] R. Chen, L. Lin, C. Choi, Y. Liu, B. Bihari, L. Wu, S. Tang, R. Wickman, B. Picor, M. Hibb-Brenner, J. Bristow, and Y. Liu, "Fully embedded boardlevel guided-wave optoelectronic interconnects," Proceedingsof the IEEE, vol. 88, pp. 780-793, Jun 2000. [13] S. J. Walker and J. Jahns, "Array generation with multilevel phase gratings," J. Opt. Soc. Am. A, vol. 7, no. 8, pp. 1509-1513, 1990. [14] C. Debaes, A. Bhatnagar, D. Agarwal, R. Chen, G. Keeler, N. Helman, H. Thienpont, and D. Miller, "Receiver-less optical clock injection for clock distribution networks," Selected Topics in Quantum Electronics, IEEE Journal of, vol. 9, pp. 400-409, March-April 2003. [15] U. Keller, R. Paschotta, S. Lecomte, R. Haring, L. Krainer, G. Spuhler, and K. Weingarten, "Novel high-performance pulse generating lasers from 10 ghz to 160 ghz," in Lasers and Electro-Optics Society, 2002. LEOS 2002. The 15th Annual Meeting of the IEEE, vol. 2, pp. 469-470 vol.2, Nov. 2002. [16] H.-J. Lohe, R. Scollo, J. Holzman, F. Robin, D. Erni, W. Vogt, E. Gini, and H. Jackel, "Comparison of monolithically integrated mode-locked laser diodes with uni-traveling carrier and multi-quantum well saturable absorbers," in The 18th Annual Meeting of the IEEE Lasers and Electro-Optics Society, 2005., pp. 874-875, 2005. [17] U. Feiste, D. As, and A. Ehrhardt, "18 ghz all-optical frequency locking and clock recovery using a self-pulsating two-section dfb-laser," Photonics Technology Letters, IEEE, vol. 6, pp. 106 -108, jan 1994. [18] W. Mao, Y. Li, M. Al-Mumin, and G. Li, "40 gbit/s all-optical clock recovery using two-section gain-coupled dfb laser and semiconductor optical amplifier," Electronics Letters, vol. 37, pp. 1302 -1303, 11 2001. [19] M. Leiria, I. Kim, A. Cartaxo, and G. Li, "Experimental study of patternindependent phase noise accumulation in an all-optical clock recovery chain based on two-section gain-coupled dfb lasers," Lightwave Technology, Journal of, vol. 26, pp. 1661 -1670, june15, 2008. [20] W. Mao, Y. Li, M. Al-Mumin, and G. Li, "40 gbit/s all-optical clock recovery using two-section gain-coupled dfb laser and semiconductor optical amplifier," Electronics Letters, vol. 37, pp. 1302 -1303, 11 2001. [211 K. Yvind, D. Larsson, L. Christiansen, J. Mork, J. Hvam, and J. Hanberg, "Highperformance 10 ghz all-active monolithic modelocked semiconductor lasers,"' Electronics Letters, vol. 40, pp. 735 - 737, 10 2004. [22 X. Wang, H. Yokoyama, and T. Shimizu, "Synchronized harmonic frequency mode-locking with laser diodes through optical pulse train injection," Photonics Technology Letters, IEEE, vol. 8, pp. 617 -619, may 1996. [23] C. Boerner, C. Schubert, C. Schmidt, E. Hilliger, V. Marembert, J. Berger, S. Ferber, E. Dietrich, R. Ludwig, B. Schmauss, and H. Weber, "160 gbit/s clock recovery with electro-optical pll using bidirectionally operated electroabsorption modulator as phase comparator," Electronics Letters, vol. 39, pp. 1071 - 1073, 10 2003. [24] 0. Kamatani and S. Kawanishi, "Ultrahigh-speed clock recovery with phase lock loop based on four-wave mixing in a traveling-wave laser diode amplifier," Lightwave Technology, Journal of, vol. 14, pp. 1757 -1767, aug 1996. [25] Y. Li, C. Kim, G. Li, Y. Kaneko, R. Jungerman, and 0. Buccafusca, "Wavelength and polarization insensitive all-optical clock recovery from 96-gb/s data by using a two-section gain-coupled dfb laser," Photonics Technology Letters, IEEE, vol. 15, pp. 590 -592, april 2003. [26] L. Mutter, V. Iakovlev, A. Caliman, A. Mereuta, A. Sirbu, and E. Kapon, "Phase-locked 1.3 xOOb5;m vcsel arrays based on patterned tunnel junction," in Lasers and Electro-Optics 2009 and the European Quantum Electronics Conference. CLEO Europe - EQEC 2009. European Conference on, pp. 1 -1, 14-19 2009. [27] F. Kefelian, S. O'Donoghue, M. Todaro, J. Mclnerney, and G. Huyet, "High repetition rate monolithic passively mode-locked semiconductor quantum-dot laser: Investigation of the locking regimes and the rf linewidth," in Lasers and ElectroOptics, 2007. CLEO 2007. Conference on, pp. 1 -2, 6-11 2007. [28] C. Poirier, R. McGowen, C. Bostak, and S. Naffziger, "Power and temperature control on a 90nm itanium reg;-family processor," in Solid-State Circuits Conference, 2005. Digest of Technical Papers. ISSCC. 2005 IEEE International, pp. 304 -305 Vol. 1, 10-10 2005. [29] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C.-C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook, "Tile64 - processor: A 64-core soc with mesh interconnect," in Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers. IEEE International,pp. 88 -598, 3-7 2008. [30] A. S. Leon, K. W. Tarn, J. L. Shin, D. Weisner, and F. Schumacher, "A powerefficient high-throughput 32-thread sparc processor," Solid-State Circuits, IEEE Journal of, vol. 42, pp. 7 -16, jan. 2007. [31] D. Pham, T. Aipperspach, D. Boerstler, M. Bolliger, R. Chaudhry, D. Cox, P. Harvey, P. Harvey, H. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Pham, J. Pille, S. Posluszny, M. Riley, D. Stasiak, M. Suzuoki, 0. Takahashi, J. Warnock, S. Weitzel, D. Wendel, and K. Yazawa, "Overview of the architecture, circuit design, and physical implementation of a first-generation cell processor," Solid-State Circuits, IEEE Journal of, vol. 41, pp. 179 - 196, jan. 2006. [32] D. Wendel, R. Kalla, R. Cargoni, J. Clables, J. Friedrich, R. Frech, J. Kahle, B. Sinharoy, W. Starke, S. Taylor, S. Weitzel, S. Chu, S. Islam, and V. Zyuban, "The implementation of power7tm: A highly parallel and scalable multi-core high-end server processor," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International,pp. 102 -103, 7-11 2010. [33] D. Cecchi, M. Dina, and C. Preuss, "A 1gb/s sci link in 0.8 mu;n bicmos," in Solid-State Circuits Conference, 1995. Digest of Technical Papers. 42nd ISSCC, 1995 IEEE International,pp. 326 -327, 15-17 1995. [34] T. Sato, Y. Nishio, T. Sugano, and Y. Nakagome, "A 5-gbyte/s data-transfer scheme with bit-to-bit skew control for synchronous dram," Solid-State Circuits, IEEE Journal of, vol. 34, pp. 653 -660, may 1999. [35] K. Gotoh, H. Tamura, H. Takauchi, T. S. Cheung, W. Gai, Y. Koyanagi, R. Schober, R. Sastry, and F. Chen, "A 2b parallel 1.25 gb/s interconnect i/o interface with self-configurable link and plesiochronous clocking," in Solid-State Circuits Conference, 1999. Digest of Technical Papers. ISSCC. 1999 IEEE International,pp. 180 -181, 1999. [36] J. Orcutt and R. Ram, "Photonic device layout within the foundry cmos design environment," Photonics Technology Letters, IEEE, vol. 22, pp. 544 -546, april15, 2010.