The Effectiveness of Stacked-VDD on Different Full Adder Topologies without Voltage Regulation

by Wenlu Sun & Jiyati Verma
ECE 6332 – Fall 2011

ABSTRACT
Typically, supply noise does not lessen with decreasing VDD, contributing to the already growing problem of power density. One way to scale VDD and suppress power dissipation is to stack circuits so that they share a supply running at a multiple of the original supply voltage. The limitation of this technique is the need for voltage regulation of the middle node on the output side of the circuit: the more this voltage point strays from VDD/2, the harder it is to regulate the circuit. Because designing robust voltage regulators is difficult, supply noise plays a large role in how effective this technique is. For our project, we experimented with full adder topology, implementing 6 single-bit blocks and cascading them into 16-bit adders. To generalize about how well different families of adders keep the middle node at VDD/2, we assessed two static adders, two transmission logic adders, and two dynamic adders. Simulation results revealed that static adders stray the least from VDD/2.

Keywords: Stacked-VDD, VDD stacking, multi-story power delivery, Cadence, Adder Topology, Static Adders, Transmission Logic Adders, NP CMOS Dynamic Adders

Ideally, the voltage drop is always the same across the bottom half and the top half of the circuit, and the currents passing through the two halves are equal at all times. In reality, different input signals arriving at different times can cause some circuits to draw more current than others. This can cause the Vmiddle nodes shown in Figure 1 to wobble away from the ideal midpoint, VDD/2. As we will see later, the stability of this middle node reflects how well the overall circuit balances current.
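The wobble of the middle node can be illustrated with a deliberately crude DC model. The sketch below is not part of the original project: it assumes each tier behaves like an equivalent resistor sized to draw a given current at the nominal 1.1 V, so that the unregulated middle node is simply a resistive divider of the 2.2 V stacked supply. The function name and the example currents are hypothetical.

```python
def vmiddle(i_top, i_bottom, v_tier=1.1):
    """Unregulated middle-node voltage of a two-tier stack (crude DC model).

    Assumption: each tier is replaced by an equivalent resistor that would
    draw its given current (in amperes) at the nominal per-tier voltage.
    """
    r_top = v_tier / i_top        # equivalent resistance of the top tier
    r_bottom = v_tier / i_bottom  # equivalent resistance of the bottom tier
    v_supply = 2 * v_tier         # stacked supply, 2.2 V for 1.1 V tiers
    # Vmiddle is the voltage across the bottom tier of the series divider.
    return v_supply * r_bottom / (r_top + r_bottom)


balanced = vmiddle(1e-3, 1e-3)        # equal demand in both tiers
bottom_busier = vmiddle(1e-3, 1.5e-3) # bottom tier wants more current
top_busier = vmiddle(1.5e-3, 1e-3)    # top tier wants more current

print(f"balanced:      {balanced:.3f} V")
print(f"bottom busier: {bottom_busier:.3f} V")
print(f"top busier:    {top_busier:.3f} V")
```

In this model, equal demand leaves the node at the midpoint, while unequal demand pulls it toward the rail of the busier tier. A real circuit also has switching transients and capacitances that this static picture ignores.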
Consequently, practical circuit-level implementations of VDD stacking involve closely regulating the voltage of each middle node. This important current balancing strategy pins each of those nodes to the correct fraction of the supply voltage. But designing robust voltage regulators is difficult, and if the currents are not balanced, the difference flows through the voltage regulators. Not only does this waste more power, it also worsens power supply noise because it increases the current through external power grids. So any way to alleviate that electrical strain would contribute to better

1. INTRODUCTION
Reducing power supply voltage is necessary as industry scales transistors to smaller sizes, approaching the fundamental limit of CMOS technology. However, with increasing operating currents and higher switching speeds, scaling down transistors worsens power supply noise. This trend in power supply noise harms circuit reliability, power density, and overall system functionality. When decoupling capacitance does not cushion the effects of power supply noise, that noise can contribute to excess power dissipation. Exacerbating the power dissipation crisis is unacceptable, as circuits are approaching the power density of a nuclear reactor. The conventional power supply method is not robust against this noise, whereas multi-story power delivery effectively reduces it.

Multi-story power delivery, or VDD stacking, is where n independent circuits, originally powered by n different power supplies, are grouped into one circuit loop, powered by a single power source of nVDD, and assigned n different voltage domains. Typically, on a chip, “independent” circuits are hooked up to the same power supply, so in reality all those different circuits are in parallel with the same single power supply. Figure 1 illustrates this transformation for possibly complex and different circuits.
In effect, this technique allows us to reuse the current of one circuit in each of the others. The benefit of doing so is that it reduces the power supply current to 1/n of its original value, lowering power supply noise.

Figure 1- VDD stacking applied to n blocks of logic. In reality, multiple Vmiddle nodes must each be carefully voltage regulated, making this a more complex solution.

power delivery. Overall, limiting how much noise the voltage regulator experiences would increase the effectiveness of the VDD stacking technique. Thus, our central focus is to simulate and record what noise appears at that middle node without voltage regulation, because less noise means the voltage regulator can be less robust.

The general super-circuit outline we adhered to was the one illustrated by Figure 2, with added functionality. We wanted a circuit that could function as two separate 2-digit 16-bit binary adders computing sums independently, and our project aimed to determine which full adder topology would produce the least noise at the middle node for a given set of input vectors. In the methodology, we outline the circuit topologies of those adders and the single input pattern change we applied, for the sake of simplicity. By doing so, we hope to gain some insight into which circuit family is inherently better at balancing current.

1.1 Literature Review
1.1.1 VDD Stacking
Our work largely builds off the findings of Gu and Kim: their simulation results showed a 66% reduction of IR noise and a 5% reduction of total power consumption when a 2-story architecture’s metrics were compared to the respective values for conventional power delivery schemes. Power used by the circuit blocks stays the same, and the power dissipated in the parasitic resistors is reduced. These seemingly small returns make a big difference in VLSI systems. However, this research group included some voltage regulation in their simulation, so our results may not be immediately applicable to their work.
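The 1/n reduction in supply current, and the observation that block power stays the same while resistive losses shrink, can be checked with a few lines of arithmetic. The sketch below is a minimal illustration under idealized assumptions: all blocks are identical and perfectly balanced, and the power grid is a single lumped resistance. The function names, the 10 mA block current, and the 0.5 ohm grid resistance are hypothetical values chosen only for the example.

```python
def supply_draw(n_blocks, i_block, vdd, stacked):
    """Return (supply_voltage, supply_current) for n identical blocks.

    Assumption: ideal, perfectly balanced blocks, each drawing i_block at vdd.
    Conventional delivery puts the blocks in parallel; stacking puts them in
    series so the same current is reused by every tier.
    """
    if stacked:
        return n_blocks * vdd, i_block
    return vdd, n_blocks * i_block


def grid_loss(i_supply, r_grid):
    """Power dissipated in a lumped parasitic grid resistance (I^2 * R)."""
    return i_supply ** 2 * r_grid


N, I_BLOCK, VDD, R_GRID = 2, 10e-3, 1.1, 0.5  # hypothetical example values

v_conv, i_conv = supply_draw(N, I_BLOCK, VDD, stacked=False)
v_stack, i_stack = supply_draw(N, I_BLOCK, VDD, stacked=True)

print(f"conventional: {v_conv:.1f} V, {i_conv * 1e3:.1f} mA")
print(f"stacked:      {v_stack:.1f} V, {i_stack * 1e3:.1f} mA")
print(f"current ratio (stacked / conventional): {i_stack / i_conv:.2f}")
print(f"grid loss ratio: {grid_loss(i_stack, R_GRID) / grid_loss(i_conv, R_GRID):.2f}")
```

The power delivered to the blocks (voltage times current) is identical in both cases; only the current through the supply path changes. The percentages reported by Gu and Kim come from their own simulations and are not reproduced by this toy calculation.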
The findings of other sources determined how our stacked-VDD circuit should look. For example, one source, which stacked multipliers, indicated that stacked circuits that are balanced and independent can help achieve higher efficiency for the system. Based on this finding, we decided to use the same adder for the top and bottom block in each setup, with the two blocks computing “different” sums. References  and  make our findings applicable to other uses of VDD stacking, since module assignment and pin packing also influence the effectiveness of VDD stacking.

In VLSI, a whole universe of adder topologies optimized for various purposes already exists. Much work focuses on speeding up adders, streamlining the data arrival, tweaking the critical path, and opting for tree structures whenever possible. Since there was no way to do all of them justice within the course of a semester and within the span of such a project, we took a simplified approach to this grand variety of adder topologies. In particular, we focused on 3 different families of adders: static CMOS adders, transmission logic adders, and dynamic adders. We drew as much as we could from the textbook on this matter. Reference  focused on optimizing full adder block circuitry to minimize the power-delay product (PDP) and on gauging how sensitive these topologies are to scaling VDD, covering static adders, transmission logic adders, and one dynamic adder: their comprehensiveness is commendable. Reference  seeks to minimize the PDP for dynamic and transmission logic adders by making them faster. References  and  served as sources for adder topologies for this project. Reference  gave some insight into the energy efficiency of different low-power logic styles of adders, drawing from transmission logic and static CMOS logic.

Figure 2- The two-level stacked-VDD circuit applied to each adder topology. The individual adders each work off of 1.1 V, so in VDD stacking, 2.2 V is required.
For each topology, we are concerned with the amount of noise that naturally appears at Vmiddle with respect to ground.

A study involving both VDD stacking and adders proved hard to find. We did find a paper that involved SRAM banks and VDD stacking, which discussed implications for reducing Vmiddle’s fluctuations around VDD/2. Other than that, none of the papers cited were concerned with both topics: for the most part, they belong to one or the other, only intersecting at the concern of reducing power dissipation. Herein lies the novelty of our discovery.

1.2 Problem Statement
As a whole, this becomes a complex problem, but it is highly dependent on two things: (1) the topology of the inner circuits and (2) the input patterns supplied to each of them. Our approach simplifies the possibly complicated effect of these two factors.

The topology of the inner circuits determines how current is balanced. In class, we have witnessed how different styles of circuitry utilize and reroute current in fundamentally different ways, to an extent. Static CMOS logic has “opposite” circuits in the pull-up and pull-down networks, always clearly connects the output to VDD or ground, and has good switching power characteristics. Transmission logic’s transmission gates act like switches and thus open up new possibilities, but they have the worst subthreshold leakage characteristics. Dynamic logic only uses a pull-down network, has “Evaluate” and “Precharge” modes, is not always so clearly linked to either VDD or ground, and often has the worst switching power. Dynamic logic is attractive for high-speed applications, but large clock loads and the “Precharge” mechanism result in high power dissipation. Choosing speed as the main purpose of these circuits inherently compromises the robustness of the generated output signal.

Figure 3- Cadence schematic of 16-bit adder implementation.

Using this piece of information, we
may hypothesize that either the transmission logic adders or the static CMOS adders will result in less noise at Vmiddle. An appropriate simplification here is that if the two circuits have the same topology, this may grant the overall circuit better current-balancing properties.

We implemented and expanded this circuit (see Figure 2 and Figure 3) in Cadence using six different topologies: the static CMOS adder, the Mirror Adder (static), the transmission function full adder, the transmission gate adder, and two different varieties of NP CMOS dynamic adders. For each full adder topology, we took a single-bit full adder block and cascaded it into a 16-bit adder. Using Cadence, we simulated stacked-VDD with each of these different topologies. In particular, we aimed to see which of the adder topologies naturally balanced the current coming out of the middle node. We measured their ability to inherently balance current by measuring the deviation of Vmiddle from VDD/2 = 1.1 V.

The input patterns supplied to the top and bottom adders had to be independent to give the circuit better current balancing properties. For measuring the deviation of Vmiddle from VDD/2 = 1.1 V, we want to analyze the worst-case noise, not the worst-case delay. After some preliminary simulations with the Static CMOS and Mirror adder circuits, we observed that the greatest amount of noise resulted when all the inputs changed at the same time, and applied the same reasoning to the rest of the full adder implementations. The Methodology discusses this experimental design choice in further detail.

In general, we claim that the static adders will outperform the other topologies because they have the most transistors.

his intuition for how the adders worked. When implementing the adders, Wenlu sized all the transistors appropriately, based on the principles we learned in class. For the dynamic full adders, the clock signal consisted of 10-nanosecond pulses with a 50/50 duty cycle.
Previous work revealed that if the two blocks of circuitry are symmetric but independent, then the overall circuit is able to balance current better. Based on our preliminary simulations, we can safely claim that when the inputs of FA0 and FA1 swing in opposite directions, the greatest amount of noise occurs at the output, causing the most imbalanced Vmiddle conditions. So Wenlu chose to change all three inputs at once, and did his best to apply different voltage transitions to each. But because there are more than 2 inputs (ABC), and only two changes are possible (1→0 or 0→1), some sort of overlap had to be made. Hence, as Figure 4 shows, Wenlu chose that A0 would go 1→0, B0 would go 0→1, A1 would go 0→1, and B1 would go 1→0. All of the A’s and B’s undergo this transition for all 16 blocks. Wenlu decided to simplify the 3-input vector stimuli into 2-input vectors by also supplying signal A to Cin of the first block, to ensure that the adders were computing completely different sums for the two cycles. All the subsequent Cin’s were hooked up to the appropriate Cout’s. This setup was applied to both the top and bottom block of logic. For dynamic logic, the clock’s phase is offset from the

Thus, the transient output graphs generated and analyzed in the Results portion of this paper show the effect of the aforementioned input patterns on Vmiddle.

2. METHODOLOGY
Using Cadence, we simulated the effect of VDD stacking on a variety of 16-bit adder topologies. Figure 3 shows how each adder’s carry-out signal ripples into the following adder block.

A few notes must be made with respect to the circuits we have used. Without stacking, we used transistors that would operate at a VDD of 1.1 V. With VDD stacking, the two tiers of adders need one 2.2 V voltage source. Without explicit voltage regulation, the middle node of the circuit on the output side is free to float with respect to ground.
In fact, since there are 16 adders in parallel, there are actually 16 independent floating Vmiddle points. Because these adder blocks are in parallel and see the same loads to the left and right, we assume that all 16 of those nodes undergo the same supply noise fluctuations.

Wenlu made many independent decisions about the practical circuitry concerns, as he was in charge of the Cadence simulations. He chose the adder topologies he did based on

Figure 4- Input vectors applied to all 6 circuit topologies. In Cadence, the effect of this input signal on 16 cascading full adder blocks is simulated. The first block’s Cin = A; all the following Cin = Cout from the previous block, as in Figure 3.

(a) Circuit diagram for static CMOS full-adder topology (CFA), with 28 transistors per block. As seen in , .
(b) Circuit diagram for mirror full-adder topology (MFA), with 28 transistors. As seen in , .
(c) Circuit diagram for transmission function adder (TFA), with 16 transistors per block. As seen in .
(d) Circuit diagram for transmission gate adder (TGA), with 14 transistors per block. As seen in .
(e) Circuit diagram for dynamic NP CMOS adder topology (Dyn1), with 22 transistors per block (4 more than what is shown here, to account for the NOT’ed outputs). Each adder block only implemented the highlighted block. As seen in the 1st ed. of our textbook.
(f) Circuit diagram for an adjusted dynamic NP CMOS adder topology (Dyn2), to combat the charge sharing problem using as few transistors as possible. This implementation actually required 17 transistors per block, 2 more than what is shown here, to account for the NOT’ed Sum. As seen in .
Figure 5- The 6 Adder Topologies used. The static adders are (a) the Static CMOS adder (CFA) and (b) the Mirror adder (MFA). The transmission logic adders are (c) the Transmission Function adder (TFA) and (d) the Transmission Gate adder (TGA). The dynamic NP CMOS adders are (e) and (f).

(a) Cadence Schematic of the CFA block.
(b) Cadence Schematic of the MFA block.
(c) Cadence Schematic of the TFA block.
(d) Cadence Schematic of the TGA block.
(e) Cadence Schematic of the Dyn1 block.
(f) Cadence Schematic of the Dyn2 block.
Figure 6- Schematics of the Cadence implementation of the 6 Adder Topologies used. Again, the static adders are (a) and (b), the transmission logic adders are (c) and (d), and the dynamic NP CMOS adders are (e) and (f).

Instead of trying to judge whether static, dynamic, or transmission gate adders are naturally better at balancing current using just one type of each, we implemented two of each type to get a slightly better idea of how these three particular families of adders balance current. The topologies implemented were the CMOS static full adder (CFA), the Mirror Full Adder (MFA), the transmission function full adder (TFA), and the transmission gate adder (TGA). As for dynamic adders, two different NP CMOS dynamic adders were used: one from the book [6; 1st ed.], and the other, taken from a paper, rearranged to minimize charge sharing without increasing the number of transistors. On the previous pages, Figure 5 shows the more readable circuit schematics for all 6 adders. Figure 6 contains the less readable screenshots of the circuit schematic implementations for each of those adders.

Thus, in total, using the input vector patterns listed above, we tested 6 different circuits.

3. SIMULATION RESULTS
After implementing each of these adders, Wenlu double-checked that each adder block correctly calculated the sum and carry-out functions, and they all worked correctly. Transient outputs were generated for ABC going from 101 to 010. When this input pattern is applied, the sum output should transition from 0 to 1, and the carry-out output should transition from 1 to 0: this should occur for all of these logic blocks, regardless of logic style, because this is the definition of a full adder.
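The expected logic values for this stimulus can be checked with a small behavioral model. The sketch below is only a logic-level illustration, not the Cadence testbench: it models a full adder with Boolean operations, chains 16 of them as a ripple-carry adder, and applies the input pattern described in the text (every A bit and every B bit switching together, with the first Cin tied to A). The helper names are hypothetical, and reading the indices 0 and 1 as the two tiers of the stack is an assumption.

```python
N_BITS = 16


def full_adder(a, b, cin):
    """Single-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_adder(a_bits, b_bits, cin):
    """Chain full adders so each carry-out feeds the next carry-in."""
    sums, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sums.append(s)
    return sums, carry


def run_tier(a, b):
    """Drive every A bit with `a`, every B bit with `b`, and the first Cin with A."""
    return ripple_adder([a] * N_BITS, [b] * N_BITS, cin=a)


# Tier 0: A goes 1 -> 0 and B goes 0 -> 1 (ABC = 101 -> 010 at the first block).
tier0_before, tier0_after = run_tier(1, 0), run_tier(0, 1)
# Tier 1 (assumed to be the other tier): the opposite transitions.
tier1_before, tier1_after = run_tier(0, 1), run_tier(1, 0)

print("tier 0 before:", tier0_before)
print("tier 0 after: ", tier0_after)
```

Because the first Cin follows A, the carry that ripples out of each block equals A as well, so in this model every one of the 16 blocks sees the same ABC pattern and all sum and carry outputs switch in the same direction within a tier.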
Figure 7 shows the input and output signals; the graphs do not look the same for every adder. One thing to note in Figure 7 is that for the dynamic adders, although the clock pulse is the same length as the input pulse, they are out of phase by 4 nanoseconds.

The graphs in Figure 8 show the voltage noise at the central node, Vmiddle. Besides the smooth RC curves showing the discharging and charging of internal nodes, other noise exists in different places, and each of these adders exhibited a unique signature. More activity in the bottom level lowers Vmiddle, whereas increased activity in the top level raises Vmiddle above VDD/2. So the position of Vmiddle can prove very telling as to which circuit block is at work, and where to look for the action. The shape of these curves aside, the following analysis focuses on the maximum and minimum points of each adder’s Vmiddle curve, given in Figure 8.

3.1 CMOS Static Adder (CFA)
In Figure 8(a), we observe that the transient response ranges from 1.205 V to 1.050 V. With respect to VDD/2, that is 0.105 V above 1.1 V and 0.050 V below it. Percentage-wise, that is 9.55% above and 4.55% below, on average 7.05%. Some very robust voltage regulation could definitely control this much deviation from VDD/2. It is safe to say this adder did the best.

3.2 Mirror Adder (MFA)
This adder is so symmetric that it can almost literally be folded on itself, so we chose it because we thought it might actually outperform the CFA in current balancing. But with a transient response that went from 1.148 V to 0.985 V (0.048 V above 1.1 V and 0.115 V below VDD/2), it turns out that was not so. Vmiddle ran 4.36% above and 10.45% below VDD/2, but this adder still ranked among the best at an average deviation of about 7.4%.

3.3 Transmission Function Adder (TFA)
The transient response for the TFA ranged from 1.295 V to 1.049 V (0.195 V above and 0.051 V below VDD/2).
This adder is the best of the transmission logic adders we tested, although better transmission logic adders may exist. Unfortunately, as we will see later, even the most robust voltage regulation might have trouble reining in the runaway voltage peaks that deviate 17.7% above and 4.63% below 1.1 V.

3.4 14-Transistor Transmission Gate Adder (TGA)
Spanning 1.365 V to 0.790 V, the transient response of this adder is poor: it goes 0.265 V above and 0.310 V below VDD/2. At 24.1% above and 28.2% below 1.1 V, this adder is beyond the hope of voltage regulation. The low number of transistors may be to blame.

3.5 NP CMOS Dynamic Adder (Dyn1)
For the first dynamic adder, the transient response goes from 1.335 V to 0.7 V. That is 0.235 V above and 0.4 V below VDD/2, or 21.4% above and 36.4% below. On average, this adder did the absolute worst at staying at VDD/2. Charge sharing is definitely a problem.

3.6 Adjusted NP CMOS Dynamic Adder (Dyn2)
The adjusted dynamic adder did better because the voltage only swung from 1.406 V to 0.861 V. Going 0.306 V (27.8%) above 1.1 V and 0.239 V (21.7%) below it, this was pretty good for a dynamic adder: it outperforms the TGA, with only 17 transistors. But making an adder that combats charge sharing may not be worth it, because it still does not rein in Vmiddle to the point that it could be voltage regulated.

4. DISCUSSION
In general, the static CMOS full adders did the best at balancing current. Let’s see why. The following equation quantifies how deviations from VDD/2 can depend on activity factor and capacitances.

The static adders did the best at current balancing because either the pull-up or the pull-down network kept the output node engaged. The static adders gain stability by utilizing their delay in the right way. The transmission gate adders did not have charge sharing issues. The dynamic adders did the worst at current balancing.
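The percentage deviations quoted in Sections 3.1 through 3.6 follow directly from the reported maximum and minimum of each Vmiddle curve. The sketch below recomputes them from those extremes and ranks the adders by average deviation. The voltage values are the ones stated in the text; the function and variable names are hypothetical, and averaging the two percentages is simply the convention the text uses for the CFA.

```python
V_HALF = 1.1  # ideal middle-node voltage, VDD/2 of the 2.2 V stacked supply

# (maximum, minimum) of Vmiddle in volts, as reported for each topology.
EXTREMES = {
    "CFA": (1.205, 1.050),
    "MFA": (1.148, 0.985),
    "TFA": (1.295, 1.049),
    "TGA": (1.365, 0.790),
    "Dyn1": (1.335, 0.700),
    "Dyn2": (1.406, 0.861),
}


def deviation_percent(v_max, v_min, v_half=V_HALF):
    """Return (percent above, percent below, average) relative to v_half."""
    above = (v_max - v_half) / v_half * 100
    below = (v_half - v_min) / v_half * 100
    return above, below, (above + below) / 2


results = {name: deviation_percent(*v) for name, v in EXTREMES.items()}
ranking = sorted(results, key=lambda name: results[name][2])

for name in ranking:
    above, below, avg = results[name]
    print(f"{name:5s} +{above:5.2f}%  -{below:5.2f}%  avg {avg:5.2f}%")
```

Sorting by the average puts the two static adders first and Dyn1 last, in line with the discussion above. Small differences in the last digit can appear depending on whether the percentages are rounded before or after averaging.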
For them, by making the unit fast, we considerably compromise the stability of the middle voltage point. Charge sharing played a significant role in Dyn1. In reality, even the most robust voltage regulators can handle about +10% to -25% deviation from the point of regulation on the input side.

(a) I/O graph for CFA. Notice the noise is exactly mirrored for the top and bottom tiers.
(b) I/O graph for MFA. Notice the noise is exactly mirrored for the top and bottom tiers.
(c) I/O graph for TFA. Notice the smooth RC curves that occur during switching.
(d) I/O graph for TGA. Notice that this adder is very noisy while switching, especially during 0 to 1 transitions.
(e) I/O graph for Dyn1, with the “Precharge” phase labeled. Charge sharing problems are visible in this graph.
(f) I/O graph for Dyn2, with the “Precharge” phase labeled. Charge sharing has been prevented and transitions look fast and nearly perfect.
Figure 7- Input and Output signals for all 6 adders. Red dashed line is A0, orange dashed line is A1, pink dashed line is Cout0, and green dashed line is Cout1, as per Figure 4.

(a) Vmiddle graph for CFA.
(b) Vmiddle graph for MFA.
(c) Vmiddle graph for TFA.
(d) Vmiddle graph for TGA.
(e) Vmiddle graph for Dyn1, with the “Precharge” phase labeled.
(f) Vmiddle graph for Dyn2, with the “Precharge” phase labeled.
Figure 8- Vmiddle signals for all 6 adders, illustrating the deviations from VDD/2 = 1.1 V.

5. CONCLUSION
6. ACKNOWLEDGMENTS
Reducing power supply noise is necessary to prevent worsening the already acute problem of power dissipation in circuitry today. To alleviate the pin limitation problem, a stacked-VDD circuit paradigm has recently been proposed in the literature. To use VDD stacking, a voltage regulator usually needs to maintain the voltages at the appropriate fractions of the supply voltage.
But building an effective voltage regulator poses an additional problem: it is difficult to make a robust one, so we should see what can be done to limit how much this middle node deviates from VDD/2. However, for a circuit designed using this paradigm, a significant amount of power may be wasted if modules are not strictly assigned to different VDD domains. Instead of testing the limits of voltage regulation, for the purpose of this project we used VDD stacking without voltage regulation and kept the middle node floating, to quantify the deviations. In particular, we aimed to see which of the adder topologies naturally balanced the current coming out of the middle node. We found that static adders had the most hope for use in stacked-VDD schemes.

A few concerns fell outside of the scope of our project. We did not explore the worst-case scenario of vastly different top and bottom blocks, to avoid the complication of top/bottom block assignment. We decided to look at neither energy nor performance optimization. We did not scale VTh with VDD, considering that the effective VDD per circuit is still roughly the same. We also did not find enough time to simulate the effect of random signal inputs on the middle node’s voltage stability. We did not know how to verify whether connecting Cin to A was justified: although it was a design choice made by Wenlu, we do not believe it gives the worst-case noise for all 6 topologies.

Further work could investigate each of the aforementioned concerns not addressed by the work presented in this paper. For example, discovering what input vector pattern produces the worst-case noise would be an important step in the right direction. We could cascade two or more 16-bit adders for each topology. This would probably give a more realistic idea of how these circuits balance current overall. Or, we could make this into a 4-digit 16-bit adder if we coordinated the 16-bit adder above with the one below.
This would probably require up or down converters for the pertinent inputs and outputs. Potentially, a particular assignment of different adders to the top and bottom tiers of the circuit could help improve the energy-delay position of the circuit with respect to the Pareto optimal curve. Typical modern microprocessors and DSP chips contain multiple processing units (ALU, multiplier, memory blocks, etc.) for concurrent execution, so something to that effect could also reflect the usage of this technique more realistically.

Wenlu Sun did all the Cadence simulations and made most of the experimental design decisions. Jiyati Verma did the Wiki page and most of the documentation and the slides for the presentation. It is fair to say that both did 50% of the project.

7. REFERENCES
[1] Alioto, M., and Palumbo, G. Analysis and Comparison on Full Adder Block in Submicron Technology. IEEE Transactions on VLSI Systems, 10, 6 (Dec. 2002), 806-823.
[2] Cabe, A. C., Qi, Z., and Stan, M. R. Stacking SRAM Banks for Ultra Low Power Standby Mode Operation. In Design Automation Conference (DAC 2010) (Anaheim, CA, June 13-18, 2010), ACM Press, New York, NY, 2010, 699-704.
[3] Gu, J., and Kim, C. H. Multi-Story power delivery for supply noise reduction and low voltage operation. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED 2005) (San Diego, CA, August 8-10, 2005), ACM Press, New York, NY, 2005, 192-197.
[4] Liu, C. Voltage Regulation of CMOS Stacked Digital Circuits. Technical Report from Department of Electrical Engineering, Eindhoven University of Technology, The Netherlands, August 30, 2010.
[5] Mirzaee, R. F., Moaiyeri, M. H., and Navi, K. High Speed NP-CMOS and Multi-Output Dynamic Full Adder Cells. International Journal of Electrical and Electronics Engineering, 4, 4 (Spring 2010), 304-310.
[6] Rabaey, J. M., Chandrakasan, A., and Nikolic, B. Digital Integrated Circuits: A Design Perspective. 2nd ed. Prentice Hall, Upper Saddle River, NJ, 2003. (pp. 565-567)
[7] Zhan, Y., and Sapatnekar, S. S. Automated Module Assignment in Stacked-VDD designs for high-efficiency power delivery. ACM Journal on Emerging Technologies in Computing Systems, 4, 4 (October 2008), 18:1-18:20.
[8] Zhan, Y., Zhang, T., and Sapatnekar, S. S. Module Assignment for Pin-Limited Designs under the Stacked-VDD Paradigm. In Proceedings of the 2007 IEEE/ACM International Conference on Computer-Aided Design (ICCAD 2007) (San Jose, CA, November 4-8, 2007), ACM Press, New York, NY, 2007, 656-659.
[9] Zimmermann, R., and Fichtner, W. Low-Power Logic Styles: CMOS Versus Pass-Transistor Logic. IEEE Journal of Solid-State Circuits, 32, 7 (July 1997), 1079-1090.