
PICo Digital Signal Processor Design Overview

Kenneth Benson, ECE 363 – Spring 2008, University of Virginia, [email protected]
Benjamin Combs, ECE 363 – Spring 2008, University of Virginia, [email protected]
Nicholas Lumsden, ECE 363 – Spring 2008, University of Virginia, [email protected]

ABSTRACT

Our team designed a functional digital signal processor to compete for a contract with Portable Instruments Company (PICo). The main goal of our DSP was to minimize the metric specified by the PICo Board, Metric = Delay^2 * Area * Power. In this report we describe our design decisions, our arbitrary ALU function, and the final metrics of our DSP. We also explain the reasons for our decisions and the tradeoffs we encountered throughout the process.

1. INTRODUCTION

PICo required a design to support the following 16-bit ALU operations: ADD, SUBTRACT, SHIFT, PASS A, AND, OR, NOP, and an arbitrary special feature. The specifications also required synchronous registers to store the ALU inputs and output. The clock speed of the DSP depends on the worst-case propagation delay of the ALU, which is discussed in later sections. Our team used tools in Cadence to design the DSP and Spectre to simulate our designs. This document demonstrates the correct operation of each function within the ALU and quantifies the metrics for the DSP. After reading our design overview, we hope PICo will be confident in choosing our design for the available contract.

2. ALU FUNCTIONS

2.1 PICo Specified Functions

The ALU consists of seven specified functions and an arbitrary function determined by our design team. Within the ALU, each function is connected to an 8:1 multiplexor with three control bits that allow the user to specify the desired operation. The NOP function is a simple feedback loop from the ALU output to the multiplexor input, which lets the user pass the previous function's output. The PASS A function consists of a transmission gate connected from the A input to the multiplexor input.
The AND and OR operations were designed by implementing 16 bitwise AND and OR gates in parallel. The SHIFT operation is implemented via a multiplexor shifter design using 4:1 multiplexors to shift input A left by the number of bits specified by the last two bits of input B, where 00 shifts 1 bit, 01 shifts 2 bits, 10 shifts 3 bits, and 11 shifts 4 bits. We implemented the ADD and SUBTRACT functions using 16 mirror-carry adders with a mode bit that, if deasserted, adds input A and input B, and if asserted, subtracts B from A. Our team also designed a modified comparator as our arbitrary function to be included in the ALU.

2.2 Comparator Function

We decided to include a modified comparator as our team's arbitrary function based on a few factors. The comparator allows the user to specify whether the ALU should pass the greater or the lesser of the two inputs A and B. Because we only allow the user to pass the larger or smaller value, we included an equality bit that is asserted if the two inputs are equal. The operation works by subtracting input B from input A and examining the most significant bit of the difference. This most significant bit becomes the control for two 2:1 multiplexors, one of which passes the greater value and the other the lesser value of the two inputs. The outputs of these multiplexors are tied to the inputs of another 2:1 multiplexor that is controlled by the user. The user selects the lesser value by setting the control bit to zero and the greater value by setting the control bit to one. If the two numbers are equal, the comparator passes the equal value and sets the equality bit output to one. We have included documentation at the end of this report to show the correct operation of our arbitrary function and the PICo specified functions.

Power Consumption: 11.61 mW
Area: 3899 µm
Delay: 10.3 ns
Total Metric: (10.3 ns)^2 * (11.61 mW) * (3899 µm) = 4.802 * 10^6

3. DESIGN METRICS

3.1 Delay Metric

PICo specified the delay metric to represent the worst-case access delay of the DSP. In our design this delay was dominated by the adder/subtractor circuit. Initially we designed the adder using a ripple-carry design. However, this design produced a significant worst-case delay, and considering the number of bits we were required to add or subtract, we opted for a more efficient design that would take advantage of the generate/propagate/kill functions and their respective relationships. We accomplished this by using a mirror adder design, and added better error detection in the form of an overflow detection bit (V). This reduced the total worst-case delay of the ALU from about 12.5 ns to about 9.5 ns.

The advantages of using the mirror adder are numerous, but the key benefits that guided our decision (as opposed to using the ripple-carry design) were as follows. First, the mirror adder uses 4 fewer transistors per adder cell, which across 16 bits translates to 64 fewer transistors in total. Second, in the mirror adder the NMOS and PMOS chains are completely symmetrical, meaning a maximum of two series transistors is seen in the carry circuitry. Lastly, it is only necessary to size the carry circuitry, meaning the sum circuitry can stay minimum size and maintain functionality.

Figure 1 – Mirror Adder Design [1]

Knowing that this mirror adder block would produce the worst delay, and that delay was our focus in the metric, we wanted to save delay wherever possible in its circuitry. We were able to further reduce the total delay by sizing the mirror adder to keep an optimal fan-out of two, increasing the size of the carry stage to four times the size of the sum stage. Figure 1 shows the optimal design and sizing of our team's mirror adder. The final adder design reduced the ALU delay from 9.5 ns to 7.5 ns, a 21% reduction in delay, which corresponds to a 38% reduction in the delay-squared term of the metric.
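The percentages quoted above can be re-derived from the stated delays. The following is a minimal sketch in Python; the helper name `reduction` and the variable names are illustrative and not part of the original design flow, and only the three delay values come from the report.

```python
def reduction(before: float, after: float) -> float:
    """Fractional reduction when a quantity goes from `before` to `after`."""
    return 1.0 - after / before

# ALU worst-case delays quoted in Section 3.1, in nanoseconds.
ripple_ns = 12.5   # initial ripple-carry design
mirror_ns = 9.5    # mirror adder before sizing
sized_ns = 7.5     # mirror adder after sizing the carry stage

# Effect of switching from ripple-carry to the mirror adder.
first_step = reduction(ripple_ns, mirror_ns)

# Effect of sizing, on the delay itself and on the delay^2 term of the metric.
linear = reduction(mirror_ns, sized_ns)
squared = reduction(mirror_ns ** 2, sized_ns ** 2)

print(f"ripple -> mirror delay reduction: {first_step:.1%}")
print(f"sizing, delay reduction:          {linear:.1%}")
print(f"sizing, delay^2 reduction:        {squared:.1%}")
```

Because the metric weights delay quadratically, a reduction of roughly one fifth in delay removes well over a third of the delay contribution to the metric.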
The cost of this improvement was a 31% increase in area in each mirror adder block, which made the tradeoff a beneficial one. Including the registers, our total delay, which sets our minimum clock period, was 9.693 ns.

3.2 Area Metric

Shifter: 1512 µm
8:1 MUX: 3240 µm
AND: 144 µm
OR: 216 µm
PASS A: 72 µm
Adder: 2172 µm
Level Converter: 1920 µm
Total Area Metric: 11,652 µm

3.3 Power Metric

The active power of the DSP can be calculated by the average power formula, P = α * f * C * Vdd^2. This shows that power is directly proportional to frequency, capacitance, and the square of the supply voltage. Therefore, any improvement in worst-case delay that is used to raise the clock frequency will also increase the power in the system linearly. It is also worth noting that decreasing the supply voltage (Vdd) yields a quadratic decrease in power. Totaling our power metric through the entire DSP, including the registers, yielded a value of 23.99 mW. To arrive at this figure, our team explored two methods to reduce the total power used in our design.

4. METHODS TO REDUCE POWER

4.1 Enable Control

In our original design all operations ran simultaneously for each specified input. The multiplexor simply selected the desired operation and passed its output to the ALU output, although every function produced an output to the multiplexor. The incentive for adding enable controls to our design is to keep unneeded functions from dissipating excess power. Our first enable control design used the second bit of the multiplexor control to gate each individual function. This effectively allowed only half of the functions to consume power at any one time. The bit was connected to the gate of an NMOS transistor that passed Vss when the control bit was asserted. This was inverted and passed to half of the function blocks when the control bit was high. We used a PMOS to pass Vdd to the rest of the function blocks when the control bit was zero.
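The grouping behind this first enable scheme can be sketched behaviorally in Python. The report does not give the opcode encoding or say which end of the select word the "second bit" is counted from, so the `OPCODES` table and the choice of bit 1 below are assumptions made purely for illustration.

```python
# Hypothetical 3-bit opcode assignment; the report does not specify the encoding.
OPCODES = {
    0b000: "NOP",
    0b001: "PASS A",
    0b010: "AND",
    0b011: "OR",
    0b100: "SHIFT",
    0b101: "ADD",
    0b110: "SUBTRACT",
    0b111: "COMPARE",
}

def enabled_blocks(select: int) -> list:
    """Return the function blocks left powered for a 3-bit select value.

    Assumption: the enable signal is bit 1 of the select word, so every
    block whose opcode shares that bit value with `select` stays powered.
    """
    if not 0 <= select <= 0b111:
        raise ValueError("select must be a 3-bit value")
    enable_bit = (select >> 1) & 1
    return [name for code, name in sorted(OPCODES.items())
            if (code >> 1) & 1 == enable_bit]

for select in range(8):
    print(f"{select:03b} ({OPCODES[select]}): {enabled_blocks(select)}")
```

Under any such encoding a single select bit splits the eight multiplexor inputs into two groups of four, which is why the scheme leaves half of the functions powered rather than only the selected one.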
This design decreased the total power but was less stable and required a large area to ensure proper switching speed, as the load capacitance was very large. Our next strategy was to use transmission gates at each input rather than solitary NMOS and PMOS transistors. This would allow for smaller areas and much quicker switching, thus improving stability and delay relative to our previous enable design. However, we were unsuccessful at implementing this design, which led us to look elsewhere for power optimization.

4.2 Level Converter

Our initial design allowed each function to complete its operation as fast as possible, leaving the rest of the clock cycle as "wasted time" in all operations except the worst-case path. From the power formula we know that if we can reduce Vdd, we will reduce the total power quadratically. However, reducing the supply voltage also results in slower transitions from input to output as a result of decreased current at the output. Understanding this, we wanted to minimize the wasted time by lowering the voltage for each function by an amount that would allow the operation to use as much of the clock cycle as possible. This meant optimizing smaller gates such as the AND and OR.

However, driving a gate on a higher supply voltage from a gate on a lower supply voltage burns static power, because the PMOS transistor at the higher potential never fully turns off. To avoid this problem we implemented level converters in four of the seven functions in our ALU. The level converter, shown in Figure 2, uses two cross-coupled PMOS devices that provide positive feedback to convert the lower supply voltage to a higher voltage, which ultimately allowed us to implement our strategy within the same ALU block. Using this power-reducing strategy, we were able to lower the system's total power consumption by 23%, from 31.08 mW to 23.99 mW, using a lowered supply voltage of 1.5 V. The trade-off here is an increase in area of 1920 µm, which is approximately 16.4% of the total area, meaning this was a beneficial modification. The schematic in the supplemental material shows that we optimized the AND, OR, PASS A, and SHIFT functions to operate at this voltage.

The overhead of this design may make this strategy impractical in certain situations. Each level converter requires a different lower supply voltage depending on the initial delay of the corresponding function. Therefore we would have to include a total of five voltage sources in the DSP, one for each level converter and one for the higher Vdd. The PICo board must decide whether the applications of the processor would allow for the additional embedded voltage sources.

Figure 2 – Level Converter Schematic [2]

5. METHODS TO REDUCE DELAY

According to PICo's metric specification, the system's total delay is the most important metric for our team to consider, as it is quadratic relative to the other metrics in the specification. As described above, we focused on sizing certain gates correctly to achieve the best logical effort possible. This was mainly focused on the adder, but it was also necessary to size some of the buffers so that they could drive much larger gates within the circuit. The level converter was another focus of sizing in that it is very delay-sensitive to lower values of VDDL (the reduced supply voltage). It was necessary to increase the size of the NMOS transistors so that they could overpower the PMOS positive feedback and stay within the maximum delay window, yet avoid sizing them so large that the modification no longer paid off. Each level converter, as a result, has NMOS transistors sized to 10 µm each, yielding an area of 30 µm per converter.

Our multiplexor shifter also required a design decision with regard to delay. Two designs were implemented, one using transmission gates similar to a barrel shifter and one using the multiplexor design chosen in our ALU. The original transmission gate shifter used approximately 200 transistors, yielding a size of approximately 700 µm, while the multiplexor shifter yielded a size of approximately 1512 µm. The multiplexor shifter uses about 50% more transistors, and the added area is about 7% of the total area. The tradeoff, however, is that we avoid using multiple decoders to determine shift amounts as well as drivers at the outputs, which would themselves add area and power consumption. Overall, this resulted in a more efficient use of power and a shorter delay time.

6. METRIC RESULTS

Total worst-case propagation delay: 9.67 ns
Test case: adding 0x7FFF and 0x0001

Total power consumption: 23.99 mW
Test case: A alternating between 0xAAAA and 0x5555 every cycle, B constant at 0xAA55, measured over 8 total clock cycles

Total calculated area: 11,652 µm
Test case: addition of all transistor widths

Final Metric Calculation: (9.67 ns)^2 * (11,652 µm) * (23.99 mW) = 2.6139 * 10^7

7. CONCLUSIONS

In conclusion, our team feels that our DSP design is a more than adequate choice for PICo. We deliver high speed, low power consumption, and minimal area for the solid performance and robust functionality our design achieves. Our engineers have verified these metrics in the supplemental materials that follow. As an added bonus we include an arbitrary function that outputs the greater or lesser of two values, as selected by the user, and indicates whether they are equal. This is useful because it can be used for branching in a datapath, which is an important and very common task in modern computing.
This adds versatility when using our DSP in conjunction with other data components, as it will surely be deployed in a processing-intensive environment. We thank you for your consideration of our design in the PICo design project.

8. REFERENCES

[1] http://bwrc.eecs.berkeley.edu/icbook/slides.htm
[2] http://www.freepatentsonline.com/6741230-0-large.jpg
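As a cross-check on the figures reported in Section 6, the PICo metric Delay^2 * Area * Power can be recomputed directly. This is a minimal Python sketch; the function name `pico_metric` is illustrative, and the result is left in the mixed units implied by the report (ns^2 * µm * mW). The second call uses the earlier set of figures quoted at the end of Section 2.2.

```python
def pico_metric(delay_ns: float, area_um: float, power_mw: float) -> float:
    """PICo metric: Delay^2 * Area * Power, in ns^2 * um * mW."""
    return delay_ns ** 2 * area_um * power_mw

# Final figures from Section 6: 9.67 ns, 11,652 um of transistor width, 23.99 mW.
final = pico_metric(9.67, 11_652, 23.99)

# Earlier figures quoted at the end of Section 2.2: 10.3 ns, 3899 um, 11.61 mW.
earlier = pico_metric(10.3, 3_899, 11.61)

print(f"final metric:   {final:.4e}")
print(f"earlier metric: {earlier:.3e}")
```

Both results agree with the values stated in the report to the precision given there.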
