PICo Digital Signal Processor Design Overview
Kenneth Benson
ECE 363 – Spring 2008
University of Virginia
Benjamin Combs
ECE 363 – Spring 2008
University of Virginia
Nicholas Lumsden
ECE 363 – Spring 2008
University of Virginia
ABSTRACT
Our team designed a functional digital signal processor to
compete for a contract with Portable Instruments Company
(PICo). The main goal of our DSP was to minimize the metric
specified by the PICo Board, Metric = Delay^2 * Area * Power. In
this report we describe our design decisions, the arbitrary ALU
function, and the final metrics of our DSP. We also explain the
reasons for our decisions and the tradeoffs we encountered
throughout the process.
1. INTRODUCTION
PICo required a design to support the following 16-bit ALU
operations: ADD, SUBTRACT, SHIFT, PASS A, AND, OR,
NOP, and an arbitrary special feature. The specifications also
required synchronous registers to store the ALU inputs and
output. The clock speed of the DSP depends on the worst-case
propagation delay of the ALU, which is discussed in later
sections. Our team used Cadence tools to design the DSP and
Spectre to simulate our designs. This document demonstrates
the correct functionality of each function within the ALU while
quantifying the metrics for the DSP. After reading our design
overview, we hope PICo is confident in choosing our design for
the available contract.
2. ALU FUNCTIONS
2.1 PICo Specified Functions
The ALU consists of seven functions and an arbitrary function
determined by our design team. Within the ALU each function is
connected to an 8:1 multiplexor with three control bits to allow
the user to specify the desired operation. The NOP function is a
simple feedback loop from the ALU output to the multiplexor
input that allows the user to simply pass the previous function’s
output. The PASS A function consists of a transmission gate
connected from the A-input to the multiplexor input. The AND
and OR operations were designed by implementing 16 AND gates
and 16 OR gates in parallel, one per bit. The SHIFT operation is
implemented via a multiplexor shifter design using 4:1
multiplexors to shift input A left by the number of bits specified
by the two least significant bits of input B, where 00 shifts 1 bit,
01 shifts 2 bits, 10 shifts 3 bits, and 11 shifts 4 bits. We
implemented the ADD and SUBTRACT functions by using 16
mirror-carry adders with a mode bit that, if deasserted, adds
input A and input B, and if asserted, subtracts B from A. Our
team also designed a modified comparator as our arbitrary
function to be included in the ALU.
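The shift encoding and the add/subtract mode bit described above can be summarized in a small behavioral model. This is only a sketch in Python, not the team's circuit: the names `alu_shift` and `alu_addsub` are hypothetical, results are assumed to wrap to 16 bits with zeros shifted in, and subtraction is assumed to be two's complement.

```python
MASK16 = 0xFFFF  # the ALU datapath is 16 bits wide

def alu_shift(a, b):
    """Shift A left by 1 to 4 bits, selected by the two low bits of B."""
    amount = (b & 0b11) + 1  # 00 -> 1 bit, 01 -> 2, 10 -> 3, 11 -> 4
    return (a << amount) & MASK16  # assumption: bits shifted past bit 15 are dropped

def alu_addsub(a, b, mode):
    """mode = 0 adds A and B; mode = 1 subtracts B from A."""
    if mode:
        b = (~b + 1) & MASK16  # assumption: two's complement negation of B
    return (a + b) & MASK16

print(hex(alu_shift(0x0001, 0b11)))        # 0x10
print(hex(alu_addsub(0x0005, 0x0003, 1)))  # 0x2
```

A model like this is mainly useful for generating expected values to compare against simulation waveforms.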
2.2 Comparator Function
We decided to include a modified comparator as our team’s
arbitrary function based on a few factors. The comparator allows
the user to specify whether the ALU should pass the greater or
lesser of the two values, given inputs A and B. Because we only
allow the user to pass the larger or smaller value, we included an
equality bit that is asserted if the two inputs are equal. The
operation works by subtracting input B from input A and
analyzing the most significant bit of the difference. This most
significant bit becomes the control for two 2:1 multiplexors which
pass the greater value and the lesser value of the two inputs. The
outputs of each of these multiplexors are tied to the inputs of
another 2:1 multiplexor that is controlled by the user. The user
can select to pass the lesser value by choosing the control bit to be
zero and the greater value by choosing the control bit to be one. If
the two numbers are equal, the comparator will pass the equal
value and will assert the equality bit. We have
included documentation at the end of this report to demonstrate the
correct functionality of our arbitrary function and the PICo specified
functions.
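The comparator's data flow (subtract, inspect the most significant bit, then select through three 2:1 multiplexors) can be sketched as a behavioral model. The function name `compare_select` is hypothetical. Two assumptions are not stated in the report: the equality bit is derived here from a zero difference, and the most significant bit of A - B is treated as a reliable "A is less than B" flag, which only holds when the subtraction does not overflow.

```python
MASK16 = 0xFFFF  # 16-bit datapath

def compare_select(a, b, select_greater):
    """Return (value, equal_bit) following the comparator description."""
    diff = (a - b) & MASK16
    a_less = (diff >> 15) & 1      # MSB of A - B; assumed to mean A < B
    greater = b if a_less else a   # first 2:1 multiplexor
    lesser = a if a_less else b    # second 2:1 multiplexor
    value = greater if select_greater else lesser  # user-controlled multiplexor
    equal_bit = 1 if diff == 0 else 0  # assumption: equality means zero difference
    return value, equal_bit

print(compare_select(5, 3, 1))  # (5, 0)
print(compare_select(5, 3, 0))  # (3, 0)
print(compare_select(7, 7, 1))  # (7, 1)
```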
Power Consumption: 11.61mW
Area: 3899µm
Delay: 10.3ns
Total Metric: (10.3ns)^2 * (11.61mW) * (3899µm) = 4.802 * 10^6
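The total above can be checked with a few lines of Python. This is a minimal sketch: the function name `pico_metric` is hypothetical, and the units (ns, µm, mW) are simply carried through as plain numbers, as in the report.

```python
def pico_metric(delay_ns, area_um, power_mw):
    """PICo metric: delay squared times area times power."""
    return delay_ns ** 2 * area_um * power_mw

# Values taken from the figures listed above.
print(f"{pico_metric(10.3, 3899, 11.61):.3e}")  # 4.802e+06
```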
3. DESIGN METRICS
3.1 Delay Metric
PICo specified the delay metric to represent the worst case access
delay of the DSP. In our design this delay was dominated by the
adder/subtractor circuit. Initially we designed the adder using a
ripple-carry design. However, this design produced a significant
worst-case delay, and considering the number of bits we were
required to add/subtract, we decided to opt for a more efficient
design that would take advantage of the generate/propagate/kill
functions and their respective relationships. We accomplished
this by using a mirror adder design, and we added better error
detection in the form of an overflow detection bit (V). This
reduced the total worst-case delay of the ALU from about 12.5ns
to about 9.5ns.
The advantages of using the mirror adder are numerous, but some
key benefits that guided our decision (as opposed to using the
ripple-carry design) were as follows: First, the mirror adder uses 4
fewer transistors per bit, which translates to a total decrease of
64 transistors across the 16 bits. Second, in the mirror adder, the
NMOS and PMOS chains are completely symmetrical, meaning a
maximum of two series transistors is seen in the carry circuitry.
Lastly, it is only necessary to size the carry circuitry, meaning the
sum circuitry can stay minimum size and maintain functionality.
Knowing that this mirror adder block would undoubtedly produce
the worst delay, and that delay was our focus in the metric, we
wanted to take the initiative to save delay wherever possible in its
circuitry. We were able to further reduce the total delay by sizing
the mirror adder to keep an optimal fan-out of two by increasing
the size of the carry stage to four times the size of the sum stage.
Figure 1 shows the optimal design and sizing of our team’s mirror
adder.
Figure 1 – Mirror Adder Design [1]
The final adder design reduced the ALU delay from 9.5ns to
7.5ns, a 21% improvement linearly, but a 38% improvement in the
squared delay term of the metric. The cost of this improvement
was a 31% increase in area in each mirror adder block, which made
the tradeoff a beneficial one. Including the registers, our total
delay (which sets our minimum clock period) was 9.693 ns.
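The two percentages quoted for the 9.5ns to 7.5ns reduction follow from the delay ratio taken to the first and second power. The short Python check below is a sketch; the helper name `improvement` is hypothetical.

```python
def improvement(before, after, exponent=1):
    """Fractional reduction of delay**exponent when delay drops from before to after."""
    return 1.0 - (after / before) ** exponent

linear = improvement(9.5, 7.5)                 # delay itself
quadratic = improvement(9.5, 7.5, exponent=2)  # delay squared, as it appears in the metric
print(f"linear: {linear:.1%}, quadratic: {quadratic:.1%}")
# linear: 21.1%, quadratic: 37.7%
```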
3.2 Area Metric
Shifter: 1512µm
8:1 MUX: 3240µm
AND: 144µm
OR: 216µm
PASS A: 72µm
Adder: 2172µm
Level Converter: 1920µm
Total Area Metric: 11,652µm
3.3 Power Metric
The active power of the DSP can be calculated by the average
power formula, P = α*f*C*Vdd^2. This shows that power is
directly proportional to frequency, capacitance, and the square of
the supply voltage. Therefore, any improvement in worst-case
delay that is used to raise the clock frequency will also linearly
increase the power in the system. Also it is necessary to note that
decreasing the supply voltage (Vdd) yields a quadratic decrease in
power. Totaling our power metric through the entire DSP,
including the registers, yielded a value of 23.99mW. To arrive at
this metric, our team used two particular methods to reduce the
total power used in our design.
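The linear and quadratic dependencies in the power formula can be illustrated numerically. The sketch below uses made-up values (activity factor 0.5, 100 MHz, 1 pF, supplies of 2.0 V and 1.0 V) that are not taken from the report, and the function name `dynamic_power` is hypothetical. Only the ratios matter.

```python
def dynamic_power(alpha, freq_hz, cap_farads, vdd):
    """Average switching power: P = alpha * f * C * Vdd^2."""
    return alpha * freq_hz * cap_farads * vdd ** 2

# Hypothetical baseline values, chosen only to show the scaling.
base = dynamic_power(0.5, 100e6, 1e-12, 2.0)
half_vdd = dynamic_power(0.5, 100e6, 1e-12, 1.0)     # supply voltage halved
double_freq = dynamic_power(0.5, 200e6, 1e-12, 2.0)  # clock frequency doubled

print(f"{half_vdd / base:.2f}")     # 0.25
print(f"{double_freq / base:.2f}")  # 2.00
```

Halving the supply cuts power to a quarter, while doubling the clock frequency doubles it, which is why lowering Vdd is the more attractive lever whenever a function has timing slack.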
4. METHODS TO REDUCE POWER
4.1 Enable Control
In our original design all operations were running simultaneously
for each specified input. The multiplexor simply selected the
desired operation and passed its output to the ALU output,
although every function produced an output to the multiplexor.
The purpose of enable controls in our design is to prevent
unneeded functions from dissipating excess power. Our first
enable control design took advantage of the second bit from the
multiplexor control to control the usage of each individual
function. This effectively allowed only half of the functions to
consume power simultaneously. The bit was connected to the
gate of an NMOS transistor that passed Vss when the control bit
was asserted. This was inverted and passed to half of the function
blocks when the control bit was high. We used a PMOS to pass
Vdd to the rest of the function blocks when the control bit was
zero. This design decreased the total power but was less stable
and required a large area to ensure proper switching speed, as the
load capacitance was very large.
Our next strategy was to use transmission gates at each input
rather than solitary NMOS and PMOS transistors. This would
allow for smaller areas and much quicker switching, thus
improving stability and reducing delay relative to our previous
enable design. However, we were unsuccessful at implementing
this design, which led us to look elsewhere for power optimization.
4.2 Level Converter
Our initial design allowed each function to complete its operation
as fast as possible, leaving the rest of the clock cycle as “wasted
time” in all operations except the worst-case path. From the
power formula we know that if we can reduce Vdd, we will
reduce the total power quadratically. However, reducing the
supply voltage also results in slower transitions from input to
output as a result of decreased current at the output.
Understanding this, we wanted to minimize the wasted time by
lowering the voltage for each function by an amount that would
allow the operation to use as much of the clock cycle as possible.
This meant optimizing smaller gates such as the AND and OR.
However, driving a gate supplied by a higher supply voltage by a
gate supplied by a lower supply voltage results in the burning of
static power because the higher potential PMOS transistor never
fully turns off. To avoid this problem we implemented level
converters in four of the seven functions in our ALU. The level
converter, shown in Figure 2, uses two cross-coupled PMOS
devices that provide positive feedback to convert the lower supply
voltage to a higher voltage, which ultimately allowed us to
implement our strategy within the same ALU block. Using this
power reducing strategy, we were able to lower the system’s total
power consumption by 23%, from 31.08mW to 23.99mW, using a
lowered supply voltage of 1.5V. The trade-off here is an increase
in area of 1920µm, which is approximately 16.4% of the total
area, meaning this was a beneficial modification. The schematic
shown in the supplemental material shows that we optimized the
AND, OR, Pass A, and Shift functions to operate at this voltage.
The overhead of this design may cause this strategy to be
impractical in certain situations. Each level converter requires a
different lower supply voltage depending on the initial delay of
the corresponding function. Therefore we would have to include
a total of five voltage sources in the DSP, one for each of the four
level-converted functions and one for the higher Vdd. The PICo
board must decide if the applications of the processor would allow
for the additional embedded voltage sources.
Figure 2 – Level Converter Schematic [2]
5. METHODS TO REDUCE DELAY
According to PICo’s metric specification, the system’s total delay
is the most important metric for our team to consider, as it is
squared while the other metrics in the specification are not. As
discussed earlier, we focused on sizing certain gates correctly to
achieve the best logical effort possible. This was mainly focused
on the adder, but it was also necessary to size some of the buffers
to drive much larger gates within the circuit. The level converter
was another focus of sizing in that it is very delay-sensitive to
lower values of VDDL (the reduced supply voltage). It was
necessary to increase the size of the NMOS transistors in
such a way that they could overpower the PMOS positive
feedback to stay within the maximum delay window, yet avoid
sizing them so large that the gain from the modification is lost.
Each level converter, as a result, has NMOS transistors sized to
10µm each, yielding an area of 30µm per converter.
With regard to our multiplexor shifter, a necessary design
decision was made with regard to delay as well. Two designs
were implemented, one using transmission gates similar to a
barrel shifter and one using the multiplexor design chosen in our
ALU. The original transmission gate shifter used approximately
200 transistors, yielding a size of approximately 700µm, while the
multiplexor shifter yielded a size of approximately 1512µm. This
is about 50% more transistors, and the added width is about 7% of
the total area. However, the delay tradeoff is that we avoid using
multiple decoders for determining shift amounts as well as drivers
at the outputs (which would also increase area and power
consumption). Overall, this resulted in a more efficient use of
power and a shorter delay time.
6. METRIC RESULTS
Total worst-case propagation delay: 9.67ns
Test case: Adding 0x7FFF and 0x0001
Total power consumption: 23.99mW
Test case: A alternating between 0xAAAA and 0x5555
every cycle, B constant at 0xAA55, measured over 8
total clock cycles
Total calculated area: 11,652µm
Test case: Addition of all transistor widths
Final Metric Calculation: (9.67ns)^2 * (11,652µm) * (23.99mW) =
2.6139 * 10^7
7. CONCLUSIONS
In conclusion, our team feels that our DSP design is a more than
adequate choice for PICo. We deliver high speed, low power
consumption, and minimal area for the solid performance and
robust functionality our design achieves. Our engineers
have verified these metrics in the supplemental materials that
follow. As an added bonus we include an
arbitrary function that outputs the greater or lesser of two values,
depending on the selection by the user, and indicates whether they
are equal. This is very useful in that it can be used for branching in a
datapath, which is an important and very common task
in modern computing. This adds versatility when utilizing our
DSP in conjunction with other data components, as it surely will
be implemented in a processing-intensive environment.
We thank you for your consideration of our design in the PICo
design project.
8. REFERENCES
1 - http://bwrc.eecs.berkeley.edu/icbook/slides.htm
2 - http://www.freepatentsonline.com/6741230-0-large.jpg