Download Proceedings Template - WORD - UVA ECE Wiki

Team MUX DSP

Adam Burton, Mark Colombo, David Moore, Daniel Toler
ECE 3663 – Spring 2010
University of Virginia
<adb7d,mjc8m,dmm4ce,dct4w>@virginia.edu

ABSTRACT

In this paper, we describe the function of a DSP designed according to the specifications provided by PICo.

1. INTRODUCTION

The Team MUX DSP is designed to exceed the goals outlined by the PICo specifications. The ALU includes all of the functions requested by PICo and also implements a multiplier. PICo provided a metric that weighs delay, area, and energy usage in evaluating the DSP; we report our results according to this metric and discuss how we optimized for it.

2. DESIGN DESCRIPTION

PICo specified four main blocks which would comprise the DSP: two input registers, an ALU, and an output register. In addition, the control input signals and a carry out signal were available in order to operate the ALU.

Our DSP consists of the four main blocks specified by PICo; in addition, we have placed input buffers between the input registers and the ALU to reduce the fanout and thus improve the latency.

The ALU has an eight-to-one multiplexer that takes the three bit control input and selects the output of one of the eight functions which the ALU can perform. Five functional blocks are connected to the multiplexer in order to fulfill PICo’s requirements: a 16 bit OR gate, a 16 bit AND gate, an adder/subtractor, a shifter, and a multiplier. The adder/subtractor fulfills two functions, and thus is connected to two inputs of the multiplexer. The function of this particular block is described below. The remaining two inputs are used for the NOP and Pass A operations, which merely require moving data from the output of one register to the input of another.

The 16 bit AND gate and the 16 bit OR gate each consist of an array of two-input AND or OR gates.
Because these gates are so simple and do not have any very long paths (unlike the adder), we thought it best to keep them at their minimum size. By doing this we hoped to minimize power and area while not affecting delay, which would be determined by other, more complicated components.

The adder/subtractor we used is a 16 bit ripple-carry mirror adder. When the block receives a subtraction signal, the carry-in is one and the B inputs are inverted. This is equivalent to negating B in two’s complement, and thus subtraction is performed. For addition, the carry-in is zero and both inputs are sent directly to the adder. We chose to combine the adder and subtractor to improve performance significantly on the area and power aspects of the metric.

One matter of interest is the generation of the subtraction signal. Because of the assignment of the control signals as indicated in Table 1, it would have been possible to use the least significant control bit as the subtraction signal. Alternatively, combinational logic can generate the subtraction signal only when 011 is presented as the control. We chose the second option: while it requires slightly more area, we believed it might reduce power in some cases by preventing the adder/subtractor from switching between addition and subtraction while neither operation is selected.

The shifter shifts the contents of register A by an amount determined by the two least significant bits of register B. It consists of 16 small shifter bitslices, each of which contains two stages of pass gate logic. Each stage forwards the signal to a different line depending on whether a specific input bit is high or low, with one input bit activating an additional shift of 1 and the other an additional shift of 2.
These additional shifts are applied on top of the default shift of 1, allowing the value in register A to be shifted up to 4 bits to the left. Inverters act as buffers between the stages. While the pass gate logic reduces output swing, this does not become an issue because the inverters restore the signal to full swing. In addition, the use of pass gates instead of transmission gates, which would still require some sort of buffer, saves significant area.

Table 1. Control Inputs and DSP Outputs

Control   Output
000       NOP
001       Pass A
010       Add
011       Subtract
100       Shift A
101       OR
110       AND
111       Multiply

The multiplier can handle two 16 bit inputs. However, because our ALU only has 16 bit outputs, the values of the inputs must be limited to ensure that the output will be less than 2^15. The upside to this is that the DSP maintains compatibility with 16 bit systems. The multiplier is an array type, because the extra function does not factor into the metric, and because 16 bit words are relatively small, so the added complexity of a faster multiplier is unnecessary.

3. INNOVATION

3.1 Sizing

It was determined during the second design review that the adder/subtractor unit would be the critical path. Therefore, aside from the adder/subtractor, all blocks are designed to have the same pull up and pull down network resistance as a characteristic inverter, in order to save area and power.

To determine the sizing of the adder for optimal delay, we hoped to use logical effort [1]. Computing the exact logical effort result for the adder/subtractor proved challenging, but the calculation made clear that the inputs to the ALU would have to be buffered with a larger buffer to drive all 8 functions, since the widths of most of the functions could not be reduced. We used a buffer that corresponded to a fanout of 4 from the register in order to avoid making the buffer itself have too much capacitance.
This proved to be sufficient, and we chose not to use a larger buffer in order to avoid excess area.

The logical effort result also seemed to indicate that a very gradual tapering of transistor sizes in the adder/subtractor would be beneficial to delay. Because the portions of the adders which produce the sum outputs are not on the critical path, they remained at the characteristic size in order to save power and area. On the other hand, the portions of the adders which produce the carry out are very important to delay, and thus were the ones that needed to be tapered. With this in mind, simulations were performed to find the best possible tapering rate, reducing the delay by about 40% from the untapered design.

3.2 Trade-offs

One of the trade-offs we dealt with was the type of adder to use. We considered a carry look-ahead adder and other versions of the ripple carry adder. We ultimately decided on the mirror adder because it uses a small number of transistors and is faster than a standard ripple carry adder. The balance of speed and small area was important.

In addition, we decided to make use of the inversion property of the adder. Using the inversion property allows us to reduce the number of gates in the critical path, and thus improve the worst-case delay of the adder. This is extremely important, because the adder’s worst-case delay is the most important factor in determining the maximum operating frequency of the circuit.

3.3 Multiplier

The binary multiplier takes in two 16 bit binary numbers and multiplies A by B. The multiplier gives a 16 bit output, which means that the most significant bits of larger products will not be available at the output. Nevertheless, we chose to create a 16 bit multiplier rather than an 8 bit multiplier in order to allow the user of the DSP to perform operations such as the multiplication of a 12 bit number by 3.
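The output truncation described for the multiplier can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the netlist used in the design: it treats the operands as unsigned and assumes the output is the low 16 bits of the full product. The names MASK16, LIMIT, mul16, and product_fits are hypothetical and chosen for illustration.

```python
MASK16 = 0xFFFF   # assumed: output is the low 16 bits of the product
LIMIT = 1 << 15   # product bound of 2^15 quoted in the design description


def mul16(a: int, b: int) -> int:
    """Return the low 16 bits of a * b for 16 bit unsigned operands."""
    if not (0 <= a <= MASK16 and 0 <= b <= MASK16):
        raise ValueError("operands must be 16 bit unsigned values")
    return (a * b) & MASK16


def product_fits(a: int, b: int) -> bool:
    """True when the full product stays below the 2^15 bound."""
    return a * b < LIMIT


if __name__ == "__main__":
    # Largest 12 bit number times 3: the product fits, so nothing is lost.
    print(hex(mul16(0x0FFF, 3)), product_fits(0x0FFF, 3))          # 0x2ffd True
    # 256 * 256 = 0x10000: the upper bits are dropped, leaving 0.
    print(hex(mul16(0x0100, 0x0100)), product_fits(0x0100, 0x0100))  # 0x0 False
```

In this model the 12 bit by 3 case mentioned above always fits, since 4095 * 3 = 12285 is below 2^15, whereas an 8 bit multiplier could not accept the 12 bit operand at all.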
3.4 Other Factors

In addition to the innovations described above, PICo should note our choice of supply voltage. Using a supply of 5V, we were able to improve delay over what would be possible with lower supply voltages. The cost of increased supply voltage is increased power; however, PICo’s metric for evaluating the circuit has a quadratic dependence on delay, while having only linear dependence on power and area. Thus we decided it would be best to choose a supply which would improve the most important aspect of the metric.

4. RESULTS

In this section we present the components of the metric and our final value for the metric, and also discuss the performance of our arbitrary function, the multiplier, which is not included in the computation of the metric.

4.1 Metric

The first and most heavily weighted component of the metric is delay. As discussed in section 3.1, the sizing of the adder/subtractor and the use of buffers between the registers and the large capacitance of the eight functions proved to be crucial to reducing delay. To analyze the worst case delay, we used the control input for addition and added A=0x0000 to B=0x7FFF, producing 0x7FFF at the output. We then changed A to 0x0001, producing an output of 0x8000, a change which requires the carry to propagate from the least significant one bit adder to the most significant one bit adder. It proved possible to have this change occur at the output with a period as short as 7 ns, corresponding to an operating frequency of 143 MHz.

The next component of the metric is area, which was computed by adding the widths of all the transistors in the device. After reducing the area as much as possible while still maintaining speed and functionality, we computed a final area of 4221 µm.

PICo specified a routine to be used to measure average energy consumption of the DSP. This routine consisted of cycling through all the functions of the ALU while alternating the A and B inputs.
By calculating the instantaneous power drawn by the circuit and integrating over the period of operation, we arrived at an average energy usage of 2.34 nJ.

The metrics discussed above are outlined in the table below, and the final value of the metric is given.

Table 2. Metric Components and Metric Final Value

Component          Value
Delay, D           7*10^-9 s
Area, A            4.221*10^-3 m
Energy, E          2.3426*10^-9 J
Metric = D^2*A*E   4.846*10^-28 s^2*m*J

4.2 Multiplier Performance

While the design of our multiplier is focused more on offering superior functionality than superior performance, in this section we present some results from the multiplier circuit. The multiplier netlist was created by writing a BASIC program that can create netlists for an N-bit multiplier of arbitrary N. A worst-case delay of _______ for the multiplier occurs for the following transition: A=1111111111111111, B=0000000000000001 transitions to A=1111111111111111, B=1111111111111111. This proved to be difficult to simulate due to the complexity of the circuit and the speed of the hardware being used.

5. CONCLUSION

This paper has examined in detail the design and performance of a DSP designed to PICo’s specifications. We believe that the Team MUX DSP is the best choice for PICo, due to its superior delay, area, and energy usage performance. Our thoughtful design and implementation have led to a low score on the metric and a clever product that exceeds the requirements. In addition, the 16 bit binary multiplier is a useful extra function that most designs will lack. Since multiplication is such a common function, any design which lacks it will present PICo with a severe handicap. For these reasons, the DSP designed by Team MUX should be chosen.

6. REFERENCES

[1] I. Sutherland, B. Sproull, and D. Harris, Logical Effort, Morgan Kaufmann, 1999.
