Combined unsigned and two's complement saturating multipliers

Michael J. Schulte†, Mustafa Gok†, Pablo I. Balzola†, and Robert W. Brocato‡
† EECS Dept., Lehigh University, Bethlehem, PA 18015, USA.
‡ Digital Microelectronics, Sandia National Laboratories, Albuquerque, NM 87185

ABSTRACT

In many digital signal processing and multimedia applications, results that overflow are saturated to the most positive or most negative representable number. This paper presents efficient techniques for performing saturating $n$-bit integer multiplication on unsigned and two's complement numbers. Unlike conventional techniques for saturating multiplication, which compute a $2n$-bit product and then examine the $n$ most significant product bits to determine if overflow has occurred, the techniques presented in this paper compute only the $(n+1)$ least significant bits of the product. Specialized overflow detection units, which operate in parallel with the multiplier, determine if overflow has occurred and the product should be saturated. These techniques are applied to designs for saturating array multipliers that perform either unsigned or two's complement saturating integer multiplication, based on an input control signal. Compared to array multipliers that use conventional methods for saturation, these multipliers have about half as much area and delay.

Keywords: Saturation, overflow detection, unsigned, two's complement, multipliers, computer arithmetic.

1. INTRODUCTION

Saturating arithmetic is used in many digital signal processing and multimedia applications including digital filtering, speech encoding and decoding, red-green-blue color computations, and image shading [1], [2], [3]. Because of its usefulness in these types of applications, support for saturating arithmetic has been added to several instruction set architecture extensions [1], [4], [5], [6]. Saturating arithmetic operations are also supported on most commercial digital signal processors [7].
With saturating multiplication, products that overflow are saturated to the most positive or most negative representable number [7]. For unsigned saturating multiplication, if the product is too large to represent, it is saturated to the largest representable number. With two's complement saturating multiplication, if the magnitude of the product is too large to represent, it is saturated to either the most positive or most negative representable number, depending on its sign. Conventional $n$-bit saturating integer multipliers produce a $2n$-bit product. The $n$ most significant bits of the product are then examined to determine if the product should be saturated.

This paper presents efficient techniques for designing unsigned, two's complement, and combined unsigned and two's complement saturating integer multipliers. These multipliers compute only the $(n+1)$ least significant bits of the product. Specialized overflow detection units, which operate in parallel with the multiplier, determine if the product should be saturated. Since the overflow detection units presented in this paper have linear delay, they are particularly well suited for use with array multipliers [8], [9]. The fundamental equations for overflow detection, however, are independent of the multiplier structure, and can also be applied to tree multipliers [10], [11], [12], Booth-encoded multipliers [13], [14], and multipliers that use other architectures [15]. The overflow detection structures presented in this paper are similar to those presented in [16], except they do not use carry bits from the multiplier and are more easily adapted to designs for combined unsigned and two's complement saturating multiplication.

2. UNSIGNED SATURATING MULTIPLIERS

Figure 1 shows the multiplication matrix for an $n$-bit unsigned integer multiplication. In this figure, the $n$-bit multiplicand $A = a_{n-1} a_{n-2} \ldots a_1 a_0$ is multiplied by the $n$-bit multiplier $B = b_{n-1} b_{n-2} \ldots b_1 b_0$ to produce a $2n$-bit product $P = p_{2n-1} p_{2n-2} \ldots p_1 p_0$.
The values of $A$, $B$, and $P$ are

$$A = \sum_{i=0}^{n-1} a_i 2^i, \qquad B = \sum_{i=0}^{n-1} b_i 2^i, \qquad P = \sum_{i=0}^{2n-1} p_i 2^i \tag{1}$$

[Figure 1. Unsigned multiplication matrix for $P = A \times B$.]

If the $n$ least significant bits of the product are used as the result, overflow occurs and the product should be saturated if $2^n \le P$. The conventional method for detecting this is to compute the entire $2n$-bit product and then detect overflow as

$$V_u = p_{2n-1} + p_{2n-2} + \cdots + p_{n+1} + p_n \tag{2}$$

where $+$ denotes logical OR. If overflow occurs, $V_u = 1$ and the $n$ least significant product bits are all set to ones, which corresponds to $2^n - 1$, the largest representable number. This is accomplished by ORing each of the $n$ least significant product bits, $p_0$ to $p_{n-1}$, with $V_u$. Thus, the bits of the saturated product

$$\langle P \rangle = \langle p_{n-1} \rangle \langle p_{n-2} \rangle \ldots \langle p_1 \rangle \langle p_0 \rangle \tag{3}$$

are computed as

$$\langle p_i \rangle = p_i + V_u \tag{4}$$

for $0 \le i \le n-1$.

A block diagram of an unsigned 8-bit array multiplier that performs conventional saturating multiplication is shown in Figure 2. The cells along each diagonal in the array multiplier correspond to a column in the multiplication matrix. In this diagram, a modified half adder (MHA) cell consists of an AND gate and a half adder (HA). The AND gate generates a partial product bit, and the HA adds the generated partial product bit and a partial product bit from the previous row to produce a sum bit and a carry bit. Similarly, a modified full adder (MFA) consists of an AND gate, which generates a partial product bit, and a full adder (FA) that adds the partial product bit and the sum and carry bits from the previous row. The bottom row of adders produces the $n$ most significant product bits. At the bottom of the array, $(n-1)$ OR gates combine the $n$ most significant product bits to detect overflow.
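As an illustration of Equations (2) to (4), the conventional unsigned method can be modeled behaviorally in software. This is only a sketch of the arithmetic, not of the array hardware; the function name `sat_mul_unsigned_conventional` and its parameters are hypothetical names chosen for this example.

```python
def sat_mul_unsigned_conventional(a: int, b: int, n: int) -> int:
    """Behavioral model of conventional n-bit unsigned saturating multiplication.

    Hypothetical helper: forms the full 2n-bit product, ORs the upper n bits
    to obtain V_u (Eq. 2), then ORs V_u into every low product bit (Eq. 4).
    """
    mask = (1 << n) - 1
    assert 0 <= a <= mask and 0 <= b <= mask, "operands must fit in n bits"

    p = a * b                          # full 2n-bit product
    v_u = 1 if (p >> n) != 0 else 0    # Eq. (2): OR of p_n .. p_{2n-1}

    low = p & mask                     # p_{n-1} .. p_0
    # Eq. (4): <p_i> = p_i OR V_u for every one of the n low bits
    return low | (mask if v_u else 0)


if __name__ == "__main__":
    print(sat_mul_unsigned_conventional(15, 15, 8))   # 225, no overflow
    print(sat_mul_unsigned_conventional(16, 16, 8))   # 255, saturated
```

With $n = 8$, the product $15 \times 15 = 225$ fits and is returned unchanged, while $16 \times 16 = 256$ sets $V_u$ and saturates to $2^8 - 1 = 255$.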
If any of the $n$ most significant product bits is one, then $V_u = 1$ and the $n$ OR gates on the right side of the array set product bits $p_0$ to $p_{n-1}$ to one. An $n$-bit unsigned array multiplier that uses this technique has $n^2$ AND gates, $(2n-1)$ OR gates, $n$ HAs, and $(n^2 - 2n)$ FAs. The worst case delay is approximately equal to the delay through one AND gate, three OR gates, two HAs, and $(2n-4)$ FAs.

[Figure 2. Conventional 8-bit unsigned saturating array multiplier.]

With our proposed technique for unsigned multiplication, only $p_0$ to $p_n$ are computed. Overflow is detected by examining the value of $p_n$ and the total number of leading zeros in the two input operands, $A$ and $B$. This technique works because the number of leading zeros in $A$ and $B$ indicates the magnitude of $P$. If $A$ has $Z_A$ leading zeros and $B$ has $Z_B$ leading zeros, then $A$, $B$, and $P$ are bounded by

$$2^{n-Z_A-1} \le A \le 2^{n-Z_A} - 1 \tag{5}$$

$$2^{n-Z_B-1} \le B \le 2^{n-Z_B} - 1 \tag{6}$$

$$2^{2n-Z_A-Z_B-2} \le P \le 2^{2n-Z_A-Z_B} - 2^{n-Z_A} - 2^{n-Z_B} + 1 \tag{7}$$

Since the left side of (7) provides a lower bound on the product, overflow for unsigned multiplication occurs if

$$2^n \le 2^{2n-Z_A-Z_B-2} \le P \tag{8}$$

or equivalently if

$$Z_A + Z_B \le n - 2 \tag{9}$$

Thus, overflow occurs if $A$ and $B$ together have fewer than $n-1$ leading zeros. The right side of (7) gives the upper bound

$$P \le 2^{2n-Z_A-Z_B} - 2^{n-Z_A} - 2^{n-Z_B} + 1 < 2^{2n-Z_A-Z_B} \tag{10}$$

If $A$ and $B$ together have $n-1$ or more leading zeros (i.e., $n - 1 \le Z_A + Z_B$), then (10) gives

$$P < 2^{2n-Z_A-Z_B} \le 2^{n+1} \tag{11}$$

In this case, $P$ can always be represented as an unsigned integer with $(n+1)$ bits and overflow occurs if and only if $p_n = 1$. Thus, for unsigned multiplication, overflow occurs and the product saturates if and only if $Z_A + Z_B \le n - 2$ or $p_n = 1$.

The condition given in (9) is detected as

$$V_u' = \sum_{i=1}^{n-1} \sum_{j=1}^{i} a_{n-j}\, b_i \tag{12}$$

where the bit summations and bit multiplications correspond to logical ORs and logical ANDs, respectively, and $V_u' = 1$ when (9) is true. Rather than computing $n(n-1)/2$ bit products and then ORing these bit products, a significant hardware reduction is achieved by taking advantage of common terms in Equation (12). With this approach, overflow is detected using the iterative equations

$$o_{i+1} = o_i + a_{n-i} \tag{13}$$

$$v_{i+1} = v_i + o_{i+1}\, b_i \tag{14}$$

for $2 \le i \le n-1$, where $o_i$ is a temporary OR bit and $v_i$ is a temporary overflow bit. Initially, $o_2 = a_{n-1}$ and $v_2 = a_{n-1} b_1$. After $(n-2)$ iterations of Equations (13) and (14), the overflow flag is computed as $V_u = v_n + p_n$, where $v_n$ is equivalent to $V_u'$ from Equation (12). If $V_u = 1$, then $p_0$ to $p_{n-1}$ are set to one. This technique is similar to the technique for unsigned saturating multipliers presented in [16], except that it computes $(n+1)$ product bits and does not use carries into column $n$ when detecting overflow.

An 8-bit unsigned array multiplier that uses our proposed technique is shown in Figure 3. Compared to the conventional saturating array multiplier, the hardware that computes $p_{n+1}$ to $p_{2n-1}$ and $V_u$ is replaced by an unsigned overflow detection unit, the MFAs along the diagonal that produce $p_n$ are replaced by A2X cells, and the HA in the bottom right corner of the array is replaced by an exclusive-or (XOR) gate. The unsigned overflow detection unit consists of one OR gate and $(n-2)$ overflow detection (OVD) cells. Each OVD cell computes $o_{i+1}$ and $v_{i+1}$, based on Equations (13) and (14), using one AND gate and two OR gates. Each A2X cell contains one AND gate, which generates a partial product bit, and two XOR gates, which compute the sum of the generated partial product bit and the sum and carry bits from the previous row. A2X cells are used instead of MFAs, since the carry bits from this diagonal are no longer needed. The XOR gate uses sum and carry bits from the previous row to produce $p_n$.

[Figure 3. Proposed 8-bit unsigned saturating array multiplier.]

An $n$-bit unsigned saturating array multiplier that uses our proposed technique requires $(n^2 + 5n - 6)/2$ AND gates, $(3n-3)$ OR gates, $(2n-3)$ XOR gates, $(n-1)$ HAs, and $(n^2 - 3n + 2)/2$ FAs. Since the OVDs have less delay than the MFAs, they are not on the critical path. The worst case delay of our proposed saturating unsigned array multiplier is approximately equal to the delay through one AND gate, two OR gates, one XOR gate, one HA, and $(n-2)$ FAs. Although our proposed technique for unsigned saturating multipliers has slightly more area and delay than the technique presented in [16], it is more easily adapted to designs for two's complement saturating multipliers and combined unsigned and two's complement saturating multipliers, as discussed in the following sections.

3. TWO'S COMPLEMENT SATURATING MULTIPLIERS

With two's complement integer multiplication, the values of $A$, $B$, and $P$ are

$$A = -a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i, \qquad B = -b_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} b_i 2^i, \qquad P = -p_{2n-1} 2^{2n-1} + \sum_{i=0}^{2n-2} p_i 2^i \tag{15}$$

The most significant bit of a two's complement number, called the sign bit, has a negative weight [17]. Figure 4 shows the multiplication matrix for an $n$-bit two's complement integer multiplication that uses the Complemented Partial Product Word Correction Algorithm [18], [12]. This matrix is identical to the unsigned multiplication matrix, shown in Figure 1, except $(2n-2)$ of the partial products are complemented and ones are added to columns $n$ and $(2n-1)$.

[Figure 4. Two's complement multiplication matrix for $P = A \times B$.]

If the $n$ least significant bits of the product are used as the result, overflow occurs and the product should be saturated when

$$P < -2^{n-1} \quad \text{or} \quad P \ge 2^{n-1} \tag{16}$$

With the conventional method for two's complement multiplication, overflow is detected by testing if $p_{n-1}$ differs from any product bit to the left of it. Thus, overflow is computed as

$$V_t = \hat{p}_{2n-1} + \hat{p}_{2n-2} + \cdots + \hat{p}_{n+1} + \hat{p}_n \tag{17}$$

where $\hat{p}_i = p_i \oplus p_{n-1}$ and $\oplus$ denotes logical XOR. Detecting overflow with this method requires $n$ XOR gates and $(n-1)$ OR gates, after the $2n$-bit product is computed. For two's complement multiplication, the product saturates to $-2^{n-1} = 100\ldots00$ when $s = a_{n-1} \oplus b_{n-1} = 1$ and $V_t = 1$, and it saturates to $2^{n-1} - 1 = 011\ldots11$ when $s = a_{n-1} \oplus b_{n-1} = 0$ and $V_t = 1$. If $V_t = 0$, then the $n$ least significant bits of the product are returned. Thus, the bits of the saturated two's complement product are computed as

$$\langle p_{n-1} \rangle = V_t\, s + \overline{V_t}\, p_{n-1} \tag{18}$$

$$\langle p_i \rangle = V_t\, \overline{s} + \overline{V_t}\, p_i \qquad (0 \le i \le n-2) \tag{19}$$

Equations (18) and (19) are implemented by an $n$-bit 2-to-1 mux that uses $V_t$ as the select signal.
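The conventional two's complement method of Equations (17) to (19) can be sketched as a behavioral model. The helper names `to_signed` and `sat_mul_signed_conventional` are hypothetical, and the model assumes Python integers as a stand-in for the $2n$-bit product register.

```python
def to_signed(x: int, n: int) -> int:
    """Interpret the n-bit pattern x as a two's complement integer."""
    return x - (1 << n) if (x >> (n - 1)) & 1 else x


def sat_mul_signed_conventional(a: int, b: int, n: int) -> int:
    """Behavioral model of conventional n-bit two's complement saturation.

    Hypothetical helper: a and b are ordinary Python integers in
    [-2^(n-1), 2^(n-1) - 1]. The full 2n-bit product is formed first.
    """
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    assert lo <= a <= hi and lo <= b <= hi, "operands must fit in n bits"

    p = (a * b) & ((1 << (2 * n)) - 1)        # 2n-bit two's complement product
    p_nm1 = (p >> (n - 1)) & 1                # p_{n-1}

    # Eq. (17): V_t is the OR of p_i XOR p_{n-1} for i = n .. 2n-1
    v_t = any(((p >> i) & 1) ^ p_nm1 for i in range(n, 2 * n))

    # s = a_{n-1} XOR b_{n-1}, the sign of a nonzero product
    s = (1 if a < 0 else 0) ^ (1 if b < 0 else 0)

    if v_t:
        # Eqs. (18)-(19): 100...00 when s = 1, 011...11 when s = 0
        return lo if s else hi
    return to_signed(p & ((1 << n) - 1), n)


if __name__ == "__main__":
    print(sat_mul_signed_conventional(-128, -128, 8))  # 127
    print(sat_mul_signed_conventional(-128, 2, 8))     # -128
    print(sat_mul_signed_conventional(10, 12, 8))      # 120
```

The mux of Equations (18) and (19) appears here as the final `if`: when `v_t` is set the saturation constant selected by `s` is returned, otherwise the low $n$ product bits pass through.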
Figure 5 shows an 8-bit two's complement array multiplier that uses the Complemented Partial Product Word Correction Algorithm to generate the partial products and the conventional method for saturation. This multiplier is similar to the unsigned multiplier shown in Figure 2, except that $(n-1)$ of the AND gates on the left side of the array are replaced by NAND gates, $(n-1)$ MFAs toward the bottom of the array are replaced by negating MFAs (NMFAs), one of the half adders is replaced by a specialized half adder (SHA), $p_{2n-1}$ is complemented, $n$ XOR gates are used at the bottom of the array, and the $n$ OR gates on the right side of the array are replaced by an $n$-bit 2-to-1 mux. The NMFAs and the NAND gates complement $(2n-2)$ partial product bits, as shown in Figure 4. The SHA takes sum and carry bits from the previous row and adds them with one to produce new sum and carry bits. The SHA and the inverter that complements $p_{2n-1}$ add the ones shown in columns $n$ and $(2n-1)$ of Figure 4. The $n$ XOR gates perform $\hat{p}_i = p_i \oplus p_{n-1}$, for $n \le i \le 2n-1$. The $n$-bit 2-to-1 mux selects the appropriate bits for the saturated product, based on Equations (18) and (19).

[Figure 5. Conventional 8-bit two's complement saturating array multiplier.]

An $n$-bit saturating two's complement array multiplier that uses the conventional technique has $2n$ inverters, $n^2$ AND gates, $(n-1)$ OR gates, $(n+1)$ XOR gates, $n$ HAs, $(n^2 - 2n)$ FAs, and one $n$-bit 2-to-1 mux. When implemented in CMOS technology, $(2n-2)$ of the AND gates and inverters can be combined to form NAND gates. The worst case delay is approximately equal to the delay through one inverter, one AND gate, two OR gates, one XOR gate, one 2-to-1 mux, two HAs, and $(2n-4)$ FAs.

For two's complement saturating multiplication, our proposed technique only computes $p_0$ to $p_n$. Overflow is detected by examining $p_n \oplus p_{n-1}$, along with the number of leading zeros for positive operands and the number of leading ones for negative operands. These leading zeros or leading ones, which have the same value as the sign bit, are referred to as leading bits. For example, 00001101 has four leading bits, and 11101010 has three leading bits. If $A$ has $L_A$ leading bits and $B$ has $L_B$ leading bits, then the magnitudes of $A$, $B$, and $P$ are bounded by

$$2^{n-L_A-1} \le |A| \le 2^{n-L_A} \tag{20}$$

$$2^{n-L_B-1} \le |B| \le 2^{n-L_B} \tag{21}$$

$$2^{2n-L_A-L_B-2} \le |P| \le 2^{2n-L_A-L_B} \tag{22}$$

As noted in (16), overflow occurs and the product should be saturated if $P < -2^{n-1}$ or $P \ge 2^{n-1}$. When $P$ is the most negative representable number, $-2^{n-1}$, saturating the product to the most negative number produces the correct product, even though overflow has not occurred. Thus, the final product can be saturated whenever $|P| \ge 2^{n-1}$. This approach makes it easier to determine the maximum total number of leading bits for which the product can always be saturated. As noted in [19], our technique for two's complement saturation also works for two's complement overflow detection without saturation.

Since the left side of (22) provides a lower bound on the product, the product of the two's complement multiplication should be saturated if

$$2^{n-1} \le 2^{2n-L_A-L_B-2} \le |P| \tag{23}$$

or equivalently if

$$L_A + L_B \le n - 1 \tag{24}$$

Thus, the product should be saturated if $A$ and $B$ together have fewer than $n$ leading bits. The right side of (22) gives the upper bound

$$|P| \le 2^{2n-L_A-L_B} \tag{25}$$

If $A$ and $B$ together have $n$ or more leading bits (i.e., $n \le L_A + L_B$), then (25) gives

$$|P| \le 2^{2n-L_A-L_B} \le 2^n \tag{26}$$

In this case, $P$ can always be represented as a two's complement integer with $(n+1)$ bits, and overflow occurs if and only if $p_n \oplus p_{n-1} = 1$. Thus, for two's complement multiplication, overflow occurs and the product saturates if and only if $L_A + L_B \le n - 1$ or $p_n \oplus p_{n-1} = 1$. A more formal proof of this statement is presented in [19]. Our proposed technique for two's complement overflow detection is similar to the technique presented in [20]; however, our technique uses fewer gates and can be implemented efficiently for any value of $n$.

[Figure 6. Proposed 8-bit two's complement saturating array multiplier.]

A two's complement overflow detection unit is used to determine if the product should be saturated. The condition given in (24) is detected as

$$V_t' = \hat{a}_{n-2}\, \hat{b}_1 + \sum_{i=2}^{n-2} \sum_{j=2}^{i+1} \hat{a}_{n-j}\, \hat{b}_i \tag{27}$$

where $\hat{a}_i = a_i \oplus a_{n-1}$, $\hat{b}_i = b_i \oplus b_{n-1}$, and $V_t' = 1$ when (24) is true.
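The saturation rule stated above ($L_A + L_B \le n - 1$ or $p_n \oplus p_{n-1} = 1$) can be checked with a small behavioral model that looks only at the $(n+1)$ least significant product bits. The names `leading_bits` and `sat_mul_signed_proposed` are hypothetical, and counting leading bits with `int.bit_length` is a software convenience assumed for this sketch, not the gate-level detection unit.

```python
def leading_bits(x: int, n: int) -> int:
    """Count the leading bits of the n-bit pattern x that equal its sign bit."""
    if (x >> (n - 1)) & 1:
        x ^= (1 << n) - 1          # flip so that leading ones become leading zeros
    return n - x.bit_length()


def sat_mul_signed_proposed(a: int, b: int, n: int) -> int:
    """Saturating multiply using leading bits and only the n+1 low product bits.

    Hypothetical helper: a and b are Python integers in [-2^(n-1), 2^(n-1) - 1].
    """
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    assert lo <= a <= hi and lo <= b <= hi, "operands must fit in n bits"

    mask = (1 << n) - 1
    ua, ub = a & mask, b & mask                 # n-bit operand patterns
    low = (a * b) & ((1 << (n + 1)) - 1)        # keep only p_n .. p_0
    p_n = (low >> n) & 1
    p_nm1 = (low >> (n - 1)) & 1

    # Condition (24): fewer than n leading bits in total
    few_leading_bits = leading_bits(ua, n) + leading_bits(ub, n) <= n - 1

    if few_leading_bits or (p_n ^ p_nm1):
        s = (ua >> (n - 1)) ^ (ub >> (n - 1))   # s = a_{n-1} XOR b_{n-1}
        return lo if s else hi

    result = low & mask                         # p_{n-1} .. p_0
    return result - (1 << n) if p_nm1 else result


if __name__ == "__main__":
    print(leading_bits(0b00001101, 8))              # 4
    print(leading_bits(0b11101010, 8))              # 3
    print(sat_mul_signed_proposed(-128, -128, 8))   # 127
```

The two `leading_bits` calls reproduce the examples in the text (four and three leading bits). Comparing the function against a clamp of the exact product for every operand pair at small $n$ is an easy way to gain confidence in the rule.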
As was done for unsigned saturating multiplication, the hardware needed to detect overflow is reduced by taking advantage of common terms in Equation (27). With this approach, overflow is detected using the iterative equations

$$o_{i+1} = o_i + \hat{a}_{n-i} \tag{28}$$

$$v_{i+1} = v_i + o_{i+1}\, \hat{b}_{i-1} \tag{29}$$

for $3 \le i \le n-1$. Initially, $o_3 = \hat{a}_{n-2}$ and $v_3 = \hat{a}_{n-2} \hat{b}_1$. After $(n-3)$ iterations of Equations (28) and (29), the overflow flag is computed as $V_t = v_n + p_n \oplus p_{n-1}$, where $v_n$ is equivalent to $V_t'$ from Equation (27).

An 8-bit saturating two's complement array multiplier that uses our proposed technique is shown in Figure 6. This multiplier is similar to the proposed saturating unsigned multiplier, shown in Figure 3, except the two AND gates in the upper left corner are replaced by NAND gates, the A2X and MFA in the bottom right corner are replaced by a negating A2X (NA2X) and a negating MFA (NMFA), the unsigned overflow detection unit is replaced by a two's complement overflow detection unit, and the $n$ OR gates on the right side of the array are replaced by an $n$-bit 2-to-1 mux. The two's complement overflow detection unit uses $(2n-2)$ XOR gates to compute $\hat{a}_1$ to $\hat{a}_{n-2}$, $\hat{b}_1$ to $\hat{b}_{n-2}$, $s = a_{n-1} \oplus b_{n-1}$, and $V_t'' = p_n \oplus p_{n-1}$, one AND gate to compute $v_3 = \hat{a}_{n-2} \hat{b}_1$, $(n-3)$ OVD cells to compute $o_4$ to $o_{n-1}$ and $v_4$ to $v_n$, and one OR gate to compute $V_t = v_n + V_t''$.

[Figure 7. Combined unsigned and two's complement multiplication matrix for $P = A \times B$.]

An $n$-bit two's complement saturating array multiplier that uses our proposed technique requires five inverters, $(n^2 + 5n - 6)/2$ AND gates, $(2n-5)$ OR gates, $(4n-5)$ XOR gates, $(n-1)$ HAs, $(n^2 - 3n + 2)/2$ FAs, and one $n$-bit 2-to-1 mux.
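The iterative overflow equations can be modeled bit by bit to see that they reproduce the leading-zero condition (9) and the leading-bit condition (24). The sketch below chains the unsigned recurrence of Equations (13) and (14) and the two's complement recurrence of Equations (28) and (29). In the two's complement chain, the step that produces $v_{i+1}$ pairs $o_{i+1}$ with $\hat{b}_{i-1}$, so that the expansion matches Equation (27) term by term; this indexing, along with all function names, is an assumption of the sketch.

```python
def bit(x: int, i: int) -> int:
    return (x >> i) & 1


def v_unsigned(a: int, b: int, n: int) -> int:
    """v_n from the unsigned OVD chain, Eqs. (13)-(14); requires n >= 2."""
    o = bit(a, n - 1)                  # o_2 = a_{n-1}
    v = o & bit(b, 1)                  # v_2 = a_{n-1} b_1
    for i in range(2, n):              # i = 2 .. n-1
        o |= bit(a, n - i)             # o_{i+1} = o_i + a_{n-i}
        v |= o & bit(b, i)             # v_{i+1} = v_i + o_{i+1} b_i
    return v


def v_signed(a: int, b: int, n: int) -> int:
    """v_n from the two's complement OVD chain; a, b are n-bit patterns, n >= 3."""
    sa, sb = bit(a, n - 1), bit(b, n - 1)
    o = bit(a, n - 2) ^ sa             # o_3 = a^_{n-2}
    v = o & (bit(b, 1) ^ sb)           # v_3 = a^_{n-2} b^_1
    for i in range(3, n):              # i = 3 .. n-1
        o |= bit(a, n - i) ^ sa        # o_{i+1} = o_i + a^_{n-i}
        v |= o & (bit(b, i - 1) ^ sb)  # v_{i+1} = v_i + o_{i+1} b^_{i-1}
    return v


def leading_zeros(x: int, n: int) -> int:
    return n - x.bit_length()


def leading_bits(x: int, n: int) -> int:
    y = x ^ ((1 << n) - 1) if bit(x, n - 1) else x
    return n - y.bit_length()


def recurrences_match_counts(n: int) -> bool:
    """Exhaustively compare both chains with conditions (9) and (24)."""
    for a in range(1 << n):
        for b in range(1 << n):
            zeros_ok = leading_zeros(a, n) + leading_zeros(b, n) <= n - 2
            bits_ok = leading_bits(a, n) + leading_bits(b, n) <= n - 1
            if v_unsigned(a, b, n) != int(zeros_ok):
                return False
            if v_signed(a, b, n) != int(bits_ok):
                return False
    return True


if __name__ == "__main__":
    print(all(recurrences_match_counts(n) for n in (3, 4, 5, 6)))
```

Each loop iteration corresponds to one OVD cell (one AND gate and two OR gates), which is why the detection logic grows linearly with $n$ rather than with the $n(n-1)/2$ bit products of the expanded sums.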
The worst case delay of our proposed two's complement saturating array multiplier is approximately equal to the delay through one inverter, one AND gate, two OR gates, two XOR gates, one 2-to-1 mux, one HA, and $(n-2)$ FAs.

4. COMBINED SATURATING MULTIPLIERS

Since the unsigned and two's complement saturating multipliers presented in the previous sections have very similar structures, a single multiplier can be designed that performs either unsigned or two's complement saturating integer multiplication, based on an input control signal. This control signal, $t$, is one when two's complement multiplication is performed and zero when unsigned multiplication is performed. Figure 7 shows the combined unsigned and two's complement multiplication matrix for $P = A \times B$. In this figure, $\widetilde{a_i b_j} = (a_i b_j) \oplus t$. When $t = 1$, the $(2n-2)$ partial product bits that use either $a_{n-1}$ or $b_{n-1}$, but not both, are complemented and ones are added to columns $n$ and $(2n-1)$, which corresponds to the two's complement multiplication matrix shown in Figure 4. When $t = 0$, the partial product bits are left unchanged and no ones are added, which corresponds to the unsigned multiplication matrix shown in Figure 1.

The conventional method for combined unsigned and two's complement saturating multiplication detects overflow by examining $p_n$ to $p_{2n-1}$ when unsigned multiplication is performed, and by examining $\hat{p}_n$ to $\hat{p}_{2n-1}$ when two's complement multiplication is performed. This is accomplished by computing

$$V_c = \tilde{p}_{2n-1} + \tilde{p}_{2n-2} + \cdots + \tilde{p}_{n+1} + \tilde{p}_n \tag{30}$$

where $\tilde{p}_i = p_i \oplus w$, $w = t\, p_{n-1}$, and $V_c$ is one when overflow occurs. For combined multiplication, the product saturates to $111\ldots11$ when $V_c = 1$ and $t = 0$; to $100\ldots00$ when $V_c = 1$, $a_{n-1} \oplus b_{n-1} = 1$, and $t = 1$; and to $011\ldots11$ when $V_c = 1$, $a_{n-1} \oplus b_{n-1} = 0$, and $t = 1$. If $V_c = 0$, then the $n$ least significant bits of the product are returned. Thus, the bits of the saturated product are computed as

$$\langle p_{n-1} \rangle = V_c\, q_1 + \overline{V_c}\, p_{n-1} \tag{31}$$

$$\langle p_i \rangle = V_c\, q_2 + \overline{V_c}\, p_i \qquad (0 \le i \le n-2) \tag{32}$$

where $q_1 = (a_{n-1} \oplus b_{n-1}) + \overline{t}$ and $q_2 = \overline{a_{n-1} \oplus b_{n-1}} + \overline{t}$.

Figure 8 shows the block diagram of a combined 8-bit array multiplier that performs conventional saturating multiplication. This figure is identical to the two's complement multiplier shown in Figure 5, except that $(n-1)$ NAND gates are replaced by conditional AND (CAND) gates, $(n-1)$ NMFAs are replaced by conditional MFAs (CMFAs), the SHA is replaced by an FA, the XAND cell is replaced by an AND-XOR-INV-OR (AXIO) cell, one inverter is changed to an XOR gate, and one AND gate is added to the second row from the bottom of the array. Each CAND gate contains one AND gate and one XOR gate that compute $\widetilde{a_{n-1} b_i} = (a_{n-1} b_i) \oplus t$, which complements $a_{n-1} b_i$ when $t = 1$. Similarly, each CMFA contains an XOR gate that complements $a_i b_{n-1}$ when $t = 1$. The AXIO cell computes $\overline{t}$, $q_1$, and $q_2$. The FA in the bottom right corner of the array adds the $t$ in column $n$, and the XOR gate in the third row from the bottom adds the $t$ in column $2n-1$. The AND gate in the second row from the bottom computes $w = t\, p_{n-1}$.

[Figure 8. Conventional 8-bit combined unsigned and two's complement saturating array multiplier.]
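The conventional combined method can be sketched behaviorally, with the control signal $t$ selecting between the two interpretations of the same bit patterns. The function name `sat_mul_combined_conventional` is hypothetical. The sketch assumes that $q_1$ is the sign-dependent most significant bit of the saturation constant and that $q_2$ fills the remaining bits, both forced to one in unsigned mode; this is one reading of the definitions accompanying Equations (31) and (32), not a confirmed transcription.

```python
def sat_mul_combined_conventional(a: int, b: int, n: int, t: int) -> int:
    """Behavioral model: a, b, and the result are raw n-bit patterns.

    t = 1 selects two's complement multiplication, t = 0 selects unsigned.
    """
    mask = (1 << n) - 1
    assert 0 <= a <= mask and 0 <= b <= mask and t in (0, 1)

    def value(x: int) -> int:
        # Two's complement interpretation only when t = 1
        return x - (1 << n) if t and (x >> (n - 1)) & 1 else x

    p = (value(a) * value(b)) & ((1 << (2 * n)) - 1)   # 2n-bit product pattern

    w = t & ((p >> (n - 1)) & 1)                       # w = t AND p_{n-1}
    # Eq. (30): OR of (p_i XOR w) for i = n .. 2n-1
    v_c = any(((p >> i) & 1) ^ w for i in range(n, 2 * n))

    s = ((a >> (n - 1)) ^ (b >> (n - 1))) & 1          # a_{n-1} XOR b_{n-1}
    q1 = s | (1 - t)                                   # assumed MSB of the constant
    q2 = (1 - s) | (1 - t)                             # assumed lower bits

    if v_c:
        lower = ((1 << (n - 1)) - 1) if q2 else 0
        return (q1 << (n - 1)) | lower                 # Eqs. (31)-(32) with V_c = 1
    return p & mask                                    # Eqs. (31)-(32) with V_c = 0


if __name__ == "__main__":
    print(hex(sat_mul_combined_conventional(0x80, 0x02, 8, 0)))  # 0xff
    print(hex(sat_mul_combined_conventional(0x80, 0x02, 8, 1)))  # 0x80
    print(hex(sat_mul_combined_conventional(0x80, 0x80, 8, 1)))  # 0x7f
```

The same operand patterns 0x80 and 0x02 saturate to 0xFF in unsigned mode ($128 \times 2 = 256$) and to 0x80 in two's complement mode ($-128 \times 2 = -256$), which shows how $t$ changes both the overflow test and the saturation constant.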
An $n$-bit combined saturating array multiplier that uses the conventional technique has two inverters, $(n^2 + 1)$ AND gates, $n$ OR gates, $3n$ XOR gates, $(n-1)$ HAs, $(n^2 - 2n + 1)$ FAs, and one $n$-bit 2-to-1 mux. The worst case delay is approximately equal to the delay through one AND gate, one OR gate, three XOR gates, one 2-to-1 mux, one HA, and $(2n-3)$ FAs.

Our proposed technique for combined saturating multiplication takes advantage of common terms in the overflow detection equations for unsigned and two's complement multiplication. Since the unsigned overflow detection equations use $a_i$ and $b_i$ and the two's complement overflow detection equations use $\hat{a}_i$ and $\hat{b}_i$, these equations are unified by defining

$$\bar{a}_i = a_i \oplus (t\, a_{n-1}) \quad \text{and} \quad \bar{b}_i = b_i \oplus (t\, b_{n-1}) \tag{33}$$

for $1 \le i \le n-2$. This gives $a_i$ and $b_i$ when $t = 0$, and $\hat{a}_i$ and $\hat{b}_i$ when $t = 1$. Removing common terms from Equations (12) and (27) gives

$$V_c' = \sum_{i=2}^{n-2} \sum_{j=2}^{i} \bar{a}_{n-j}\, \bar{b}_i \tag{34}$$

With our technique, a preliminary combined overflow flag is computed using the iterative equations

$$o_{i+1} = o_i + \bar{a}_{n-i} \tag{35}$$

$$v_{i+1} = v_i + o_{i+1}\, \bar{b}_i \tag{36}$$

for $3 \le i \le n-2$. Initially, $o_3 = \bar{a}_{n-2}$ and $v_3 = \bar{a}_{n-2} \bar{b}_2$. After $(n-4)$ iterations of Equations (35) and (36), the preliminary overflow flag is $V_c' = v_{n-1}$.

[Figure 9. Proposed 8-bit combined unsigned and two's complement saturating array multiplier.]

The remaining terms for unsigned overflow detection not included in Equation (34) are

$$r_u = \left( b_{n-1} \sum_{i=1}^{n-2} a_i \right) + \left( a_{n-1} \sum_{i=1}^{n-2} b_i \right) = b_{n-1} (a_1 + o_{n-1}) + a_{n-1}\, r_u' \tag{37}$$

where

$$r_u' = \sum_{i=1}^{n-2} b_i \tag{38}$$

The remaining terms for two's complement overflow detection not included in Equation (34) are

$$r_t = \sum_{i=1}^{n-2} \bar{a}_i\, \bar{b}_{n-i-1} \tag{39}$$

The final overflow flag is computed as

$$V_c = V_c' + (r_t + p_n \oplus p_{n-1})\, t + (b_{n-1} (a_1 + o_{n-1}) + a_{n-1}\, r_u' + p_n)\, \overline{t} \tag{40}$$

An 8-bit combined saturating array multiplier that uses our proposed technique is shown in Figure 9. This multiplier is similar to the proposed saturating two's complement multiplier, shown in Figure 6, except the two NAND gates in the upper left corner are replaced by CAND gates, the NA2X and NMFA in the bottom right corner are replaced by a complementing A2X (CA2X) and a complementing MFA (CMFA), and the two's complement overflow detection unit is replaced by a combined overflow detection unit. The combined overflow detection unit uses $(2n-4)$ XOR gates and two AND gates to compute $\bar{a}_1$ to $\bar{a}_{n-2}$ and $\bar{b}_1$ to $\bar{b}_{n-2}$, an AND gate to compute $v_3$, $(n-4)$ OVD cells to compute $o_4$ to $o_{n-1}$ and $v_4$ to $v_{n-1}$, $(n-2)$ OR gates to compute $r_u'$, an AND gate and $(n-3)$ OR-AND (OA) cells to compute $r_t$, and combined overflow detection logic to compute $V_c$, $q_1$, and $q_2$. The combined overflow detection logic requires two inverters, four AND gates, seven OR gates, and two XOR gates.

An $n$-bit combined saturating array multiplier that uses our proposed technique requires two inverters, $(n^2 + 7n)/2$ AND gates, $(3n-2)$ OR gates, $(4n-1)$ XOR gates, $(n-1)$ HAs, $(n^2 - 3n + 2)/2$ FAs, and one $n$-bit 2-to-1 mux. The worst case delay of our proposed combined saturating array multiplier is approximately equal to the delay through one inverter, two AND gates, two OR gates, three XOR gates, one 2-to-1 mux, one HA, and $(n-2)$ FAs.

Table 1. Component counts for $n$-bit saturating array multipliers.

Multiplier type         INV    AND                 OR        XOR       HA       FA                  Mux
Conventional unsigned   0      $n^2$               $2n-1$    0         $n$      $n^2-2n$            0
Proposed unsigned       0      $(n^2+5n-6)/2$      $3n-3$    $2n-3$    $n-1$    $(n^2-3n+2)/2$      0
Conventional signed     $2n$   $n^2$               $n-1$     $n+1$     $n$      $n^2-2n$            $n$
Proposed signed         5      $(n^2+5n-6)/2$      $2n-5$    $4n-5$    $n-1$    $(n^2-3n+2)/2$      $n$
Conventional combined   2      $n^2+1$             $n$       $3n$      $n-1$    $n^2-2n+1$          $n$
Proposed combined       2      $(n^2+7n)/2$        $3n-2$    $4n-1$    $n-1$    $(n^2-3n+2)/2$      $n$

Table 2. Components on the critical delay path for $n$-bit saturating array multipliers.

Multiplier type         INV    AND    OR    XOR    HA    FA        Mux
Conventional unsigned   0      1      3     0      2     $2n-4$    0
Proposed unsigned       0      1      2     1      1     $n-2$     0
Conventional signed     1      1      2     1      2     $2n-4$    1
Proposed signed         1      1      2     2      1     $n-2$     1
Conventional combined   0      1      1     3      1     $2n-3$    1
Proposed combined       1      2      2     3      1     $n-2$     1

Table 3. Area and delay estimates for 32-bit saturating multipliers.

                    Area in equivalent gates             Delay in nanoseconds
Multiplier type     Conventional   Proposed   Savings    Conventional   Proposed   Savings
Unsigned            11934          6352       46.8%      51.34          26.46      48.5%
Signed              12129          6571       45.8%      52.01          27.41      47.2%
Combined            12265          6720       45.2%      52.93          28.20      46.7%

5. RESULTS AND CONCLUSIONS

Tables 1 and 2 summarize the total number of components and the number of each component on the critical delay path for $n$-bit saturating array multipliers. Table 1 shows that the proposed techniques reduce the number of AND gates and FAs required to implement saturating multipliers by nearly a factor of two for large values of $n$. Since these components dominate the area of array multipliers, this leads to a significant reduction in area. Table 2 shows that the proposed techniques reduce the number of FAs on the critical delay path by about a factor of two, with little or no increase in the number of the other components. Table 3 gives area and delay estimates for 32-bit saturating multipliers.
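To make the Table 1 comparison concrete, the AND gate and FA formulas for the unsigned multipliers can be evaluated at a specific word size. The short sketch below does this for $n = 32$; the function name `unsigned_and_fa_counts` is a hypothetical helper, and only the formulas themselves come from Table 1.

```python
def unsigned_and_fa_counts(n: int):
    """Evaluate the Table 1 AND and FA formulas for the unsigned multipliers."""
    conventional = {"AND": n * n, "FA": n * n - 2 * n}
    proposed = {
        "AND": (n * n + 5 * n - 6) // 2,   # always an even numerator
        "FA": (n * n - 3 * n + 2) // 2,    # equals (n - 1)(n - 2) / 2
    }
    return conventional, proposed


if __name__ == "__main__":
    conv, prop = unsigned_and_fa_counts(32)
    print(conv)   # {'AND': 1024, 'FA': 960}
    print(prop)   # {'AND': 589, 'FA': 465}
    for kind in ("AND", "FA"):
        print(kind, round(prop[kind] / conv[kind], 3))
```

At $n = 32$ the proposed unsigned design needs 465 FAs instead of 960 and 589 AND gates instead of 1024. The FA ratio approaches one half as $n$ grows, since the leading term of the proposed count is $n^2/2$.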
Area and delay estimates were based on data from LSI Logic's 0.6 micron LCA300K gate array library and the Leonardo synthesis tool from Exemplar Logic. The estimates given assume a nominal operating voltage and temperature of 5.0 Volts and 25 °C, respectively. From these estimates, 32-bit saturating multipliers implemented using our proposed techniques have between 45.2% and 46.8% less area and between 46.7% and 48.5% less delay than saturating multipliers that use conventional methods. Modifying our proposed 32-bit unsigned saturating array multiplier so that it performs both unsigned and two's complement saturating multiplication increases the area by only 5.8% and the delay by only 6.6%.

The proposed unsigned, two's complement, and combined saturating array multipliers presented in this paper have about half as much area and delay as saturating array multipliers that use conventional methods. These decreases in area and delay occur because the hardware that computes the $(n-1)$ most significant product bits is replaced by simple overflow detection units. These multipliers should also dissipate significantly less power due to their decreased hardware requirements. The techniques presented in this paper can easily be adapted to designs for multipliers that use other structures or to multipliers that support both saturating and wraparound arithmetic.

ACKNOWLEDGMENTS

This material is based upon work supported by Sandia National Laboratories under Grant Number BF-1029 and the National Science Foundation under Grant Number MIP-9703421. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of Sandia National Laboratories or the National Science Foundation.

REFERENCES

1. A. Peleg and U. Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro 16(4), pp. 42–50, August 1996.
2. F. Mintzer and A. Peled, "A Microprocessor for Signal Processing, the RSP," IBM Journal of Research & Development 26, pp. 413–423, July 1982.
3. N. Yadav, M. J. Schulte, and J. Glossner, "Parallel Saturating Fractional Arithmetic Units," in Proceedings of the Ninth Great Lakes Symposium on VLSI, pp. 214–217, March 1999.
4. K. Diefendorff, P. Dubey, R. Hochsprung, and H. Scales, "AltiVec Extension to PowerPC Accelerates Media Processing," IEEE Micro 20(2), pp. 85–95, 2000.
5. M. Tremblay, "VIS Speeds New Media Processing," IEEE Micro 16(4), pp. 10–20, 1996.
6. R. Lee, "Accelerating Multimedia with Enhanced Processors," IEEE Micro 15(2), pp. 22–32, 1995.
7. P. Lapsley, DSP Processor Fundamentals: Architectures and Features, IEEE Press, 1997.
8. C. R. Baugh and B. A. Wooley, "A Two's Complement Parallel Array Multiplication Algorithm," IEEE Transactions on Computers C-22, pp. 1045–1047, December 1973.
9. K. Z. Pekmestzi, "Multiplexer-Based Array Multipliers," IEEE Transactions on Computers C-48, pp. 15–23, January 1999.
10. L. Dadda, "Some Schemes for Parallel Multipliers," Alta Frequenza 34, pp. 349–356, 1965.
11. C. S. Wallace, "Suggestion for a Fast Multiplier," IEEE Transactions on Electronic Computers EC-13, pp. 14–17, 1964.
12. K. Bickerstaff, M. J. Schulte, and E. E. Swartzlander, Jr., "Parallel Reduced Area Multipliers," Journal of VLSI Signal Processing 9, pp. 181–192, 1995.
13. A. D. Booth, "A Signed Binary Multiplication Technique," Quarterly Journal of Mechanics and Applied Mathematics 4, pp. 236–240, 1951.
14. H. Sam and A. Gupta, "A Generalized Multibit Recoding of Two's Complement Binary Numbers and Its Proof with Application in Multiplier Implementations," IEEE Transactions on Computers 39(8), pp. 1006–1015, 1990.
15. P. J. Song and G. D. Micheli, "Circuit and Architecture Trade-offs for High-Speed Multiplication," IEEE Journal of Solid-State Circuits 26, pp. 1184–1198, September 1991.
16. M. J. Schulte, P. I. Balzola, A. Akkas, and R. W. Brocato, "Integer Multiplication with Overflow Detection or Saturation," IEEE Transactions on Computers 49(7), 2000.
17. I. Koren, Computer Arithmetic and Algorithms, Brookside Court Publishers, 1998.
18. J. A. Gibson and R. W. Gibbard, "Synthesis and Comparison of Two's Complement Parallel Multipliers," IEEE Transactions on Computers C-24, pp. 1020–1027, October 1975.
19. M. Gok, "Integer Multiplication with Overflow Detection or Saturation," Master's thesis, Lehigh University, 19 Memorial Dr. West, Bethlehem, PA, 18015, 2000.
20. Y. Baydatch and R. Hasharon, "Overflow Detection for Integer Multiply Instruction," Patent No. 5,801,978, September 1998.