
Low Area CMOS Multiplication Using Booth Algorithm for IEEE 754 Floating Point Standard

Hyeong Seok Yu*, Jun Dong Cho
School of Elect. & Comp. Eng., Sungkyunkwan University, Suwon 440-746

Abstract

To reduce the macro cell area of a multiplier, we propose a new low-area Booth encoder for the IEEE 754 floating point standard. The main idea is to adapt Booth encoding to unsigned number multiplication while eliminating the sign-extension, and to design a new encoder cell using CPL whose faster operation compensates for the delay increase relative to the 4-to-2 compressor. The presented algorithm is verified using VHDL, and SPICE simulation shows that our encoder cell operates faster than the conventional one.

I. Introduction

Multipliers are often used in digital signal processing and play an important role in digital systems. As data widths increase, we need faster and smaller multipliers for use as ASIC macro cells. To meet these requirements of smaller area and faster operation, Booth's algorithm is used in practice [2][3]. This encoding algorithm is suited to 2's complement, signed number multiplication. Booth's algorithm also requires redundant partial product generation, the so-called sign-extension.

In multimedia and signal processing, floating-point operations are required frequently. Floating-point representation has various formats, and one of them is the widely used IEEE 754 floating-point standard [3]. In the case of single precision, this standard has a positive integer number portion.

In this paper we propose a new method of reducing area in Booth's encoding suitable for the IEEE 754 floating point standard. This paper is organized as follows. Section II reviews Booth's encoding algorithm. Section III is devoted to the IEEE 754 floating point standard representation. In Section IV, we present the proposed encoding method and its transistor-level implementation. Sections V and VI present our experimental results and conclusion.

II. Booth's Encoding

The widely used multiplier algorithms are the Braun, Baugh-Wooley and Booth multipliers. Two effective structures are the adder array and the tree. The Booth multiplier is effective in reducing the multiplier area: it reduces the number of partial products to be added by a factor of 2 and needs fewer 4-to-2 compressors when a Wallace tree is used.

First, we review the modified Booth's algorithm [2][4][5]. It is based on 2's complement operation and radix-4. One of the two multiplier inputs, Y, can be written in 2's complement format as:

$$Y = -y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} y_i 2^i \tag{1}$$

Equation 1 shows that Y is a 2's complement number with a range of $2^{n-1}-1$ to $-2^{n-1}$. Booth encoding can be further improved for positive numbers because the IEEE 754 floating-point standard does not require 2's complement numbers.

We can rewrite equation 1 as follows, with $y_{-1} = 0$:

$$Y = \sum_{i=0}^{n/2-1} \left( y_{2i-1} + y_{2i} - 2 y_{2i+1} \right) 2^{2i} \tag{2}$$

Equation 2 signifies that the modified Booth's encoding partitions input Y into groups of 3 bits with a 1-bit overlap and generates the following five signed digits: 2, 1, 0, -1 and -2. Encoding each group reduces the number of partial products by a factor of 2. The operations that the encoded digits perform on multiplier input X are listed in Table 1.

Table 1. Partial Product Selections and Operations

Recoded digit   Bit group (y_{2i+1} y_{2i} y_{2i-1})   Booth's operation on X
0               {0 0 0, 1 1 1}                         Add 0 to PP
+1              {0 0 1, 0 1 0}                         Add X to PP
+2              {0 1 1}                                Shift X left and add to PP
-1              {1 0 1, 1 1 0}                         Add 2's complement of X to PP
-2              {1 0 0}                                Take 2's complement of X, shift left and add to PP

III. IEEE 754 Floating Point Standard

The IEEE 754 standard specifies the representation of floating point numbers in binary code and has two formats, single precision and double precision. In this paper, we target single precision. The single precision format consists of 32 bits in total.

$$(-1)^{S} \times H.M \times 2^{E - \text{excess}} \tag{3}$$

Field widths: S 1 bit, M 23 bits, E 8 bits.

Equation 3 represents the single precision format. S represents the sign in 1 bit, M the mantissa (unsigned integer portion) in 23 bits and E the exponent in 8 bits. H is a hidden bit and is not recorded in the data format. This standard represents various numbers such as normal numbers, 0, positive and negative infinity and numbers nearest to 0 ($< 1 \times 2^{-127}$). In each case, H is assumed to be '0' or '1'. Thus, for multiplying the integer portion of the single precision representation, we consider a positive 24-bit binary number, which is not a 2's complement number.

In this paper, we propose a new Booth's encoding method that complies with these requirements and design it as a transistor-level circuit.

IV. Proposed Encoding Method

To adapt Booth's algorithm to unsigned numbers, we need to compensate the MSB term by adding an additional term to equation 1:

$$Y = -y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} y_i 2^i + 2 y_{n-1} 2^{n-1} \tag{4}$$

In equation 4, the additional third term must be added to the partial products during partial product generation. The cost of adding this new term can be ignored in the compressing stage if we use the tree or array structure described below. We illustrate in Section V how the redundant inputs of the 4-to-2 compressors absorb it.

Next, we must consider the extra generation of partial products, the so-called sign-extension. The sign-extension increases the bit-length of each partial product vector, that is, of input X operated on by a recoded digit. Although Booth encoding reduces the number of rows of generated partial product vectors, the number of partial products increases by up to about 50% because of the sign-extension. This reduces the efficiency of using Booth encoding for low area occupation.

To illustrate how sign-extension arises, we assume two binary numbers A and B, represented by the following polynomials:

$$A = \sum_{i=0}^{n-1} a_i 2^i, \qquad B = \sum_{k=0}^{m-1} b_k 2^k$$

$$-B = 1 - 2^m + \sum_{k=0}^{m-1} \bar{b}_k 2^k \quad \text{(2's complement of } B\text{)} \tag{5}$$

Equation 5 describes the 2's complement of a binary number. Thus a subtraction of binary numbers, Z = A - B, using 2's complement can be written as:

$$Z = A - B = 1 - 2^m + \sum_{i=0}^{n-1} a_i 2^i + \sum_{k=0}^{m-1} \bar{b}_k 2^k \tag{6}$$

In equation 6, the term $2^m$ can be ignored if A and B have the same number of bits (n = m). As shown above, sign-extended partial products must all be extended to the largest length for the partial products to be added correctly. Previously, [2] replaced the sign-extended bits with two overhead bits to eliminate the sign-extension. But this method is not suitable for unsigned number multiplication and is slower, because the two overhead bits in a lower row are determined by those in the upper row.

In this paper, we propose a method that is suitable for unsigned number multiplication and in which each overhead is generated independently and simultaneously. The main idea is as follows. When a subtraction using a 2's complement number must be performed due to a negative recoded digit, the $2^m$ term in equation 6 does not overlap the $2^m$ terms of the other rows' partial products, as shown in Figure 1.

[Figure 1. Partial Products and the $2^m$ Terms]

Figure 1 illustrates the partial product vectors and the possibly generated $2^m$ terms. X represents a partial product bit generated by the Booth's operation in Table 1. U represents a position where a $2^m$ term may be generated. When input X is operated on by a negative recoded digit, one of the two U's in the same row has the value '1' and the other has '0'. As shown in Figure 1, the $2^m$ terms generated in different rows do not overlap. Thus, we consider the generated $2^m$ terms and the other '0' values as a bit vector. This example is also shown in Figure 1.

In conclusion, we can eliminate the bit vector by adding the sign-extension elimination vector, which is an inverted array of the non-overlapping recoded digits -1 and -2 of each row.

[Figure 2. Proposed Encoder Cell. Inputs: A (bit 2i+1), B (bit 2i), C (bit 2i-1); outputs: 2, 1, 0, -1, -2; annotated delays: CPL delay, 1 gate delay.]

Figure 2 shows the proposed encoder cell. Its functionality is given by:

$$+2 = \bar{A} \, B \, C$$
$$+1 = \bar{A} \cdot (B \oplus C)$$
$$0 = \bar{A} \, \bar{B} \, \bar{C} + A \, B \, C$$
$$-1 = A \cdot (B \oplus C)$$
$$-2 = A \, \bar{B} \, \bar{C}$$

The functional expressions above give the relation between encoder inputs and outputs. The proposed encoder cell uses Complementary Pass Transistor Logic (CPL) for easy implementation of the XOR gate, so that its delay is only one pass transistor and one gate. Thus, it is faster than the conventional encoder cell [1,2,5]. The outputs -1 and -2 are used for generating the sign-extension elimination vector.

[Figure 3. Partial Product Arrangement Strategy and Sign-Extension Eliminating Vector Generation. (a) Encoder cell (E) and PP generator (PP) array producing partial product bits pp00,00 to pp12,23; (b) sign-extension elimination vector v23 to v00, formed by inverters (inv) from the encoder outputs.]

Figure 3 illustrates the partial product generation array and the sign-extension elimination vector generation. As shown in Figure 3a, inputs Y and X are fed to the Booth's encoders (depicted as E) and the partial product generators (depicted as PP), respectively. The recoded digits from the Booth's encoder cells are supplied to the partial product generators.

Finally, we have 14 partial product vectors in the case of 24 × 24 multiplication: 13 vectors generated by Booth's encoding and 1 compensating vector for the third term of equation 4. Figure 3b shows how simply the sign-extension elimination vector can be generated. This vector is the array of inverted encoder outputs {-1, -2}; it is generated simultaneously with the partial products and independently of the other encoder outputs. This simple and independent generation scheme differs from the previous sign-extension elimination scheme in [2], whose overheads are generated sequentially and depend on one another. The elimination vector is divided into 2-bit pieces, and each piece is added to its partial product row as overhead.

[Figure 4. Partial Product Generator]

Figure 4 shows the partial product generator cell. It takes as inputs the five outputs of the Booth's encoder cell and two adjacent bits of multiplier input X, and it is similar to the conventional structure. It uses four parallel-connected CPLs for concurrent selection of the partial product and one gate, with gate resizing, for load driving capability and glitch reduction.

V. Experimental Result

Our experiments used Altera™ and HSpice™. First, we verified the proposed Booth's encoding algorithm at the RT level. Figure 5 shows the VHDL function test result, obtained with Altera™, for the array of Figure 3a as multiplier inputs X and Y change. The result is represented in digits. In Figure 5, P00 to P12 indicate the partial product vectors, and Radix 0 to Radix 4 and "ze" indicate the recoded digits, which are also represented in digits.

[Figure 5. VHDL Function Verification using Altera™]

We estimated the delay of the Booth's encoder using HSpice™. Figure 6 shows the simulation result for delay estimation. In Figure 6, the curves of the upper stage indicate the delay of the proposed Booth's encoder and the curves of the lower stage indicate the delay of the conventional encoder proposed by Ohkubo [1]. This result shows that the proposed Booth's encoder using CPL operates faster than the conventional encoder using a gate-only structure. We performed this simulation in the 0.8-µm technology of Hyundai Electronics Industries Co., Ltd.

[Figure 6. Delay-estimation of Booth's Encoder]

Table 2. Summarization of Simulation Results

             Conventional   Proposed   Reduction rate
No. of PP    452            312        23%
No. of TR    12960          6425       49.7%
Delay        2.37 ns        1.3 ns     45%

Table 2 summarizes the comparisons between conventional and proposed Booth encoding. The number of partial products is compared between the conventional Booth's encoder cell using a gate-only structure and the proposed one using CPL. The number of transistors is compared between the first compressing step of the conventional Wallace tree-only structure and the Booth encoding of Figure 3a. For the latter comparison, we used the 4-to-2 compressor proposed by Ohkubo [1]. The delay estimates are the same as the results in Figure 6.

[Figure 7. Comparison of the Number of Used Transistor. Bar chart titled "Area Occupation"; vertical axis: transistor no. (0 to 25000); bar 1, Conventional: 12960, 8820, 2956; bar 2, Proposed: 6425, 8820, 2956.]

Figure 7 compares the number of transistors used by the Booth's encoder-adapted multiplier and by the conventional Wallace tree-only multiplier. We tested the multiplier from Ohkubo [1], which consists of a Wallace tree and a Carry-Lookahead Adder (CLA). Because the 4-to-2 compressor array needs a larger number of transistors than the Booth's encoder array, Booth's encoding reduces the number of transistors in the first compressing stage from 12960 to 6425. In the second and following compressing stages, both structures use the same number of transistors. Finally, the CLA needs 2956 transistors in both structures. In total, the number of transistors in the entire multiplier is reduced by about 26.5% using Booth's encoding and sign-extension elimination.

The extra partial product vector due to compensation of the term $-2^{n-1} y_{n-1}$ in equation (1) may increase the number of compressing steps if a Wallace tree is used. But in the case of 24 × 24-bit multiplication, redundant compressor inputs are used to process this extra partial product vector as follows:

Conventional:   (4,4,4) → (4,2)   → (4) → (2)
Proposed:       (4,4,4) → (4,2,2) → (4) → (2)

VI. Conclusion

In this paper, we proposed a new unsigned multiplying method for reducing area. The main idea is to use Booth's algorithm together with a sign-extension elimination scheme. Our new method can be used effectively for low-area on-chip applications and also for low-power applications, owing to the reduced number of transistors and transition activity.

Reference

[1] N. Ohkubo et al., "A 4.4 ns CMOS 54 × 54b Multiplier Using Pass-Transistor Multiplexor", IEEE Journal of Solid-State Circuits, vol. 30, no. 3, Mar. 1995
[2] A. Bellaouar, M. I. Elmasry, Low-Power Digital VLSI Design: Circuits and Systems, Kluwer Academic Publishers
[3] J. L. Hennessy, D. A. Patterson, Computer Architecture: A Quantitative Approach, second edition, Morgan Kaufmann Publishers, Inc., 1996
[4] L. Ciminiera, P. Montuschi, "Carry-Save Multiplication Schemes Without Final Addition", IEEE Transactions on Computers, vol. 45, no. 9, Sep. 1996
[5] B. S. Cherkauer, E. G. Friedman, "A Hybrid Radix-4/Radix-8 Low Power Signed Multiplier Architecture", IEEE Transactions on Circuits and Systems, vol. 44, no. 8, Aug. 1997
[6] A. Parameswar, H. Hara, T. Sakurai, "A High Speed, Low Power, Swing Restored Pass-Transistor Logic Based Multiply and Accumulate Circuit for Multimedia Applications", Proceedings of Custom Integrated Circuits Conference, pp. 278-281, 1994
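
Worked sketch: radix-4 recoding with MSB compensation

The recoding of equation 2 and the unsigned compensation of equation 4 can be checked numerically with a small behavioural model. The sketch below is not the transistor-level design of the paper; it is a software illustration under stated assumptions. The function names `booth_radix4_digits` and `unsigned_booth_multiply` are hypothetical, and the compensation term is written as y_{n-1} * 2^n * X, which is the third term of equation 4 multiplied by X.

```python
def booth_radix4_digits(y: int, n: int) -> list[int]:
    """Recode an n-bit pattern into n/2 radix-4 digits in {-2,...,2} (equation 2)."""
    if n <= 0 or n % 2:
        raise ValueError("n must be a positive even bit width")
    if not 0 <= y < 2 ** n:
        raise ValueError("y must fit in n bits")
    bits = [(y >> i) & 1 for i in range(n)]
    digits = []
    for i in range(n // 2):
        below = bits[2 * i - 1] if i > 0 else 0  # y_{-1} is defined as 0
        digits.append(below + bits[2 * i] - 2 * bits[2 * i + 1])
    return digits


def unsigned_booth_multiply(x: int, y: int, n: int = 24) -> int:
    """Multiply unsigned x by unsigned n-bit y using equation 4 (illustrative model)."""
    digits = booth_radix4_digits(y, n)
    # Sum of Booth partial products: this equals x times the SIGNED reading of y.
    signed_part = sum(d * x * 4 ** i for i, d in enumerate(digits))
    # Third term of equation 4, times x: 2 * y_{n-1} * 2^(n-1) = y_{n-1} * 2^n.
    msb = (y >> (n - 1)) & 1
    compensation = msb * x * 2 ** n
    return signed_part + compensation
```

For example, `booth_radix4_digits(0b0110, 4)` returns `[-2, 2]`, since -2 + 2 * 4 = 6. With the default width of 24 bits the model produces 12 recoded digits plus the compensating term, which matches the 24-bit mantissa case considered in Section III only at this behavioural level.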

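Worked sketch: truth-table check of the encoder cell

The encoder cell expressions given for Figure 2 can be checked exhaustively against Table 1. The sketch below assumes that A, B and C denote bits y_{2i+1}, y_{2i} and y_{2i-1} respectively, and that the complemented inputs lost in extraction are as written in the code comments; both are assumptions inferred from Table 1, not statements confirmed by the original figure. The names `encoder_cell` and `check_encoder_cell` are hypothetical.

```python
from itertools import product

# Arithmetic value of each one-hot encoder output.
DIGIT_VALUE = {"+2": 2, "+1": 1, "0": 0, "-1": -1, "-2": -2}


def encoder_cell(a: int, b: int, c: int) -> dict[str, int]:
    """Five one-hot outputs for inputs A = y_{2i+1}, B = y_{2i}, C = y_{2i-1} (assumed)."""
    na, nb, nc = 1 - a, 1 - b, 1 - c
    return {
        "+2": na & b & c,                   # not A, B, C
        "+1": na & (b ^ c),                 # not A, and (B xor C)
        "0": (na & nb & nc) | (a & b & c),  # all inputs equal
        "-1": a & (b ^ c),                  # A, and (B xor C)
        "-2": a & nb & nc,                  # A, not B, not C
    }


def check_encoder_cell() -> bool:
    """True if exactly one output is active and it matches C + B - 2A for all inputs."""
    for a, b, c in product((0, 1), repeat=3):
        active = [name for name, bit in encoder_cell(a, b, c).items() if bit]
        if len(active) != 1:
            return False
        if DIGIT_VALUE[active[0]] != c + b - 2 * a:
            return False
    return True
```

Under these assumptions `check_encoder_cell()` returns `True`, meaning each of the eight input groups in Table 1 activates exactly one output and that output equals the digit of equation 2.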