Low Area CMOS Multiplication Using Booth Algorithm
for IEEE 754 Floating Point Standard
Hyeong Seok Yu*, Jun Dong Cho
School of Elect. & Comp. Eng. Sungkyunkwan University, Suwon 440-746
To reduce the macro cell area of a multiplier, we propose a new low-area Booth encoder for the IEEE 754 floating point standard. The main idea is to adapt Booth encoding to unsigned number multiplication by eliminating the sign-extension, and to design a new encoder cell using CPL for faster operation, which compensates for the delay increase compared to the 4-to-2 compressor. The presented algorithm is verified using VHDL, and SPICE simulation shows that our encoder cell operates faster than the conventional one.
I. Introduction
Multipliers are often used in digital signal processing and play an important role in digital systems. As the data increase, we need faster and smaller multipliers to be used as ASIC macro cells. To meet these requirements of smaller area occupation and faster operation, Booth’s algorithm is used in practice [2][3]. This encoding algorithm is suitable for 2’s complement, signed number multiplication. Booth’s algorithm also requires redundant partial product generation, the so-called sign-extension.
In multimedia and signal processing, floating-point data operations are required frequently. The floating-point representation has various formats, and one of them is the widely used IEEE 754 floating-point standard [3]. In the case of single precision, this standard has a positive integer number portion.
In this paper we propose a new method of reducing area in Booth’s encoding, suitable for the IEEE 754 floating point standard.
This paper is organized as follows. Section II reviews Booth’s encoding algorithm. Section III is devoted to the IEEE 754 floating point standard representation. In Section IV, we present the proposed encoding method and its transistor-level implementation. Sections V and VI present our experimental results and conclusion.
II. Booth’s Encoding
The widely used multiplier algorithms are the Braun, Baugh-Wooley and Booth multipliers. Two effective structures are the adder array and the tree structure. The Booth multiplier is effective in reducing the multiplier area. That is, it reduces the number of partial products to be added by a factor of 2 and needs fewer 4-to-2 compressors when a Wallace tree is used.
First, we review the modified Booth’s algorithm [2][4][5]. It is based on the 2’s complement operation and radix-4. One of the two multiplier inputs, Y, can be written in 2’s complement format as:
$$Y = -y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} y_i 2^i \tag{1}$$
Equation 1 shows that Y is a 2’s complement number with a range of 2^{n-1}-1 to -2^{n-1}. Booth encoding can be further improved for positive numbers because the IEEE 754 floating-point standard does not require 2’s complement numbers.
We can rewrite equation 1 as follows, where y_{-1} = 0:
$$Y = \sum_{i=0}^{n/2-1} \left( y_{2i-1} + y_{2i} - 2y_{2i+1} \right) \cdot 2^{2i} \tag{2}$$
Equation 2 signifies that the modified Booth’s encoding partitions input Y into groups of 3 bits with a 1-bit overlap and generates the following five signed digits: 2, 1, 0, -1 and -2. Encoding each group reduces the number of partial products by a factor of 2.
The operations performed on multiplier input X for each encoded digit are illustrated in Table 1.
Table 1. Partial Product Selections and Operations
Recoded digit   Booth’s operation on X                      Bit group (y_{2i+1} y_{2i} y_{2i-1})
 0              Add 0 to PP                                 {0 0 0, 1 1 1}
+1              Add X to PP                                 {0 0 1, 0 1 0}
+2              Shift X left & add to PP                    {0 1 1}
-1              Add 2’s complement of X to PP               {1 0 1, 1 1 0}
-2              2’s complement of X & shift-add to PP       {1 0 0}
III. IEEE 754 Floating Point Standard
The IEEE 754 standard provides the representation of floating point numbers in binary code and has two kinds of format, single precision and double precision. In this paper, we target single precision. The single precision format consists of 32 bits in total.
$$(-1)^{S} \times H.M \times 2^{E - \mathrm{excess}} \tag{3}$$
S: 1b    M: 23b    E: 8b
Equation 3 represents the single precision format. S represents the sign in 1 bit, M the mantissa (unsigned integer portion) in 23 bits, and E the exponent in 8 bits. H is a hidden bit and is not recorded in the data format. This standard represents various numbers such as normal numbers, 0, positive and negative infinity, and numbers nearest to ‘0’ (< 1 × 2^{-127}). In each case, H is assumed to be ‘0’ or ‘1’. Thus, for multiplying the integer portion of the single precision representation, we consider a positive 24-bit binary number, which is not a 2’s complement number.
In this paper, we propose a new Booth’s encoding method that complies with these requirements and is designed as a transistor-level circuit.
IV. Proposed Encoding Method
To adapt Booth’s algorithm to unsigned numbers, we need to compensate the MSB term by adding an additional term to equation 1:
$$Y = -y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} y_i 2^i + 2 y_{n-1} 2^{n-1} \tag{4}$$
In equation 4, the third, additional term must be added together with the partial products in the partial product generation. This defect of adding a new term can be ignored in the compressing stage if we use the following tree or array structure. We will illustrate in a following section how the redundant inputs of the 4-to-2 compressor remove this defect.
Next, we must consider the extra generation of partial products, the so-called sign-extension. The sign-extension increases the bit-length of the partial product vector, that is, of input X operated on by a recoded digit. Although Booth encoding reduces the number of rows of generated partial product vectors, the number of partial products is increased by up to about 50% because of the sign-extension. This also reduces the efficiency of using Booth encoding for low area occupation.
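The recoding of equation 2 and the MSB compensation of equation 4 can be sketched in a few lines of Python. This is only a behavioral illustration under stated assumptions, not the authors' circuit: the function names `significand24`, `booth_radix4_digits` and `unsigned_booth_multiply` are hypothetical, the hidden bit H is assumed to be 1 for normal numbers and 0 when the exponent field is zero, and infinities and NaN are not handled.

```python
import struct


def significand24(value):
    """Return the 24-bit unsigned significand H.M of a float32 value."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    # Assumption: H = 1 for normal numbers, H = 0 when the exponent field is 0.
    hidden = 0 if exponent == 0 else 1
    return (hidden << 23) | mantissa


def booth_radix4_digits(y, n):
    """Recode the n-bit pattern y (n even) into n/2 digits in {-2, ..., +2}."""
    digits = []
    below = 0  # y_{-1} is taken as 0
    for i in range(n // 2):
        low = (y >> (2 * i)) & 1       # y_{2i}
        high = (y >> (2 * i + 1)) & 1  # y_{2i+1}
        digits.append(below + low - 2 * high)  # digit of equation 2
        below = high
    return digits


def unsigned_booth_multiply(x, y, n=24):
    """Multiply unsigned n-bit x and y: Booth rows plus the MSB compensation."""
    product = 0
    for i, digit in enumerate(booth_radix4_digits(y, n)):
        product += digit * x * 4 ** i      # one partial product row
    msb = (y >> (n - 1)) & 1
    product += 2 * msb * 2 ** (n - 1) * x  # third term of equation 4
    return product


# Usage: multiply the significands of two single precision values.
x = significand24(1.5)
y = significand24(-2.75)
assert unsigned_booth_multiply(x, y) == x * y
```

Without the compensation term, the recoded digits would reproduce the signed interpretation of Y, so any operand with its MSB set (every normal significand, since H = 1) would give a wrong product.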
To illustrate sign-extension generation, we assume two binary numbers A and B. Both A and B are represented by the following polynomials:
$$A = \sum_{i=0}^{n-1} a_i 2^i, \qquad B = \sum_{k=0}^{m-1} b_k 2^k$$
$$\bar{B} + 1 = 2^m - \sum_{k=0}^{m-1} b_k 2^k \quad \text{(2’s complement of B)} \tag{5}$$
Equation 5 describes the 2’s complement of a binary number. Thus a subtraction of binary numbers, Z = A - B, using the 2’s complement, can be written as:
$$Z = A + \bar{B} + 1 = 2^m + \sum_{i=0}^{n-1} a_i 2^i - \sum_{k=0}^{m-1} b_k 2^k \tag{6}$$
In equation 6, the first term, 2^m, can be ignored if A and B have the same number of bits (n = m). As shown above, sign-extended partial products must all be extended to the length of the longest one for the partial products to be added correctly. Previously, [2] replaced the sign-extended bits with two overhead bits to eliminate the sign-extension. But this method is not suitable for unsigned number multiplication and is slower, because the two overhead bits in the lower row are determined by the one in the upper row.
In this paper, we propose a method that is suitable for unsigned number multiplication and in which each overhead is generated independently and simultaneously. The main idea is as follows. When a subtraction using a 2’s complement number must be performed due to a negative recoded digit, the 2^m term in equation 6 is not overlapped by the 2^m terms of the other rows’ partial products, as shown in figure 1.
Figure 1. Partial Products and the 2^m Terms
Figure 1 illustrates the partial product vectors and the possibly generated 2^m terms. X represents a partial product generated by Booth’s operation in Table 1. U represents a position where a 2^m term may be generated. When input X is operated on by a negative recoded digit, one of the two U’s in the same row has the value ‘1’ and the other has ‘0’. As shown in Figure 1, the 2^m terms generated in different rows do not overlap. Thus, we consider the generated 2^m terms and the other ‘0’ values as a bit vector. This example is also shown in Figure 1.
In conclusion, we can eliminate the bit vector by adding the sign-extension elimination vector, which is an inverted array of the non-overlapped recoded digits -1, -2 in one row.
Figure 2. Proposed Encoder Cell (inputs A, B, C at bit positions 2i+1, 2i, 2i-1; outputs 2, 1, 0, -1, -2; annotations "1 CPL delay" and "1 gate delay")
Figure 2 shows the proposed encoder cell. With A, B and C denoting the inputs y_{2i+1}, y_{2i} and y_{2i-1}, the functionality in figure 2 can be derived as:
$$+2 = \bar{A} B C$$
$$+1 = \bar{A} (B \oplus C)$$
$$0 = \bar{A} \bar{B} \bar{C} + A B C$$
$$-1 = A (B \oplus C)$$
$$-2 = A \bar{B} \bar{C}$$
The above functional expressions indicate the relation between encoder input and output. The proposed encoder
cell uses Complementary Pass Transistor Logic (CPL) for
easy implementation of the XOR gate, so that it has only one transistor delay and one gate delay. Thus, it is faster than the conventional encoder cell [1,2,5]. The outputs -1, -2 are used for generating the sign-extension-eliminating vector.
Figure 3. Partial Product Arrangement Strategy and Sign-Extension Eliminating Vector Generation: (a) encoder cell & PP generator array, (b) sign-extension elimination vector
Figure 3 illustrates the partial product generation array and the sign-extension elimination vector generation. As shown in figure 3a, inputs Y and X are fed to Booth’s encoder (depicted as E) and the partial product generator (depicted as PP), respectively. The recoded digits from Booth’s encoder cells are supplied to the partial product generators. Finally, we have 14 partial product vectors in the case of 24×24 multiplication: 13 vectors generated by Booth’s encoding and 1 compensating vector for the third term described in equation 4. Figure 3b describes how simply the sign-extension-elimination vector can be generated. This vector is the array of inverted encoder outputs, {-1, -2}, and is generated simultaneously with the partial products and independently of the other encoder outputs. This simple and independent generation scheme differs from the previous sign extension elimination scheme in [2], whose overheads are generated sequentially and depend on each other. This elimination vector is divided into 2-bit pieces, which are added to each partial product row as overhead.
Figure 4. Partial Product Generator
Figure 4 shows the partial product generator cell. Its inputs are the five outputs of Booth’s encoder cell and two adjacent bits of multiplier input X, and it is similar to the conventional structure. It also uses four parallel-connected CPLs for concurrent selection of the partial product, and one gate for load driving capability and glitch reduction using gate resizing.
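The encoder cell of Figure 2 and the selection performed by the partial product generator of Figure 4 can be modelled behaviorally to check them against Table 1. The sketch below is an assumption-laden illustration rather than the transistor-level design: it takes A, B, C to be the bits y_{2i+1}, y_{2i}, y_{2i-1}, uses the Boolean expressions as reconstructed from Table 1, and the names `encoder_cell` and `select_partial_product` are hypothetical.

```python
from itertools import product

# Numeric value of each one-hot encoder output.
DIGIT_VALUE = {"+2": 2, "+1": 1, "0": 0, "-1": -1, "-2": -2}


def encoder_cell(a, b, c):
    """One-hot recoded-digit outputs for the bit group (a, b, c).

    Assumption: a = y[2i+1], b = y[2i], c = y[2i-1].
    """
    na, nb, nc = 1 - a, 1 - b, 1 - c
    return {
        "+2": na & b & c,
        "+1": na & (b ^ c),
        "0": (na & nb & nc) | (a & b & c),
        "-1": a & (b ^ c),
        "-2": a & nb & nc,
    }


def select_partial_product(outputs, x):
    """Behavioral stand-in for the PP generator: pick 0, X, 2X, -X or -2X."""
    return sum(DIGIT_VALUE[name] * x for name, active in outputs.items() if active)


# Check all eight bit groups: exactly one output is active, and its value
# equals the digit y[2i-1] + y[2i] - 2*y[2i+1] of equation 2.
for a, b, c in product((0, 1), repeat=3):
    outputs = encoder_cell(a, b, c)
    assert sum(outputs.values()) == 1
    assert select_partial_product(outputs, 1) == c + b - 2 * a
```

Because the five outputs are mutually exclusive, the two negative outputs can be tapped directly, which is what makes an independently generated elimination vector possible in this model.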
V. Experimental Result
Our experiments used Altera™ and HSpice™. First, we verified the proposed Booth’s encoding algorithm at the RT-Level. Figure 5 shows the VHDL function test result using Altera™ for the array of Figure 3a, according to the changes of multiplier inputs X and Y. This result is represented in digits. In figure 5, P00~P12 indicate partial product vectors, and Radix 0~Radix 4 and “ze” indicate recoded digits, which are also represented in digits.
Figure 5. VHDL Function Verification using Altera™
We estimated the delay of Booth’s encoder using HSpice™. Figure 6 shows the simulation result for delay estimation. In figure 6, the curves of the upper stage indicate the proposed Booth’s encoder delay and the curves of the lower stage indicate the delay of the conventional encoder proposed by Ohkubo [1]. This result shows that the proposed Booth’s encoder using CPL operates faster than the conventional encoder using a gate-only structure. We performed this simulation in the 0.8-µm technology of Hyundai Electronics Industries Co., Ltd.
Figure 6. Delay-estimation of Booth’s Encoder
Table 2. Summarization of Simulation Results
                 No. of PP   No. of TR   Delay
Conventional     452         12960       2.37ns
Proposed         312         6425        1.3ns
Reduction rate   23%         49.7%       45%
Table 2 summarizes the comparisons between conventional and proposed Booth encoding. The number of partial products is compared between the conventional Booth’s encoder cell using a gate-only structure and the proposed one using CPL. The number of transistors is compared between the first compressing step of the conventional Wallace tree-only structure and the Booth encoding of Figure 3(a). In the last comparison, we used the 4-to-2 compressor proposed by Ohkubo [1]. The delay estimation is the same as the results in Figure 6.
[Figure 7: bar chart titled "Area Occupation", vertical axis "transistor no." from 0 to 25000; bar 1, Conventional: 12960, 8820, 2956; bar 2, Proposed: 6425, 8820, 2956; legend includes "Sign Extension Vector"]
Figure 7. Comparison of the Number of Used Transistor
Figure 7 is the chart comparing the number of used transistors between the Booth’s encoder-adapted multiplier and the conventional Wallace tree-only multiplier. We tested the multiplier from Ohkubo [1], which consists of a Wallace tree and a Carry-Lookahead Adder (CLA). Because the 4-to-2 compressor array needs a larger number of transistors than the Booth’s encoder array, using Booth’s encoding reduces the number of used transistors from 12960 to 6425 in the first compressing stage. In the second and following compressing stages, both structures use the same number of transistors. Finally, the CLA structure needs 2956 transistors in both cases. In total, the number of used transistors is reduced by about 26.5% in the entire multiplier, using Booth’s encoding and sign-extension-elimination.
The extra partial product vector due to compensation of the term -2^{n-1} y_{n-1} in equation (1) may increase the compressing steps if a Wallace tree is used. But, in the case of 24×24-bit multiplication, redundant compressor inputs are used to process this extra partial product vector as follows:
Conventional: (4,4,4) → (4,2) → (4) → (2)
Proposed:     (4,4,4) → (4,2,2) → (4) → (2)
VI. Conclusion
In this paper, we proposed a new unsigned multiplying method for reducing area. The main idea is to use Booth’s algorithm and a sign-extension elimination scheme. Our new method can be used effectively for low-area applications on the chip and also for low-power applications due to reduced transistor and transition activities.
Reference
[1] N. Ohkubo et al., “A 4.4 ns CMOS 54 × 54b Multiplier Using Pass-Transistor Multiplexor”, IEEE Journal of Solid-State Circuits, vol. 30, no. 3, Mar. 1995
[2] A. Bellaouar, M. I. Elmasry, Low-Power Digital VLSI Design: Circuits and Systems, Kluwer Academic Publishers
[3] J. L. Hennessy, D. A. Patterson, Computer Architecture: A Quantitative Approach, second edition, Morgan Kaufmann Publishers, Inc., 1996
[4] L. Ciminiera, P. Montuschi, “Carry-Save Multiplication Schemes Without Final Addition”, IEEE Transactions on Computers, vol. 45, no. 9, Sep. 1996
[5] B. S. Cherkauer, E. G. Friedman, “A Hybrid Radix-4/Radix-8 Low Power Signed Multiplier Architecture”, IEEE Transactions on Circuits and Systems, vol. 44, no. 8, Aug. 1997
[6] A. Parameswar, H. Hara, T. Sakurai, “A High Speed, Low Power, Swing Restored Pass-Transistor Logic Based Multiply and Accumulate Circuit for Multimedia Applications”, Proceedings of Custom Integrated Circuits Conference, pp. 278-281, 1994