Download 9 - Desy

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
MULTIPLIER
9.
The multiplier
9.1 Parallel multiplication.
The principle of operation of the parallel multipliers usually utilized in the VLSI
circuits is rather simple. It has origin from the classical algorithm for the product of two
binary numbers.
6  5 = 30
10  9 = 90
1 0 1 0 
1 0 0 1 =

1 0 1 0
0 0 0 0
0 0 0 0
1 0 1 0

1 0 1 1 0 1 0
partial products
1 0 1 0 
1 0 1 1 =

1 1 1 1 0 1 0
1 1 1 0 1 0
0 0 0 0 0
0 1 1 0

1 0 0 1 1 1 1 0
The example reports two multiplications, in one reported to left are considered
operands without sign, whereas in one leftist the operands are represented in notation in
complement to two.
In general, the breakdown of the multiplication of two strings of n bit is a string of
2n1 bit got from the sum of n partial products. The partial product i-esimo, where 0  i 
n 1, is given from the product of the i-esimo bit of the multiplier for the multiplicand,
this digit is displaced of i positions to left. One can distinguish two fases: generation of
the partial products and their sum.
The acceleration of the process of multiplication is based on two main Techniques:
 reduction of the number of the partial products,
 acceleration in the summation of the partial products.
Let’s consider now a theory and methods utilized in the realization of the multiplier
of THE’ALU of TARZAN.
9-1
MULTIPLIER
9.1.1 Reduction of partial products: the Booth coding
The method used for the reduction of the number of partial products is based on a
modification of the known algorithm proposed by A.D.Booth in 1950 (Booth coding,
[Booth]). The modified Booth coding was invented by O. L. Macsorley (1961,
[Macsorley]) and it is described below..
A binary number X = xm1,x m2,...,x0 consisting of m bit represented in notation in
complement to two is considered:
m 2
X=
2m1xm1
+
x 2
i 0
i
i
(9.1)
an equivalent representation of X in reduced form in basis 4 is as following:
m
1
2
X=
d 4
i 0
i
(9.2)
i
The digits di are chosen from the ensemble {2, 1, 0} in agreement with the rules of
the table 9-1:
x2i+1
x2i
x2i1
di
0
0
0
0
1
1
1
1
0
0
1
1
0
0
1
1
0
1
0
1
0
1
0
1
0
1
1
2
2
1
1
0
Tabella 9-1,The truth table of the modified Booth encoder.
The elements di are determined in following way. To each step i, (0  i  (m/2)1),
three bits of X, x2i+1,x2i,x2i1, are examined; the correspondent value of di is obtained
afterwards from the table 9-1. Two tricks is necessary to use: ‘0’ must always be
concatenated to the right of X; the second trick is that m must be always even, therefore,
9-2
MULTIPLIER
if necessary, the bit of sign of X must be extended.
For the demonstration of correctness of this procedure see [Rubinfield].
There is a binary sequence Y of n bit. It’s necessary to calculate the product XY. In
the multiplication m bit  n bit, the multiplier number X of m bit has to be coded
(equation (9.2)) utilizing the algorithm just described. The product XY is afterwards
obtained by summing m/2 partial products, Pi = diY, with 0  i  m/2, that can be easy
calculated by means of operations shift and/or negation of the multiplicand Y:
m
1
2
XY =
m
1
2
d 4 Y =  P 4
i
i 0
i
i 0
i
(9.3)
i
To calculate the sum (9.3), the m/2 partial products Pi must be summed so that the
partial product (i+1)-esimo is shifted two positions to the left of the partial product i-esimo
(in the (9.3) the Pi are multiplied by the power i-esima of 4).
Since the partial products are numbers in notation in complement to two, to obtain
correct value of the sum their sign must be extended to the position m+n1 ( this has
been demonstrated with an example at the beginning of the paragraph). In the parallel
multipliers, the additional cost due to extensions of the sign bit, can be considerable. In
order to avoid summation of the extension bits, the following method [Fadavi] can be
used. It’s assumed that all the partial products are negative. In this case, the sum of all
the extension bits, may be expressed as:
m
1
2
 2 m  1
segni =  ((1)2 )4 = 2 (1) 

 3 
i 0
n
i
n
(9.4)
The relationship (9.4) can be interpreted as a fixed number, (1)[(2m1)/3], which
should be added to the sum of the partial products (not extended), starting from the
position n-esima. This number expressed in binary form is equal to 1010101...010111,
where there are exactly (m/2) - 1 zeros. If the partial product Pi = diY is positive, its sign
bit should be simply replaced by ‘1’, in order to suppress the effect of the previously
9-3
MULTIPLIER
supposed negativity of the partial product Pi. The following example illustrates the
multiplication method using modified Booth coding together with the prevention of the
extension of the sign bit.
Example 9-1
Calculating the product X Y where:
X = 010110 (= 22)
Y = 1101
m=6
(= 3) n = 4
Extension bit:
‘0’ for the negativ Pi,
‘1’ per positiv Pi.
Zero added
0 1 0 1 1 0 0
(1) 0 1 1 0
2
2Y
2Y
(0) 1 0 1 0
2
1
(0) 1 1 0 1
Coefficients dI obtained
from the table 9-1
1 0 1 0 1 1
Y
costant
1 1 1 0 1 1 1 1 1 0 = 66
9.1.2 Summation of the partial products
The summation of the partial products in a parallel multiplier is usually executed
using the full-adders (see § 7.1) connected in a tree structure of a carry-save summator
(CSST: carry save summation tree). The summator has a three-dimensional structure in
which a vertical section represents a CSST for a particular position i of the bit (PPST:
partial product summation tree). A vertical section of twenty bit positions has been
proposed by C. S. Wallace (‘63) [Wallace] and is presenteded in figure 9-1.
9-4
MULTIPLIER
20 19 18
a b Cin
Cout S
17 16
15 14 13
12 11 10
a b Cin
Cout S
a b Cin
Cout S
a b Cin
Cout S
a b Cin
Cout S
9
7
6
a b Cin
Cout S
a b Cin
Cout S
a b Cin
Cout S
8
5
a b Cin
Cout S
a b Cin
Cout S
a b Cin
Cout S
a b Cin
Cout S
4
3
2
1
a b Cin
Cout S
a b Cin
Cout S
a b Cin
Cout S
a b Cin
Cout S
a b Cin
Cout S
a b Cin
Cout S
S20
C21
Figure 9-1, Wallace tree.
If in the level j of the tree (the levels are considered growing from the top) the
n 
number of bits is n, then k =   full-adder should be used for the summation; the k
 3
generated carry-signals are sent to the level j + 1 of the tree i + 1, whereas the k bit of
9-5
MULTIPLIER
 2k 
the sum of the level j constitute inputs of the   adders of the level j + 1, together
 3
with the k carry bits originating from the level j of the tree i  1. Since the number of
bits to sum is riduced three times at each level, the depth of the Wallace tree is O(logN),
where N is the initial number of bit. The two output bits Si and Ci+1, represent
respectively the bit i-esimo and the bit (i+1)-esimo of the two resultant numbers sum of
which gives the final result of the multiplication. Utilizing a structure PPST and an
adder of CLA type, it’s possible to perform summation of n partial products in time
Or(log n).
The critical path in PPST can situate entirely inside the CSST of position N, or can
cross several CSST ending in the position N at the center of the PPST, how displayed in
figure 9-2.
Horisontal propagation
N
N-1
... i+1
i
i-1 ...
3
2
1
Vertical propagation
In diraction to the
output adder
CSST of the
position i-esima
posizione.
OUTPUT ADDER (two operands)
Figure 9-2, "Partial Product Summation Tree"
To increase the velocity of the PPST we must not only to reduce the number of levels
in the CSST of position N, but also to assure that all the signals originated in the CSST
of the lower positions (i = 1,..,N-1) do not contribute to the delay of the signal in the
9-6
MULTIPLIER
position N. You can observe in figure 9-2 that the delay of the critical vertical path of
the CSST position i, with i = 1,...,N-1, is not important because it contsists of fewer
levels than the CSST of position N.
The method used to sum the partial products in multiplier of the ALU of TARZAN
is based on the principles of PPST; the essential difference is the replacing the CSST
with an architecture based on particular devices called compressors [Oklobdzija].
A compressor Ci is a combinatorial device that compresses N input lines (column of
bit) in the position i, to two output lines. In addition there are L input lines coming to
compressor to different levels j, and the same number of lines L generated by the
compressor in the same levels j, where L = N  3, what is shown on the figure 9-3.
N
L
compressore
L
2
Figure 9-3
The aim of a compressor is reduction of the number of bits to two. The lines L, both
coming and leaving, are carry signals. It’s extremely important that the lines L at output
Ci may be connected, in the same order, with input lines L of Ci+1 in such a way that the
signals originated by a some level j of Ci may enter in the same level j of Ci+1. That
preservs the order of the signals and imposes a particular structure of interconnections
of the PPST.
In figure 9-4 the schematic diagram of a compressor 4 : 2 is presented.
9-7
MULTIPLIER
i1
Cin
A
i4
i3
i2
(20)
1
MUX 2:1
0
c
(21)
Cout
(21)
Figure 9-4, compressor 4 : 2
A compressor 4 : 2 has 4 input lines i1, i2, i3, i4, that collect the 4 bit that must be
summed, and two output lines s and c, results of the compression. The additional lines
for the carries are Cin and Cout (in this case since N = 4, L = 4  3 = 1 there is only one
level).
From the scheme of figure 9-4 it is possible immediately derive the booleane
expressions of the outputs a, c and Cout:
9-8
MULTIPLIER
a = (i1  C in )  (i2  i3  i4 ) ;
(9.5)
c = (i2  i3  i4 )  (i1  Cin )  (i2  i3  i4 )  (i1  Cin ) ;
(9.6)
Cout = (i2  i3 )  (i2  i4 )  (i3  i4 ) .
(9.7)
It’s easy to notice that the output carry C out doesn’t depend on the input carry Cin: the
carry propagates only from the compressor of position j, which generates it, to the next
one of position j+1 with a delay of a three inputs XOR-gate.
Utilizing the compressors 4:2 and full-adders it is possible to create compressors of
superior order. The compressors 6:2 and the compressors 9:2 are presented in figure
9-6. A normal full-adders can be used as the compressors 3:2 (in agreement with the
definitions in this case the L = 0).
3:2
3:2
3:2
3:2
3:2
4:2
6:2
Figure 9-5, compressors 9:2 and 6:2
A recent study performed by V. G. Oklobdzija and D. Villeger [Oklobdzija], has
demonstrated that the PPST based on compressors 4:2 may be more efficient than the
traditional implementations PPST using CSST of Wallace. The advantages of usage of
9-9
MULTIPLIER
the compressors of superior order (compressors 9:2) are the larger density and a regular
distribution of the connections (wiring), also ((??if the their efficiency not increases
with the increase from the their dimension??)).
9.2 The operations of multiplication of the ALU of TARZAN
In chapter 6 (Table 6-1) it was described that the ALU of TARZAN performs two
different multiplication operations denoted with the mnemonic codes MUL and MULA
on 32 bit binary operands Op1 and Op2 in notation in complement to two.
The result of the operation MUL is composed of the 32 least significant bits of the
actual product Op1 and Op2 of 64 bit; the over-flow condition must be set if the 32
most significant bits of the result not constitute a extension of its sign bit.
The operation MULA produce a result which is obtained by a multiplication and a
concatenation. The multiplication is executed on of the “reduced” operands: as operands
are considered the 24 least significant bits of Op1 and Op2 following representation in
notation in complement to two with 24 bits. The final result is obtained concatenating to
left of the 24 least significant bits of the product, the 8 most significant bits of Op1. The
condition of over-flow must be true if the 48 bit product doesn’t belongs to ensemble of
the numbers valid for the notation in complement to two on 24 bit.
The necessity of a particular operation like MULA is caused by the format of a
global address generated in TARZAN, it consists of two fields, the 8 most significant
bits represent a remote address part (compare § 2.3.1); therefore multiplication of the
global address doesn’t modify this field.
9-10
MULTIPLIER
9.3 The architecture of the multiplier
The schematic of figure 9-6 represents the architecture of the uppest hierarchical
level of the multiplier (entity MUL_TOP). It consists of a three stage pipeline, in the
schematic every stage is outlined. The operations MUL and MULA are executed in the
same pipeline using three clock cycles. How heretofore were shown for the summator,
also the multiplier works continuously: on each cycle it performs an operation MUL or
a MULA depending on the least significant bit C(0) of the microcode, this bit is really
sufficient to distinguish the two operations (Table 6-1).
The block marked with the name “operands preparation” has the aim to yield
homogenous operations MUL and MULA. In the case of the MULA operation the 8
most significant bits of the operands are substituted with 8 bit of extension of the sign
bit. The 8 most significant bits of the operand Op1 are forwarded unchanged to the third
stage through the two 8 bit registers, so they may be copied to the result, how is
foreseen for the operation MULA. The bit of microcode C(0) suffers the same
treatment, it is propagated unchanged till the last stage where work as a control signal
for the selection of the 8 most significant bits of Op1 and for the selection of the overflow signal.
The Booth encoder has the task to generate the 16 (32/2) partial products of the
multiplication Op1 x Op2. His operation is based on the principles described in §9.1.1
and will be described in detail in §9.3.1. Besides the 16 partial products, the encoder
generates a vector of 32 bit containing eventual ‘carry’ originated during the process of
coding.
The 16 partial products are forwarded in groups of four from the Booth encoder to
four partial products compressors (CPP 4:1), producing eight partial sums of 39 bit
each, which are stored in registers to be used in the second stage of the pipeline during
next clock cycle.
9-11
MULTIPLIER
A
B
Code(0)
A (31..24)
OPERANDS PREPARATION
Moltiplicando
Factor
32
32
BOOTH ENCODER
33
33
33
PPC 4:2
2
PPC 4:2
1
33
PPC 4:2
4
PPC 4:2
3
39
39
39
39
39
39
39
39
32
39
39
39
39
39
39
39
39
32
8
1
8
1
8
1
COMPRESSOR 9 : 2
Cost
64
32
64
FINAL COMPRESSOR 3 : 2
65
65
65
32 (63..32)
65
32 (63..32)
32 (31..0)
32 (31..0)
FINAL ADDER
parte bassa
FINAL ADDER
parte alta
33
33
OVERFLOW CONTROL
1(32)
32
RESULT GENERATOR
U(31..0)
U(32)
Figure 9-6, top level architecture of the multiplier
9-12
MULTIPLIER
In the second stage of the pipeline, the eight partial sums of the first stage are
compressed together with the ‘carry’ vector in the device “compressor 9:2” described in
§9.3.3; the two 64 bit output vectors are then forwarded to the “final compressor 3:2”
that performs the last compression of these vectors and the additional constant described
in §9.1.1 to correct the effect of summation of the partial products using “reduced sign
extention” method.
The third stage performes summation of the two 64 bit obtained after final
compression of the partial products. This summation is executed by two summation
units of 32 bit each: the first performs summation of the 32 least significant bits of the
operands, its result, SLOW, is forwarded to the module that generates the final result of
the multiplication, whereas the second performs summation of the 32 most significant
bits and the result, SHIGH, together with carry CLOW of the first adder are forwarded to
the overflow controlling modul.
The block shown in the schematic as “generator of the result” is controlled by the bit
C(0) of the microcode related to time when the operation was initiated (two clock
periods earlier). If C(0) = 1, the requested operation is a MULA, therefore the 8 most
significant bits of Op1 latched in pipeline registers (conserved for two clock cycles, as
C(0) ) concatenated with the 24 least significant bits of the final sum SLOW are
forwarded to output as a result of the multiplication. Otherwise, in the case of a MUL
operation, SLOW is sent to the output..
Control of the over-flow is executed using SHIGH, CLOW and the bit of sign of SLOW;
in this mode it’s not necessary to calculation extended (64 bit) sum of the partial
products. The condition of over-flow is true when the sum SHIGH + CLOW (which is not
really calculated) does not constitute an extension of the sign bit SLOW. In the case of the
MULA operation the over-flow condition is true if the condition previous is true and the
8 most significant bits of SLOW doesn’t constitute an extension of the bit SLOW(23),
which represents a sign bit in the notation in complement to two of 24 bit numbers.
The diagram of figure 9-7 illustrates the entity of the VHDL implementation of the
9-13
MULTIPLIER
multiplier and it hierarchy. The components shown in the schematic of figure 9-6 but
not presented in the diagram 9-7, particularly the operand preparation unit, the registers,
the generator of the result and the over-flow controller, are implemented directly at the
top level as processes (registers, see. cap. 6) or concurrent assignment statements.
Multiplexor 8 a 1
(Library GTECH)
MUL_BOOT
(Booth encoder)
MUL_CMP_4_2
(Compressor 4:2)
MUL_PPC_4_2
(Compressor 1° stage)
MUL_CMP_9_2
(Compressor 9:2)
MUL_PPC_9_2
(Compressor 2° stage)
MUL_TOP
(multiplier)
MUL_CMP_6_2
(Compressor 6:2)
Full adder
(Library GTECH)
MUL_FCOMP
(Final compressor 2° stage)
Adder CLA 16 bit
(DWC)
MUL_ADDER
(Final adder)
Figure 9-7, hierarchy of the VHDL entity of the multiplier
The architecture of the multiplier, its operation and implementation are briefly
described below.
9.3.1 The Booth encoder
The truth table of the modified Booth encoding (Table 9-1) provides a natural
implementation of the algorithm. The three variable x of of the table can be used as
signals controlling selection of data, constituted from the produced partial products Ydi,
with 0  i  m/2  1. The following schematic illustrates this idea.
9-14
MULTIPLIER
0
Y
Y
2Y
2Y
Y
Y
0
X(2i1)
Selector
X(2i)
X(2i+1)
Pi
Figure 9-8, implementation of the modified Booth coding.
In the particular case Y = Op2, X = Op1 and m = 32. Consequently 16 selectors are
necessary to generate in parallel the 16 relevant partial products of the multiplication
Op1  Op2. Every selector is realized with tanti multiplexer 8 to 1 (cfr. 7.3) quanti are
the bit necessary to contain the multiple of Op2, namely 33. Ciascun multiplexer will
generate the j-esimo bit of the’the-esimo product partial, otherwise Pthe(j), having in income
the j-esimi bit of the multiple of Op2.
The’implementation VHDL of the encoder consists in some simple assegnamenti
competitors for the generation of the multiple of Op2, and in two instructions generated
annidate that enable of istanziare the 16  33 multiplexer 8-a-1 and to perform
automaticamente the connections.
Si observe that the multiple negative of Op2 (cioè Op2, 2Op2), require a
operation of negation that implica The’reversal of the bit of the’operating and the
amount of 1. To minimize the delay, the sum of the 1 not comes executed in this fase
but comes posticipata to the second stadium from the pipeline. For this the encoder
generates in exit a vector of 32 bit, that in the single positions 2the, (the = 0,2,4,...,32)
contains a 1 if the product partial Pthe/2 is Op2 or 2Op2.
9-15
MULTIPLIER
9.3.2 The compressors of the produced partial
The multiplier owns be dispositivi of compression of colonne of bit: four
compressors MUL_PPC_4_2, that work in parallel, the compressor MUL_PPC_9_2 and
the compressor final MUL_FCOMP. All these dispositivi, except The’last, are of THE
PPST (cfr. 9.1.2) that impiegano the circuits compressors 3:2, 4:2, 6:2 and 9:2 seen in
9.1.2.
The componenti of basis are the compressors 3:2 (full-adder) and 4:2. The
compressor 4:2 is implementato how The’entity MUL_CMP_4_2, the that architecture
is esattamente one of the schematic 9-4. The’architecture of THE MUL_CMP_6_2
(compressor 6:2) contains two istanze of the full-adder and a of THE MUL_CMP_4_2,
thus how The’entity MUL_CMP_9_2 has a’architecture cosituita from two istanze of
the full-adder and a istanza of THE MUL_CMP_ 6_2, connected how displayed in
figure 9-5.
The’entity MUL_PPC_4_2 performs the amount of 4 of the 16 produced partial
originated from the encoder of Booth. The figure 9-9 shows the type of compressor
utilized for each column the of bit
38 37
36 35 34 33 32 31
7
6
5
4
3
2
1
0
----------------3:2
3:2
4:2
Figure 9-9, compression of the produced partial in PPC 4:2
The compressor MUL_PPC_9_2 performs the compression of the eight somme
partial produced from THE 4 MUL_PPC_4_2 (of 39 bit) more the vector of 32 bit
generated from the encoder of Booth. The figure 9-10 shows, how in the event previous,
9-16
MULTIPLIER
the type of compressor utilized for each index of column.
63
54
46
39
31
24
16
8
0
9:2
4:2
6:2
3:2
6:2
Figure 9-10, compression of the somme partial in PPC 9:2
The connections between the sundry compressors are realized in respect of the rules
explained in 9.1.2.
Infine the compressor final, implementato how The’entity MUL_FCOMP, is a
addizionatore without report to three operands. Two of the operands are the resulted of
the compression executed from THE MUL_PPC_9_2, the third is the constant of 32 bit
“10101010...1011” that will be aggiunta how foreseen from the’algorithm of prevention
of the’extension of the bit of sign to leave from the position 32. The’architecture of
THE MUL_FCOMP consists of 32 full adder. Ciascun full adder performs the sum of
the bit (the+32)-esimi of the operands and the-esimo (the = 0,1,...,32) from the costate, the
bit breakdown of the’summation and the report are respectively the bit (the+32)-esimo of
the first addendo and the bit (the+32+1)-esimo of the second addendo from the sum final.
9.3.3 The addizionatori final
The addizionatori final are two istanze of the’entity MUL_ADDER. The’archittetura
9-17
MULTIPLIER
of this entity is analogous to one of the’entity TWO_OP_ADDER seen in chapter 7.
The only differenze are in the addizionatori of the part tall of the operands, that in
MUL_ADDER are of 32 bit (against the 33 of THE TWO_OP_ADDER) and in the
door d’income of the report that is absent.
9-18