Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
MULTIPLIER 9. The multiplier 9.1 Parallel multiplication. The principle of operation of the parallel multipliers usually utilized in the VLSI circuits is rather simple. It has origin from the classical algorithm for the product of two binary numbers. 6 5 = 30 10 9 = 90 1 0 1 0 1 0 0 1 = 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 1 0 partial products 1 0 1 0 1 0 1 1 = 1 1 1 1 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 1 1 1 1 0 The example reports two multiplications, in one reported to left are considered operands without sign, whereas in one leftist the operands are represented in notation in complement to two. In general, the breakdown of the multiplication of two strings of n bit is a string of 2n1 bit got from the sum of n partial products. The partial product i-esimo, where 0 i n 1, is given from the product of the i-esimo bit of the multiplier for the multiplicand, this digit is displaced of i positions to left. One can distinguish two fases: generation of the partial products and their sum. The acceleration of the process of multiplication is based on two main Techniques: reduction of the number of the partial products, acceleration in the summation of the partial products. Let’s consider now a theory and methods utilized in the realization of the multiplier of THE’ALU of TARZAN. 9-1 MULTIPLIER 9.1.1 Reduction of partial products: the Booth coding The method used for the reduction of the number of partial products is based on a modification of the known algorithm proposed by A.D.Booth in 1950 (Booth coding, [Booth]). The modified Booth coding was invented by O. L. Macsorley (1961, [Macsorley]) and it is described below.. A binary number X = xm1,x m2,...,x0 consisting of m bit represented in notation in complement to two is considered: m 2 X= 2m1xm1 + x 2 i 0 i i (9.1) an equivalent representation of X in reduced form in basis 4 is as following: m 1 2 X= d 4 i 0 i (9.2) i The digits di are chosen from the ensemble {2, 1, 0} in agreement with the rules of the table 9-1: x2i+1 x2i x2i1 di 0 0 0 0 1 1 1 1 0 0 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0 1 1 2 2 1 1 0 Tabella 9-1,The truth table of the modified Booth encoder. The elements di are determined in following way. To each step i, (0 i (m/2)1), three bits of X, x2i+1,x2i,x2i1, are examined; the correspondent value of di is obtained afterwards from the table 9-1. Two tricks is necessary to use: ‘0’ must always be concatenated to the right of X; the second trick is that m must be always even, therefore, 9-2 MULTIPLIER if necessary, the bit of sign of X must be extended. For the demonstration of correctness of this procedure see [Rubinfield]. There is a binary sequence Y of n bit. It’s necessary to calculate the product XY. In the multiplication m bit n bit, the multiplier number X of m bit has to be coded (equation (9.2)) utilizing the algorithm just described. The product XY is afterwards obtained by summing m/2 partial products, Pi = diY, with 0 i m/2, that can be easy calculated by means of operations shift and/or negation of the multiplicand Y: m 1 2 XY = m 1 2 d 4 Y = P 4 i i 0 i i 0 i (9.3) i To calculate the sum (9.3), the m/2 partial products Pi must be summed so that the partial product (i+1)-esimo is shifted two positions to the left of the partial product i-esimo (in the (9.3) the Pi are multiplied by the power i-esima of 4). Since the partial products are numbers in notation in complement to two, to obtain correct value of the sum their sign must be extended to the position m+n1 ( this has been demonstrated with an example at the beginning of the paragraph). In the parallel multipliers, the additional cost due to extensions of the sign bit, can be considerable. In order to avoid summation of the extension bits, the following method [Fadavi] can be used. It’s assumed that all the partial products are negative. In this case, the sum of all the extension bits, may be expressed as: m 1 2 2 m 1 segni = ((1)2 )4 = 2 (1) 3 i 0 n i n (9.4) The relationship (9.4) can be interpreted as a fixed number, (1)[(2m1)/3], which should be added to the sum of the partial products (not extended), starting from the position n-esima. This number expressed in binary form is equal to 1010101...010111, where there are exactly (m/2) - 1 zeros. If the partial product Pi = diY is positive, its sign bit should be simply replaced by ‘1’, in order to suppress the effect of the previously 9-3 MULTIPLIER supposed negativity of the partial product Pi. The following example illustrates the multiplication method using modified Booth coding together with the prevention of the extension of the sign bit. Example 9-1 Calculating the product X Y where: X = 010110 (= 22) Y = 1101 m=6 (= 3) n = 4 Extension bit: ‘0’ for the negativ Pi, ‘1’ per positiv Pi. Zero added 0 1 0 1 1 0 0 (1) 0 1 1 0 2 2Y 2Y (0) 1 0 1 0 2 1 (0) 1 1 0 1 Coefficients dI obtained from the table 9-1 1 0 1 0 1 1 Y costant 1 1 1 0 1 1 1 1 1 0 = 66 9.1.2 Summation of the partial products The summation of the partial products in a parallel multiplier is usually executed using the full-adders (see § 7.1) connected in a tree structure of a carry-save summator (CSST: carry save summation tree). The summator has a three-dimensional structure in which a vertical section represents a CSST for a particular position i of the bit (PPST: partial product summation tree). A vertical section of twenty bit positions has been proposed by C. S. Wallace (‘63) [Wallace] and is presenteded in figure 9-1. 9-4 MULTIPLIER 20 19 18 a b Cin Cout S 17 16 15 14 13 12 11 10 a b Cin Cout S a b Cin Cout S a b Cin Cout S a b Cin Cout S 9 7 6 a b Cin Cout S a b Cin Cout S a b Cin Cout S 8 5 a b Cin Cout S a b Cin Cout S a b Cin Cout S a b Cin Cout S 4 3 2 1 a b Cin Cout S a b Cin Cout S a b Cin Cout S a b Cin Cout S a b Cin Cout S a b Cin Cout S S20 C21 Figure 9-1, Wallace tree. If in the level j of the tree (the levels are considered growing from the top) the n number of bits is n, then k = full-adder should be used for the summation; the k 3 generated carry-signals are sent to the level j + 1 of the tree i + 1, whereas the k bit of 9-5 MULTIPLIER 2k the sum of the level j constitute inputs of the adders of the level j + 1, together 3 with the k carry bits originating from the level j of the tree i 1. Since the number of bits to sum is riduced three times at each level, the depth of the Wallace tree is O(logN), where N is the initial number of bit. The two output bits Si and Ci+1, represent respectively the bit i-esimo and the bit (i+1)-esimo of the two resultant numbers sum of which gives the final result of the multiplication. Utilizing a structure PPST and an adder of CLA type, it’s possible to perform summation of n partial products in time Or(log n). The critical path in PPST can situate entirely inside the CSST of position N, or can cross several CSST ending in the position N at the center of the PPST, how displayed in figure 9-2. Horisontal propagation N N-1 ... i+1 i i-1 ... 3 2 1 Vertical propagation In diraction to the output adder CSST of the position i-esima posizione. OUTPUT ADDER (two operands) Figure 9-2, "Partial Product Summation Tree" To increase the velocity of the PPST we must not only to reduce the number of levels in the CSST of position N, but also to assure that all the signals originated in the CSST of the lower positions (i = 1,..,N-1) do not contribute to the delay of the signal in the 9-6 MULTIPLIER position N. You can observe in figure 9-2 that the delay of the critical vertical path of the CSST position i, with i = 1,...,N-1, is not important because it contsists of fewer levels than the CSST of position N. The method used to sum the partial products in multiplier of the ALU of TARZAN is based on the principles of PPST; the essential difference is the replacing the CSST with an architecture based on particular devices called compressors [Oklobdzija]. A compressor Ci is a combinatorial device that compresses N input lines (column of bit) in the position i, to two output lines. In addition there are L input lines coming to compressor to different levels j, and the same number of lines L generated by the compressor in the same levels j, where L = N 3, what is shown on the figure 9-3. N L compressore L 2 Figure 9-3 The aim of a compressor is reduction of the number of bits to two. The lines L, both coming and leaving, are carry signals. It’s extremely important that the lines L at output Ci may be connected, in the same order, with input lines L of Ci+1 in such a way that the signals originated by a some level j of Ci may enter in the same level j of Ci+1. That preservs the order of the signals and imposes a particular structure of interconnections of the PPST. In figure 9-4 the schematic diagram of a compressor 4 : 2 is presented. 9-7 MULTIPLIER i1 Cin A i4 i3 i2 (20) 1 MUX 2:1 0 c (21) Cout (21) Figure 9-4, compressor 4 : 2 A compressor 4 : 2 has 4 input lines i1, i2, i3, i4, that collect the 4 bit that must be summed, and two output lines s and c, results of the compression. The additional lines for the carries are Cin and Cout (in this case since N = 4, L = 4 3 = 1 there is only one level). From the scheme of figure 9-4 it is possible immediately derive the booleane expressions of the outputs a, c and Cout: 9-8 MULTIPLIER a = (i1 C in ) (i2 i3 i4 ) ; (9.5) c = (i2 i3 i4 ) (i1 Cin ) (i2 i3 i4 ) (i1 Cin ) ; (9.6) Cout = (i2 i3 ) (i2 i4 ) (i3 i4 ) . (9.7) It’s easy to notice that the output carry C out doesn’t depend on the input carry Cin: the carry propagates only from the compressor of position j, which generates it, to the next one of position j+1 with a delay of a three inputs XOR-gate. Utilizing the compressors 4:2 and full-adders it is possible to create compressors of superior order. The compressors 6:2 and the compressors 9:2 are presented in figure 9-6. A normal full-adders can be used as the compressors 3:2 (in agreement with the definitions in this case the L = 0). 3:2 3:2 3:2 3:2 3:2 4:2 6:2 Figure 9-5, compressors 9:2 and 6:2 A recent study performed by V. G. Oklobdzija and D. Villeger [Oklobdzija], has demonstrated that the PPST based on compressors 4:2 may be more efficient than the traditional implementations PPST using CSST of Wallace. The advantages of usage of 9-9 MULTIPLIER the compressors of superior order (compressors 9:2) are the larger density and a regular distribution of the connections (wiring), also ((??if the their efficiency not increases with the increase from the their dimension??)). 9.2 The operations of multiplication of the ALU of TARZAN In chapter 6 (Table 6-1) it was described that the ALU of TARZAN performs two different multiplication operations denoted with the mnemonic codes MUL and MULA on 32 bit binary operands Op1 and Op2 in notation in complement to two. The result of the operation MUL is composed of the 32 least significant bits of the actual product Op1 and Op2 of 64 bit; the over-flow condition must be set if the 32 most significant bits of the result not constitute a extension of its sign bit. The operation MULA produce a result which is obtained by a multiplication and a concatenation. The multiplication is executed on of the “reduced” operands: as operands are considered the 24 least significant bits of Op1 and Op2 following representation in notation in complement to two with 24 bits. The final result is obtained concatenating to left of the 24 least significant bits of the product, the 8 most significant bits of Op1. The condition of over-flow must be true if the 48 bit product doesn’t belongs to ensemble of the numbers valid for the notation in complement to two on 24 bit. The necessity of a particular operation like MULA is caused by the format of a global address generated in TARZAN, it consists of two fields, the 8 most significant bits represent a remote address part (compare § 2.3.1); therefore multiplication of the global address doesn’t modify this field. 9-10 MULTIPLIER 9.3 The architecture of the multiplier The schematic of figure 9-6 represents the architecture of the uppest hierarchical level of the multiplier (entity MUL_TOP). It consists of a three stage pipeline, in the schematic every stage is outlined. The operations MUL and MULA are executed in the same pipeline using three clock cycles. How heretofore were shown for the summator, also the multiplier works continuously: on each cycle it performs an operation MUL or a MULA depending on the least significant bit C(0) of the microcode, this bit is really sufficient to distinguish the two operations (Table 6-1). The block marked with the name “operands preparation” has the aim to yield homogenous operations MUL and MULA. In the case of the MULA operation the 8 most significant bits of the operands are substituted with 8 bit of extension of the sign bit. The 8 most significant bits of the operand Op1 are forwarded unchanged to the third stage through the two 8 bit registers, so they may be copied to the result, how is foreseen for the operation MULA. The bit of microcode C(0) suffers the same treatment, it is propagated unchanged till the last stage where work as a control signal for the selection of the 8 most significant bits of Op1 and for the selection of the overflow signal. The Booth encoder has the task to generate the 16 (32/2) partial products of the multiplication Op1 x Op2. His operation is based on the principles described in §9.1.1 and will be described in detail in §9.3.1. Besides the 16 partial products, the encoder generates a vector of 32 bit containing eventual ‘carry’ originated during the process of coding. The 16 partial products are forwarded in groups of four from the Booth encoder to four partial products compressors (CPP 4:1), producing eight partial sums of 39 bit each, which are stored in registers to be used in the second stage of the pipeline during next clock cycle. 9-11 MULTIPLIER A B Code(0) A (31..24) OPERANDS PREPARATION Moltiplicando Factor 32 32 BOOTH ENCODER 33 33 33 PPC 4:2 2 PPC 4:2 1 33 PPC 4:2 4 PPC 4:2 3 39 39 39 39 39 39 39 39 32 39 39 39 39 39 39 39 39 32 8 1 8 1 8 1 COMPRESSOR 9 : 2 Cost 64 32 64 FINAL COMPRESSOR 3 : 2 65 65 65 32 (63..32) 65 32 (63..32) 32 (31..0) 32 (31..0) FINAL ADDER parte bassa FINAL ADDER parte alta 33 33 OVERFLOW CONTROL 1(32) 32 RESULT GENERATOR U(31..0) U(32) Figure 9-6, top level architecture of the multiplier 9-12 MULTIPLIER In the second stage of the pipeline, the eight partial sums of the first stage are compressed together with the ‘carry’ vector in the device “compressor 9:2” described in §9.3.3; the two 64 bit output vectors are then forwarded to the “final compressor 3:2” that performs the last compression of these vectors and the additional constant described in §9.1.1 to correct the effect of summation of the partial products using “reduced sign extention” method. The third stage performes summation of the two 64 bit obtained after final compression of the partial products. This summation is executed by two summation units of 32 bit each: the first performs summation of the 32 least significant bits of the operands, its result, SLOW, is forwarded to the module that generates the final result of the multiplication, whereas the second performs summation of the 32 most significant bits and the result, SHIGH, together with carry CLOW of the first adder are forwarded to the overflow controlling modul. The block shown in the schematic as “generator of the result” is controlled by the bit C(0) of the microcode related to time when the operation was initiated (two clock periods earlier). If C(0) = 1, the requested operation is a MULA, therefore the 8 most significant bits of Op1 latched in pipeline registers (conserved for two clock cycles, as C(0) ) concatenated with the 24 least significant bits of the final sum SLOW are forwarded to output as a result of the multiplication. Otherwise, in the case of a MUL operation, SLOW is sent to the output.. Control of the over-flow is executed using SHIGH, CLOW and the bit of sign of SLOW; in this mode it’s not necessary to calculation extended (64 bit) sum of the partial products. The condition of over-flow is true when the sum SHIGH + CLOW (which is not really calculated) does not constitute an extension of the sign bit SLOW. In the case of the MULA operation the over-flow condition is true if the condition previous is true and the 8 most significant bits of SLOW doesn’t constitute an extension of the bit SLOW(23), which represents a sign bit in the notation in complement to two of 24 bit numbers. The diagram of figure 9-7 illustrates the entity of the VHDL implementation of the 9-13 MULTIPLIER multiplier and it hierarchy. The components shown in the schematic of figure 9-6 but not presented in the diagram 9-7, particularly the operand preparation unit, the registers, the generator of the result and the over-flow controller, are implemented directly at the top level as processes (registers, see. cap. 6) or concurrent assignment statements. Multiplexor 8 a 1 (Library GTECH) MUL_BOOT (Booth encoder) MUL_CMP_4_2 (Compressor 4:2) MUL_PPC_4_2 (Compressor 1° stage) MUL_CMP_9_2 (Compressor 9:2) MUL_PPC_9_2 (Compressor 2° stage) MUL_TOP (multiplier) MUL_CMP_6_2 (Compressor 6:2) Full adder (Library GTECH) MUL_FCOMP (Final compressor 2° stage) Adder CLA 16 bit (DWC) MUL_ADDER (Final adder) Figure 9-7, hierarchy of the VHDL entity of the multiplier The architecture of the multiplier, its operation and implementation are briefly described below. 9.3.1 The Booth encoder The truth table of the modified Booth encoding (Table 9-1) provides a natural implementation of the algorithm. The three variable x of of the table can be used as signals controlling selection of data, constituted from the produced partial products Ydi, with 0 i m/2 1. The following schematic illustrates this idea. 9-14 MULTIPLIER 0 Y Y 2Y 2Y Y Y 0 X(2i1) Selector X(2i) X(2i+1) Pi Figure 9-8, implementation of the modified Booth coding. In the particular case Y = Op2, X = Op1 and m = 32. Consequently 16 selectors are necessary to generate in parallel the 16 relevant partial products of the multiplication Op1 Op2. Every selector is realized with tanti multiplexer 8 to 1 (cfr. 7.3) quanti are the bit necessary to contain the multiple of Op2, namely 33. Ciascun multiplexer will generate the j-esimo bit of the’the-esimo product partial, otherwise Pthe(j), having in income the j-esimi bit of the multiple of Op2. The’implementation VHDL of the encoder consists in some simple assegnamenti competitors for the generation of the multiple of Op2, and in two instructions generated annidate that enable of istanziare the 16 33 multiplexer 8-a-1 and to perform automaticamente the connections. Si observe that the multiple negative of Op2 (cioè Op2, 2Op2), require a operation of negation that implica The’reversal of the bit of the’operating and the amount of 1. To minimize the delay, the sum of the 1 not comes executed in this fase but comes posticipata to the second stadium from the pipeline. For this the encoder generates in exit a vector of 32 bit, that in the single positions 2the, (the = 0,2,4,...,32) contains a 1 if the product partial Pthe/2 is Op2 or 2Op2. 9-15 MULTIPLIER 9.3.2 The compressors of the produced partial The multiplier owns be dispositivi of compression of colonne of bit: four compressors MUL_PPC_4_2, that work in parallel, the compressor MUL_PPC_9_2 and the compressor final MUL_FCOMP. All these dispositivi, except The’last, are of THE PPST (cfr. 9.1.2) that impiegano the circuits compressors 3:2, 4:2, 6:2 and 9:2 seen in 9.1.2. The componenti of basis are the compressors 3:2 (full-adder) and 4:2. The compressor 4:2 is implementato how The’entity MUL_CMP_4_2, the that architecture is esattamente one of the schematic 9-4. The’architecture of THE MUL_CMP_6_2 (compressor 6:2) contains two istanze of the full-adder and a of THE MUL_CMP_4_2, thus how The’entity MUL_CMP_9_2 has a’architecture cosituita from two istanze of the full-adder and a istanza of THE MUL_CMP_ 6_2, connected how displayed in figure 9-5. The’entity MUL_PPC_4_2 performs the amount of 4 of the 16 produced partial originated from the encoder of Booth. The figure 9-9 shows the type of compressor utilized for each column the of bit 38 37 36 35 34 33 32 31 7 6 5 4 3 2 1 0 ----------------3:2 3:2 4:2 Figure 9-9, compression of the produced partial in PPC 4:2 The compressor MUL_PPC_9_2 performs the compression of the eight somme partial produced from THE 4 MUL_PPC_4_2 (of 39 bit) more the vector of 32 bit generated from the encoder of Booth. The figure 9-10 shows, how in the event previous, 9-16 MULTIPLIER the type of compressor utilized for each index of column. 63 54 46 39 31 24 16 8 0 9:2 4:2 6:2 3:2 6:2 Figure 9-10, compression of the somme partial in PPC 9:2 The connections between the sundry compressors are realized in respect of the rules explained in 9.1.2. Infine the compressor final, implementato how The’entity MUL_FCOMP, is a addizionatore without report to three operands. Two of the operands are the resulted of the compression executed from THE MUL_PPC_9_2, the third is the constant of 32 bit “10101010...1011” that will be aggiunta how foreseen from the’algorithm of prevention of the’extension of the bit of sign to leave from the position 32. The’architecture of THE MUL_FCOMP consists of 32 full adder. Ciascun full adder performs the sum of the bit (the+32)-esimi of the operands and the-esimo (the = 0,1,...,32) from the costate, the bit breakdown of the’summation and the report are respectively the bit (the+32)-esimo of the first addendo and the bit (the+32+1)-esimo of the second addendo from the sum final. 9.3.3 The addizionatori final The addizionatori final are two istanze of the’entity MUL_ADDER. The’archittetura 9-17 MULTIPLIER of this entity is analogous to one of the’entity TWO_OP_ADDER seen in chapter 7. The only differenze are in the addizionatori of the part tall of the operands, that in MUL_ADDER are of 32 bit (against the 33 of THE TWO_OP_ADDER) and in the door d’income of the report that is absent. 9-18