Download Design of Power and Delay Efficient 32 Bit X 32 Bit Multi

ISSN: 2393-994X KARPAGAM JOURNAL OF ENGINEERING RESEARCH (KJER) Volume No.: III, Special Issue on IEEE Sponsored International Conference on Innovations in information, Embedded and Communication Systems (ICIIECS'15) Design of Power and Delay Efficient 32 Bit X 32 Bit MultiPrecision Multiplier with Operands Scheduler Jitha.K.T 1 , Remya.K.R.2 1 2 PG Scholar,M.Tech VLSI Design, [email protected],N.C.E.R.C,Pampady, Thrissur,Kerala, India Assistant Professor,ECE Department, [email protected], N.C.E.R.C, Pampady, Thrissur, Kerala, India Abstract Multipliers perform the most frequently encountered arithmetic operations in DSP applications. The proposed multi precision(MP) multiplier that incorporates variable precision, parallel processing (PP), and dedicated MP operands scheduling to provide optimum performance for a various operating conditions. The building blocks of the proposed reconfigurable multiplier can either work as independent smaller-precision multipliers or it also work parallel to form higher-precision multipliers. To reduce power consumption and delay, replaces the razor flip flop and voltage scaling unit .The Look up table (LUT) together with dynamic voltage and frequency management system configure the multiplier to work at low power consumption. The LUT stores the minimum voltages required for the multiplication of 8-bit, 16-bit and 32-bit multiplications. The multiplier consists of carry propagation adder which is replaced by carry select adder due to this a considerable delay reduction can be achieved. The MP multiplier is also consists of Frequency management unit makes the multiplier to operate at proper frequency. Finally, the proposed novel MP multiplier can further benefit from an operands scheduler that rearranges the input data, that determine the optimum voltage and frequency operating conditions for minimum power consumption. Experimental results show that the proposed MP Multiplier provides a considerable reduction in power consumption and also in delay compared with conventional razor based Dynamic Voltage Scaling MP Multiplier. When combining this MP design with LUT, Parallel processing, and the operands scheduler, delay and power reduction can be achieved to a great extent. This paper successfully demonstrates that MP multiplier architecture can allow more aggressive frequency/supply voltage scaling for improved power and delay efficiency. Keywords: Computer arithmetic, low power design, multi-precision multiplier, Input Operands Scheduler (IOS), Look Up Table (LUT). 1. Introduction Nowadays, the demand for low power, high performance portable devices has been greatly increased. The growing market of portable electronic systems demands microelectronic circuits design with low power dissipation. The power dissipation mainly due to internal components [1]. In Digital Signal Processing applications, most frequently used arithmetic operation is multiplication. So Multipliers play an important role in today’s digital signal processing and various other applications. With advances in technology, many researchers have tried and are trying to design multipliers which offer either of the following design targets – high speed, low power consumption and hence less area or even combination of them in one multiplier thus making them suitable for various high speed, low power VLSI implementation. Since Multiplication is a fundamental operation in most signal processing applications, Multipliers used in these applications have large area and consume considerable power. Therefore design of low-power multiplier has been an important part in low- power VLSI system design .Fast multipliers are essential parts of digital signal processing systems. The basic multiplication principle is two parts i.e., evaluation of partial products and accumulation of partial products. Many DSP systems is frequently truncated output due to the fixed register size and bus width inside the hardware and also disconnect the unused sections of multiplier to reduce dynamic power reduction can be achieved [8]. Because of this significant power saving can be achieved by removing some adder cells. But in case of ISSN: 2393-994X KARPAGAM JOURNAL OF ENGINEERING RESEARCH (KJER) Volume No.: III, Special Issue on IEEE Sponsored International Conference on Innovations in information, Embedded and Communication Systems (ICIIECS'15) computing Least Significant (LSB) bits of product, this will cause large truncation errors. Various error compensation approaches and circuits can be used for the estimated compensation carries to the carry inputs of the retained adder cells to reduce the truncation error. 2. Background In Most applications are based on 8–16-b operands, the proposed multiplier is designed to not only perform single 16-b but also performs single 8-b, or twin parallel 8-b multiplication operations.in some applications,16 and 32 bit operands are send to smaller multiplication circuit with parallel operation reduce power consumption and also reduces area over head. Due to the complex structure and interconnections, multipliers have large amount of unbalanced path which causes unwanted signal generation and propagation. This can be avoided by proper internal balancing through architectural and transistor level optimization.in most cases of multipliers, maximum word length is provided. Hence small multiplications are done in large multipliers, this causes unwanted switching activity and also power consumption. So word length optimization is the best method in which 8-bit multiplier is reused for 16-bit and 32bit multiplication [2], [3]. Here it is possible to incorporate the pipelining for increasing the speed of the multiplier. Design of MP multiplier using operands scheduler and also storing min voltages for multiplication in LUT provide dramatic power reduction and also reduce delay. A predictable decrease in area overhead also can be achieved.in conventional DVS technique; LUT tunes to supply voltages which are stored as predefined voltage and frequency relationship by considering all worst case conditions. It will consume more time and area. In razor based DVS technique, minimum voltage required for multiplication is found using razor based feedback and also many voltage transitions occurred during the calculation of particular product [1],[7]. Due to this voltage transitions power consumptions and delay for product calculation is also increases. Because of dynamic voltage scaling unit, increased number of preemptions and frequency switching occurs which leads to worst case power consumption and delay and did not get optimum performance at various operating conditions. The concept of this paper includes: the proposed MP multiplier reduces power consumption and delay and also reduce the additional area overhead than conventional 32 x32 bit fixed width multiplier. This multiplier is also consists of operand scheduler which rearranges the input operands and hence reduce voltage transition, thus provide low power consumption. 3. Proposed Methodology Figure 1 show over all multiplier system architecture. The overall multiplier system mainly consists mainly 5 units.1) input operand scheduler which rearranges the input data and hence reduce the supply voltage transition, thus power consumption will be reduced.2) MP multiplier is one which performs multiplication with variable precision and parallel processing.3) Look Up Table is like a storage element used to store voltage required for multiplication.4) Frequency scaling unit (FSU) which provides required frequency for the multiplication 5) voltage and frequency management unit (VFMU) which is receives user requirement and control the LUT and FSU by giving suitable voltage and frequency. All the five building blocks are working together to perform multiplication with variable precision. Input operands are given through IOS blocks which rearrange the inputs to reduce voltage transitions and given to the MP multiplier. This multiplier initially works at standard supply voltage of 3.3v. Since LUT stores the minimum voltage for each combinations, depending on the incoming operands precision adjust the input voltage suitable for that particular combination. LUT stores 1.3v for 8-bit multiplication, 1.7v for 16-b, and 2.2v for 32-b multiplication. The multiplication corresponding frequencies are 1.25GHz, 0.63GHz and 0.56GHz respectively for 8-b, 16-b and 32-b multiplication. Figure 2 shows the proposed multiplier which consists nine 8x 8 bit multipliers. All these multipliers perform individual 8x8 bit operation and also can perform parallel multiplication for 16-bit and 32-bit multiplication. The processing elements of multiplier can either work as 9 independent multipliers or work in parallel to form 1, 2, or ISSN: 2393-994X KARPAGAM JOURNAL OF ENGINEERING RESEARCH (KJER) Volume No.: III, Special Issue on IEEE Sponsored International Conference on Innovations in information, Embedded and Communication Systems (ICIIECS'15) three 16 x 16 bit multiplier or a single 32- bit multiplication operation. The pipelining method reduces delay and also gets fast multiplication result without error. Fig 1. Overall Multiplier System Architecture Parallel processing is the ability of a device to simultaneously process incoming different inputs. Pipelining increases instruction throughput by performing multiple operations at the same time (concurrently), but does not reduce instruction latency (the time to complete a single instruction from start to finish) as it still must go through all steps. Fig 2. Possible Configuration Modes of MP Multiplier 3.1. Look Up Table A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a customer or a designer after manufacturing, hence field-programmable. The FPGA configuration is generally specified using a hardware description language (HDL).Contemporary FPGAs have large resources of logic gates and RAM blocks to implement complex digital computations. As FPGA designs employ very fast I/So and bidirectional data buses it becomes a challenge to verify correct timing of valid data within setup time and hold time. FPGAs contain programmable logic components called logic blocks, and a hierarchy of reconfigurable interconnects that allow the ISSN: 2393-994X KARPAGAM JOURNAL OF ENGINEERING RESEARCH (KJER) Volume No.: III, Special Issue on IEEE Sponsored International Conference on Innovations in information, Embedded and Communication Systems (ICIIECS'15) blocks to be wired together. FPGA consists logic blocks with logic cells. This logic cell consist of 4 input look up table used to implement various functions. A look up table is an array that replaces runtime computation with a simpler array indexing operation. The savings in terms of processing time can be significant, since retrieving a value from memory is often faster than undergoing an expensive computation or input/output operation. The tables may be pre-calculated and stored in static program storage, calculated as part of a program's initialization phase, or even stored in hardware in application-specific platforms. An FPGA has three main elements, Look-Up Tables (LUT), flip-flops, and the routing matrix, that all work together to create a very flexible device. Look-up tables are how your logic actually gets implemented. A LUT consist of some number of inputs and one output. What makes a LUT powerful is that you can program what the output should be for every single possible input. A LUT consists of a block of RAM that is indexed by the LUT's inputs. The output of the LUT is whatever value is in the indexed location in its RAM. Each LUT's output can be optionally connected to a flip-flop. Groups of LUTs and flip-flops are called slices. Here the minimum voltage required for the multiplication of each combination is stored in LUT. Depending on the given input operands, corresponding voltage will be selected from the LUT and also the corresponding frequency is adjusted by frequency scaling unit. Thus using the selected voltage and frequency, it is applied to MP multiplier along with the input operands gives the corresponding results of multiplication. 3.2. Multi-Precision and Reconfigurability Property The input interface unit of MP multiplier is shown in fig 3 which forms the sub module of multiplier. Its function is to provide data to nine independent processing elements shown in fig 2 of the MP multiplier, depending on the selected operating mode. A 2-bit mode control is used to indicate whether the inputs are of 8-bit operands (1/4/9 pairs), or 16-bit operands (1/2/3 pairs) or a single 32-bit operand. This unit uses an extra MSB sign bit for signed and unsigned multiplication of incoming operands. The Processing Elements (PE) performs computation depending on the selected operating modes, that is 8-bit, 16-bit, or 32-bit operation will be performed. Fig 3.Input Interface Unit Figure 4 shows how 3 processing elements of 8x 8 bit can combine to form a 16 x 16 bit multiplier and perform 16-bit multiplier operation. Also 32-bit multiplier can be formed by similar methods but it requires 3 x3 PE of 8 x 8 bit. The 2-bit select mode signal is used as control signal which is used to determine which PEs become active and which are inactive. Based on this 2-bit signal proper precision will be selected and perform the multiplication. If complete precision (32 x32 bit) is not exercised, the corresponding supply voltage and frequency may be scaled down depending on the work load of MP multiplier. Consider X and Y are 2n-bits wide multiplicand and multiplier; these are defined to evaluate the associated overhead to MP and configurability. XH and YH are respective n MSBs and XL and YL are respective n LSBs. XL YL, ISSN: 2393-994X KARPAGAM JOURNAL OF ENGINEERING RESEARCH (KJER) Volume No.: III, Special Issue on IEEE Sponsored International Conference on Innovations in information, Embedded and Communication Systems (ICIIECS'15) XH YL, XL YH, XH YH is corresponding cross products. This can be also represented as follows: P= (XH YH) 22n + (XH YL+ XL YH)2n + XL YL (1) Fig 3. Three PEs combined to form 16 × 16 bit multiplier Where XLYL, XH YL, XL YH, and XH YH can be computed using 4, n x n bit multipliers, 2n bit reconfigurable multiplier can be constructed using adders. If X1 = XH + XL (2) Y1 = YH + YL (3) Then equation (1) will be P= (XH YH)22n + ( X1 Y1 - XH YH - XL YL)2n + XL YL (4) Compare the equations (1) & (4),one n x n bit multiplier for calculation of XH YL or XL YH and one 2n bit adder for calculation of XH YL + XL YH can be removed. Also two n-bit adders and two (2n+2)-bit subtractors are provided for calculation of XH + XL ,YH + YL and X1Y1 - XH YH - XL YL resp. Evaluation of the proposed MP multiplier can be done by comparing it with 32bit fixed width multiplier and four sub-block MP multipliers which are designed using Booth Radix-4 Wallace tree structure [4],[5]. This is done similar to proposed MP multiplier that consists of 3 sub-block multipliers. The power simulations are performed at clock frequency of 50 MHz and power supply voltage at 3.3v. From the evaluation, the proposed MP 3 sub-block multiplier architecture achieved reduction in power and area compared to fixed width multiplier design. The large size of fixed width multiplier provides an irregular and complex interconnects. This causes increase in area and also additional power consumption. Table I shows the comparison between different multipliers at 50 MHz. ISSN: 2393-994X KARPAGAM JOURNAL OF ENGINEERING RESEARCH (KJER) Volume No.: III, Special Issue on IEEE Sponsored International Conference on Innovations in information, Embedded and Communication Systems (ICIIECS'15) This proposed MP 3 sub-block multiplier architecture is better than other sub-block multipliers in most of multiplication applications. The proposed MP three sub-block architecture can achieve 16% reduction in power and 28% reduction in area as compared with the conventional 32 × 32 bit fixed-width multiplier design. This limitation of tree multipliers can be addressed by this proposed MP 32 × 32 bit multiplier, which uses a more regular design to partition, regroup, and sum partial products. The proposed multiplier is easy to implement and more suitable for DSP applications. Table- 1: Area and Power Comparison of Proposed MP Multipliers against Conventional Fixed-Width Multiplier Running At 50 MHz. Schemes Power(mW) Area(mm2) 32-bit fixed width multiplier 39 0.624 32-bit 4 sub-block MP multiplier 33.36 0.736 32bit 3 sub-block MP multiplier 20 0.448 3.3. Frequency Scaling Unit Frequency scaling unit of proposed MP multiplier is used for frequency tuning to meet the system throughput requirements. This frequency unit is implemented using Voltage Controlled Oscillator (VCO) as a seven-stage current starved ring oscillator. This is because proposed multiplier can operate either as a 32-bit multiplier or as nine independent 8-bit multipliers. The power consumption of VCO for the frequency ranges is 85 to 149µW for 5MHz to 50MHz. This power consumption is negligible compared to proposed multiplier power consumption. One clock cycle is required to settle down the clock frequency. The frequency scaling unit is one which equipped with VCO is used to select frequency for each combination of multiplication. Depending on the control signal, it gives frequency that pre-calculated for 8 x 8bit, 16 x 16 bit and 32 x 32 bit for proper multiplication to reduce delay. Depending on the voltage VCO adjust the frequency. For each combination of multiplication, we can select the corresponding suitable frequency. 3.4. Input Operands Scheduler The input operands scheduler which rearranges the input data and hence reduce the supply voltage transition, thus power consumption will be reduced. It consists of range detector, buffer (RAM), and a voltage and frequency analyser. These help to rearrange the input and detect the precision and send to MP multiplier. Here proposed an IOS that will perform the following tasks: 1) reorder the input data stream so that same-precision operands are grouped together into a buffer and 2) takes the minimum supply and frequency from the LUT. Fig 5 shows Input Operand Scheduler. ISSN: 2393-994X KARPAGAM JOURNAL OF ENGINEERING RESEARCH (KJER) Volume No.: III, Special Issue on IEEE Sponsored International Conference on Innovations in information, Embedded and Communication Systems (ICIIECS'15) Fig 5. Input Operands Scheduler. 4. Simulation Results The programming can be done by using VHDL and it can be simulated using Model Sims and also using Xilinx ISE. Power analysis is done by using X-power analyser. The power consumption is obtained as 20.1mW. This can be viewed from Xilinx X-power analyser. Comparing with the existing system 14.55 % of power reduction can be achieved. By avoiding the razor flip-flop and the DVS unit, considerable delay reduction can be achieved. This can be calculated from after running X-power analyser and observing the timing constraints. Delay of the proposed system is obtained 19.517ns that is 9.67% reduction in delay can be achieved. Table II shows the comparison between MP multiplier with LUT and MP multiplier with Razor flip-flop and DVS. The carry propagation adder is replaced with carry select adder; hence the corresponding delay for the calculation of sum will be reduced. CSA slightly increase the area, but considerable amount of reduction in delay achieved. When considering the reduction that also helps for fast multiplication. When considering the reduction in delay, increase in area will be negligible. Table 2: Comparison between MP Multiplier with LUT and MP Multiplier with Razor Flip-Flop and DVS Schemes MP multiplier with razor flip-flop and DVS MP multiplier with LUT Power(mW) Delay(ns) 23.52 21.605 20.1 19.517 5. Performance and Evaluation The operation of multiplier is controlled by two external signals .i.e. operating frequency and voltage signal. These two signals are automatically tuned to correct values depending on the actual workload. The simulation is done by using giving input operands and comparing the results with a PC that gives true results. The 32-bit precision data multiplication includes data word length from 17-32-bits. Also 16-bit and 8-bit data precision multiplication includes the data word length of 9-16 bits and 0-8-bits. If assumption is that a 32 × 32 bit multiplier is built using three 16 × 16 bit multipliers, then the area of the 32-bit multiplier is at least three times that of the 16-bit multiplier, discarding reconfigurability logic. The proposed MP multiplier provides the most reconfigurability while exhibiting the smallest relative area. Compared with the existing designs with the same maximum word-length of 32-bit the design exhibit a much smaller area. Some design have their maximum word-length is 16-bit instead of 32-bit. ISSN: 2393-994X KARPAGAM JOURNAL OF ENGINEERING RESEARCH (KJER) Volume No.: III, Special Issue on IEEE Sponsored International Conference on Innovations in information, Embedded and Communication Systems (ICIIECS'15) 6. Conclusion The proposed novel MP multiplier featuring with variable precision and parallel processing provides the multiplication up to 32-bit without much delay and power consumption than the any other fixed width multiplier. This architecture replaced the razor flip-flop and VSU and thus reduced delay and power consumption with considerable amount. The MP multiplier consists of an operands scheduler that rearranges the input to reduce the number of transitions of the supply voltage and thus minimized the overall power consumption of the multiplier. This MP multiplier can achieve 14.55% reduction in power consumption and 9.67% of delay reduction. The proposed MP multiplier based on LUT and DVFM provided a solution to achieve reduced delay, low power consumption and full computational flexibility for various general purpose low power DSP applications. Acknowledgements First of all, we offer thanks to our parents for their blessings. We are indebted to God Almighty for blessing us with His grace and taking our endeavour to a successful culmination. Our sincere thanks to the experts who have contributed towards the development of the paper. We, finally, thank all our friends and well-wishers who had supported us directly and indirectly during the work. References [1] [2] [3] [4] [5] [6] [7] [8] Xiaoxiao Zhang, and Amine Bermak,”32 Bit×32 Bit Multiprecision Razor-Based DynamicVoltage Scaling Multiplier With Operands Scheduler”,IEEE Trans. Very Large ScaleIntegr. (VLSI) Syst., vol. 22, no. 4,Apr 2014. M. Bhardwaj, R. Min, and A. Chandrakasan, “Quantifying and enhancing power awareness of VLSI systems,” IEEE Trans. Very Large ScaleIntegr. (VLSI) Syst., vol. 9, no. 6, pp. 757–772, Dec. 2001. A.Wang and A. Chandrakasan,“Energy-aware architectures for a realvalued FFT implementation,” in Proc. IEEE Int. Symp. Low Power Electron. Design, Aug. 2003. H. Lee, “A power-aware scalable pipelined booth multiplier,” in Proc. IEEE Int. SOC Conf., Sep. 2004, pp. 123–126. S.-R. Kuang and J.-P. Wang, “Design of power-efficient configurable booth multiplier,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57,no. 3, pp. 568–580, Mar. 2010. T. Yamanaka and V. G. Moshnyaga, “Reducing multiplier energy by data-driven voltage variation,” in Proc. IEEE Int. Symp. Circuits Syst., May 2004, pp. 285–288. Nakai, S. Akui, K. Seno, T. Meguro, T. Seki, T. Kondo,A. Hashiguchi, H. Kawahara, K. Kumano, and M. Shimura, “Dynamic voltage and frequency management for a low-power embedded microprocessor,”IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 28–35, Jan. 2005. S. D. Haynes, A. Ferrari, and P. Y. K. Cheung, “Flexible reconfigurable multiplier blocks suitable for enhancing the architecture of FPGAs,” in Proc. IEEE Custom Integr. Circuits, May 1999, pp. 191–194.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Design of Power and Delay Efficient 32 Bit X 32 Bit Multi