Download Design of Power and Delay Efficient 32 Bit X 32 Bit Multi

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Electrification wikipedia , lookup

Electrical substation wikipedia , lookup

Decibel wikipedia , lookup

Spectral density wikipedia , lookup

Heterodyne wikipedia , lookup

Resistive opto-isolator wikipedia , lookup

Electronic engineering wikipedia , lookup

Three-phase electric power wikipedia , lookup

Electric power system wikipedia , lookup

Audio power wikipedia , lookup

Immunity-aware programming wikipedia , lookup

Power inverter wikipedia , lookup

Rectifier wikipedia , lookup

Surge protector wikipedia , lookup

Stray voltage wikipedia , lookup

Power over Ethernet wikipedia , lookup

Pulse-width modulation wikipedia , lookup

Metadyne wikipedia , lookup

Voltage regulator wikipedia , lookup

Variable-frequency drive wikipedia , lookup

History of electric power transmission wikipedia , lookup

Power engineering wikipedia , lookup

Utility frequency wikipedia , lookup

Islanding wikipedia , lookup

Opto-isolator wikipedia , lookup

Amtrak's 25 Hz traction power system wikipedia , lookup

Buck converter wikipedia , lookup

Distribution management system wikipedia , lookup

Power electronics wikipedia , lookup

AC adapter wikipedia , lookup

Power supply wikipedia , lookup

Voltage optimisation wikipedia , lookup

Alternating current wikipedia , lookup

Switched-mode power supply wikipedia , lookup

Mains electricity wikipedia , lookup

Transcript
ISSN: 2393-994X
KARPAGAM JOURNAL OF ENGINEERING RESEARCH (KJER)
Volume No.: III, Special Issue on IEEE Sponsored International Conference on Innovations in information,
Embedded and Communication Systems (ICIIECS'15)
Design of Power and Delay Efficient 32 Bit X 32 Bit MultiPrecision Multiplier with Operands Scheduler
Jitha.K.T 1 , Remya.K.R.2
1
2
PG Scholar,M.Tech VLSI Design, [email protected],N.C.E.R.C,Pampady, Thrissur,Kerala, India
Assistant Professor,ECE Department, [email protected], N.C.E.R.C, Pampady, Thrissur, Kerala, India
Abstract
Multipliers perform the most frequently encountered arithmetic operations in DSP applications. The
proposed multi precision(MP) multiplier that incorporates variable precision, parallel processing (PP),
and dedicated MP operands scheduling to provide optimum performance for a various operating
conditions. The building blocks of the proposed reconfigurable multiplier can either work as independent
smaller-precision multipliers or it also work parallel to form higher-precision multipliers. To reduce
power consumption and delay, replaces the razor flip flop and voltage scaling unit .The Look up table
(LUT) together with dynamic voltage and frequency management system configure the multiplier to work
at low power consumption. The LUT stores the minimum voltages required for the multiplication of 8-bit,
16-bit and 32-bit multiplications. The multiplier consists of carry propagation adder which is replaced
by carry select adder due to this a considerable delay reduction can be achieved. The MP multiplier is
also consists of Frequency management unit makes the multiplier to operate at proper frequency. Finally,
the proposed novel MP multiplier can further benefit from an operands scheduler that rearranges the
input data, that determine the optimum voltage and frequency operating conditions for minimum power
consumption. Experimental results show that the proposed MP Multiplier provides a considerable
reduction in power consumption and also in delay compared with conventional razor based Dynamic
Voltage Scaling MP Multiplier. When combining this MP design with LUT, Parallel processing, and the
operands scheduler, delay and power reduction can be achieved to a great extent. This paper successfully
demonstrates that MP multiplier architecture can allow more aggressive frequency/supply voltage
scaling for improved power and delay efficiency.
Keywords: Computer arithmetic, low power design, multi-precision multiplier, Input Operands Scheduler (IOS), Look Up
Table (LUT).
1. Introduction
Nowadays, the demand for low power, high performance portable devices has been greatly increased. The
growing market of portable electronic systems demands microelectronic circuits design with low power dissipation.
The power dissipation mainly due to internal components [1]. In Digital Signal Processing applications, most
frequently used arithmetic operation is multiplication. So Multipliers play an important role in today’s digital signal
processing and various other applications. With advances in technology, many researchers have tried and are trying
to design multipliers which offer either of the following design targets – high speed, low power consumption and
hence less area or even combination of them in one multiplier thus making them suitable for various high speed, low
power VLSI implementation.
Since Multiplication is a fundamental operation in most signal processing applications, Multipliers used in these
applications have large area and consume considerable power. Therefore design of low-power multiplier has been an
important part in low- power VLSI system design .Fast multipliers are essential parts of digital signal processing
systems. The basic multiplication principle is two parts i.e., evaluation of partial products and accumulation of
partial products. Many DSP systems is frequently truncated output due to the fixed register size and bus width
inside the hardware and also disconnect the unused sections of multiplier to reduce dynamic power reduction can be
achieved [8]. Because of this significant power saving can be achieved by removing some adder cells. But in case of
ISSN: 2393-994X
KARPAGAM JOURNAL OF ENGINEERING RESEARCH (KJER)
Volume No.: III, Special Issue on IEEE Sponsored International Conference on Innovations in information,
Embedded and Communication Systems (ICIIECS'15)
computing Least Significant (LSB) bits of product, this will cause large truncation errors. Various error
compensation approaches and circuits can be used for the estimated compensation carries to the carry inputs of the
retained adder cells to reduce the truncation error.
2. Background
In Most applications are based on 8–16-b operands, the proposed multiplier is designed to not only perform
single 16-b but also performs single 8-b, or twin parallel 8-b multiplication operations.in some applications,16 and
32 bit operands are send to smaller multiplication circuit with parallel operation reduce power consumption and also
reduces area over head.
Due to the complex structure and interconnections, multipliers have large amount of unbalanced path which
causes unwanted signal generation and propagation. This can be avoided by proper internal balancing through
architectural and transistor level optimization.in most cases of multipliers, maximum word length is provided. Hence
small multiplications are done in large multipliers, this causes unwanted switching activity and also power
consumption. So word length optimization is the best method in which 8-bit multiplier is reused for 16-bit and 32bit multiplication [2], [3]. Here it is possible to incorporate the pipelining for increasing the speed of the multiplier.
Design of MP multiplier using operands scheduler and also storing min voltages for multiplication in LUT
provide dramatic power reduction and also reduce delay. A predictable decrease in area overhead also can be
achieved.in conventional DVS technique; LUT tunes to supply voltages which are stored as predefined voltage and
frequency relationship by considering all worst case conditions. It will consume more time and area. In razor based
DVS technique, minimum voltage required for multiplication is found using razor based feedback and also many
voltage transitions occurred during the calculation of particular product [1],[7]. Due to this voltage transitions power
consumptions and delay for product calculation is also increases. Because of dynamic voltage scaling unit,
increased number of preemptions and frequency switching occurs which leads to worst case power consumption and
delay and did not get optimum performance at various operating conditions.
The concept of this paper includes: the proposed MP multiplier reduces power consumption and delay and also
reduce the additional area overhead than conventional 32 x32 bit fixed width multiplier. This multiplier is also
consists of operand scheduler which rearranges the input operands and hence reduce voltage transition, thus provide
low power consumption.
3. Proposed Methodology
Figure 1 show over all multiplier system architecture. The overall multiplier system mainly consists mainly 5
units.1) input operand scheduler which rearranges the input data and hence reduce the supply voltage transition, thus
power consumption will be reduced.2) MP multiplier is one which performs multiplication with variable precision
and parallel processing.3) Look Up Table is like a storage element used to store voltage required for
multiplication.4) Frequency scaling unit (FSU) which provides required frequency for the multiplication 5) voltage
and frequency management unit (VFMU) which is receives user requirement and control the LUT and FSU by
giving suitable voltage and frequency.
All the five building blocks are working together to perform multiplication with variable precision. Input
operands are given through IOS blocks which rearrange the inputs to reduce voltage transitions and given to the MP
multiplier. This multiplier initially works at standard supply voltage of 3.3v. Since LUT stores the minimum voltage
for each combinations, depending on the incoming operands precision adjust the input voltage suitable for that
particular combination. LUT stores 1.3v for 8-bit multiplication, 1.7v for 16-b, and 2.2v for 32-b multiplication. The
multiplication corresponding frequencies are 1.25GHz, 0.63GHz and 0.56GHz respectively for 8-b, 16-b and 32-b
multiplication.
Figure 2 shows the proposed multiplier which consists nine 8x 8 bit multipliers. All these multipliers perform
individual 8x8 bit operation and also can perform parallel multiplication for 16-bit and 32-bit multiplication. The
processing elements of multiplier can either work as 9 independent multipliers or work in parallel to form 1, 2, or
ISSN: 2393-994X
KARPAGAM JOURNAL OF ENGINEERING RESEARCH (KJER)
Volume No.: III, Special Issue on IEEE Sponsored International Conference on Innovations in information,
Embedded and Communication Systems (ICIIECS'15)
three 16 x 16 bit multiplier or a single 32- bit multiplication operation. The pipelining method reduces delay and
also gets fast multiplication result without error.
Fig 1. Overall Multiplier System Architecture
Parallel processing is the ability of a device to simultaneously process incoming different inputs. Pipelining
increases instruction throughput by performing multiple operations at the same time (concurrently), but does not
reduce instruction latency (the time to complete a single instruction from start to finish) as it still must go through all
steps.
Fig 2. Possible Configuration Modes of MP Multiplier
3.1. Look Up Table
A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a customer or a
designer after manufacturing, hence field-programmable. The FPGA configuration is generally specified using a
hardware description language (HDL).Contemporary FPGAs have large resources of logic gates and RAM blocks to
implement complex digital computations. As FPGA designs employ very fast I/So and bidirectional data buses it
becomes a challenge to verify correct timing of valid data within setup time and hold time. FPGAs contain
programmable logic components called logic blocks, and a hierarchy of reconfigurable interconnects that allow the
ISSN: 2393-994X
KARPAGAM JOURNAL OF ENGINEERING RESEARCH (KJER)
Volume No.: III, Special Issue on IEEE Sponsored International Conference on Innovations in information,
Embedded and Communication Systems (ICIIECS'15)
blocks to be wired together. FPGA consists logic blocks with logic cells. This logic cell consist of 4 input look up
table used to implement various functions.
A look up table is an array that replaces runtime computation with a simpler array indexing operation. The
savings in terms of processing time can be significant, since retrieving a value from memory is often faster than
undergoing an expensive computation or input/output operation. The tables may be pre-calculated and stored in
static program storage, calculated as part of a program's initialization phase, or even stored in hardware in
application-specific platforms.
An FPGA has three main elements, Look-Up Tables (LUT), flip-flops, and the routing matrix, that all work
together to create a very flexible device. Look-up tables are how your logic actually gets implemented. A LUT
consist of some number of inputs and one output. What makes a LUT powerful is that you can program what the
output should be for every single possible input. A LUT consists of a block of RAM that is indexed by the LUT's
inputs. The output of the LUT is whatever value is in the indexed location in its RAM. Each LUT's output can be
optionally connected to a flip-flop. Groups of LUTs and flip-flops are called slices.
Here the minimum voltage required for the multiplication of each combination is stored in LUT. Depending on
the given input operands, corresponding voltage will be selected from the LUT and also the corresponding
frequency is adjusted by frequency scaling unit. Thus using the selected voltage and frequency, it is applied to MP
multiplier along with the input operands gives the corresponding results of multiplication.
3.2. Multi-Precision and Reconfigurability Property
The input interface unit of MP multiplier is shown in fig 3 which forms the sub module of multiplier. Its function
is to provide data to nine independent processing elements shown in fig 2 of the MP multiplier, depending on the
selected operating mode. A 2-bit mode control is used to indicate whether the inputs are of 8-bit operands (1/4/9
pairs), or 16-bit operands (1/2/3 pairs) or a single 32-bit operand. This unit uses an extra MSB sign bit for signed
and unsigned multiplication of incoming operands. The Processing Elements (PE) performs computation depending
on the selected operating modes, that is 8-bit, 16-bit, or 32-bit operation will be performed.
Fig 3.Input Interface Unit
Figure 4 shows how 3 processing elements of 8x 8 bit can combine to form a 16 x 16 bit multiplier and perform
16-bit multiplier operation. Also 32-bit multiplier can be formed by similar methods but it requires 3 x3 PE of 8 x 8
bit. The 2-bit select mode signal is used as control signal which is used to determine which PEs become active and
which are inactive. Based on this 2-bit signal proper precision will be selected and perform the multiplication. If
complete precision (32 x32 bit) is not exercised, the corresponding supply voltage and frequency may be scaled
down depending on the work load of MP multiplier.
Consider X and Y are 2n-bits wide multiplicand and multiplier; these are defined to evaluate the associated
overhead to MP and configurability. XH and YH are respective n MSBs and XL and YL are respective n LSBs. XL YL,
ISSN: 2393-994X
KARPAGAM JOURNAL OF ENGINEERING RESEARCH (KJER)
Volume No.: III, Special Issue on IEEE Sponsored International Conference on Innovations in information,
Embedded and Communication Systems (ICIIECS'15)
XH YL, XL YH, XH YH is corresponding cross products. This can be also represented as follows:
P= (XH YH) 22n + (XH YL+ XL YH)2n + XL YL
(1)
Fig 3. Three PEs combined to form 16 × 16 bit multiplier
Where XLYL, XH YL, XL YH, and XH YH can be computed using 4, n x n bit multipliers, 2n bit reconfigurable
multiplier can be constructed using adders. If
X1 = XH + XL
(2)
Y1 = YH + YL
(3)
Then equation (1) will be
P= (XH YH)22n + ( X1 Y1 - XH YH - XL YL)2n + XL YL
(4)
Compare the equations (1) & (4),one n x n bit multiplier for calculation of XH YL or XL YH and one 2n bit adder
for calculation of XH YL + XL YH can be removed. Also two n-bit adders and two (2n+2)-bit subtractors are provided
for calculation of XH + XL ,YH + YL and X1Y1 - XH YH - XL YL resp.
Evaluation of the proposed MP multiplier can be done by comparing it with 32bit fixed width multiplier and four
sub-block MP multipliers which are designed using Booth Radix-4 Wallace tree structure [4],[5]. This is done
similar to proposed MP multiplier that consists of 3 sub-block multipliers. The power simulations are performed at
clock frequency of 50 MHz and power supply voltage at 3.3v. From the evaluation, the proposed MP 3 sub-block
multiplier architecture achieved reduction in power and area compared to fixed width multiplier design.
The large size of fixed width multiplier provides an irregular and complex interconnects. This causes increase in
area and also additional power consumption. Table I shows the comparison between different multipliers at 50 MHz.
ISSN: 2393-994X
KARPAGAM JOURNAL OF ENGINEERING RESEARCH (KJER)
Volume No.: III, Special Issue on IEEE Sponsored International Conference on Innovations in information,
Embedded and Communication Systems (ICIIECS'15)
This proposed MP 3 sub-block multiplier architecture is better than other sub-block multipliers in most of
multiplication applications. The proposed MP three sub-block architecture can achieve 16% reduction in power and
28% reduction in area as compared with the conventional 32 × 32 bit fixed-width multiplier design. This limitation
of tree multipliers can be addressed by this proposed MP 32 × 32 bit multiplier, which uses a more regular design to
partition, regroup, and sum partial products. The proposed multiplier is easy to implement and more suitable for
DSP applications.
Table- 1: Area and Power Comparison of Proposed MP Multipliers against Conventional Fixed-Width Multiplier
Running At 50 MHz.
Schemes
Power(mW)
Area(mm2)
32-bit fixed width multiplier
39
0.624
32-bit 4 sub-block MP multiplier
33.36
0.736
32bit 3 sub-block MP multiplier
20
0.448
3.3. Frequency Scaling Unit
Frequency scaling unit of proposed MP multiplier is used for frequency tuning to meet the system throughput
requirements. This frequency unit is implemented using Voltage Controlled Oscillator (VCO) as a seven-stage
current starved ring oscillator. This is because proposed multiplier can operate either as a 32-bit multiplier or as nine
independent 8-bit multipliers. The power consumption of VCO for the frequency ranges is 85 to 149µW for 5MHz
to 50MHz. This power consumption is negligible compared to proposed multiplier power consumption. One clock
cycle is required to settle down the clock frequency.
The frequency scaling unit is one which equipped with VCO is used to select frequency for each combination of
multiplication. Depending on the control signal, it gives frequency that pre-calculated for 8 x 8bit, 16 x 16 bit and 32
x 32 bit for proper multiplication to reduce delay. Depending on the voltage VCO adjust the frequency. For each
combination of multiplication, we can select the corresponding suitable frequency.
3.4. Input Operands Scheduler
The input operands scheduler which rearranges the input data and hence reduce the supply voltage transition, thus
power consumption will be reduced. It consists of range detector, buffer (RAM), and a voltage and frequency
analyser. These help to rearrange the input and detect the precision and send to MP multiplier. Here proposed an
IOS that will perform the following tasks: 1) reorder the input data stream so that same-precision operands are
grouped together into a buffer and 2) takes the minimum supply and frequency from the LUT. Fig 5 shows Input
Operand Scheduler.
ISSN: 2393-994X
KARPAGAM JOURNAL OF ENGINEERING RESEARCH (KJER)
Volume No.: III, Special Issue on IEEE Sponsored International Conference on Innovations in information,
Embedded and Communication Systems (ICIIECS'15)
Fig 5. Input Operands Scheduler.
4. Simulation Results
The programming can be done by using VHDL and it can be simulated using Model Sims and also using Xilinx
ISE. Power analysis is done by using X-power analyser. The power consumption is obtained as 20.1mW. This can
be viewed from Xilinx X-power analyser. Comparing with the existing system 14.55 % of power reduction can be
achieved. By avoiding the razor flip-flop and the DVS unit, considerable delay reduction can be achieved. This can
be calculated from after running X-power analyser and observing the timing constraints. Delay of the proposed
system is obtained 19.517ns that is 9.67% reduction in delay can be achieved. Table II shows the comparison
between MP multiplier with LUT and MP multiplier with Razor flip-flop and DVS. The carry propagation adder is
replaced with carry select adder; hence the corresponding delay for the calculation of sum will be reduced. CSA
slightly increase the area, but considerable amount of reduction in delay achieved. When considering the reduction
that also helps for fast multiplication. When considering the reduction in delay, increase in area will be negligible.
Table 2: Comparison between MP Multiplier with LUT and MP Multiplier with Razor Flip-Flop and DVS
Schemes
MP multiplier with razor flip-flop and
DVS
MP multiplier with LUT
Power(mW)
Delay(ns)
23.52
21.605
20.1
19.517
5. Performance and Evaluation
The operation of multiplier is controlled by two external signals .i.e. operating frequency and voltage signal. These
two signals are automatically tuned to correct values depending on the actual workload. The simulation is done by
using giving input operands and comparing the results with a PC that gives true results. The 32-bit precision data
multiplication includes data word length from 17-32-bits. Also 16-bit and 8-bit data precision multiplication
includes the data word length of 9-16 bits and 0-8-bits. If assumption is that a 32 × 32 bit multiplier is built using
three 16 × 16 bit multipliers, then the area of the 32-bit multiplier is at least three times that of the 16-bit multiplier,
discarding reconfigurability logic. The proposed MP multiplier provides the most reconfigurability while exhibiting
the smallest relative area. Compared with the existing designs with the same maximum word-length of 32-bit the
design exhibit a much smaller area. Some design have their maximum word-length is 16-bit instead of 32-bit.
ISSN: 2393-994X
KARPAGAM JOURNAL OF ENGINEERING RESEARCH (KJER)
Volume No.: III, Special Issue on IEEE Sponsored International Conference on Innovations in information,
Embedded and Communication Systems (ICIIECS'15)
6. Conclusion
The proposed novel MP multiplier featuring with variable precision and parallel processing provides the
multiplication up to 32-bit without much delay and power consumption than the any other fixed width multiplier.
This architecture replaced the razor flip-flop and VSU and thus reduced delay and power consumption with
considerable amount. The MP multiplier consists of an operands scheduler that rearranges the input to reduce the
number of transitions of the supply voltage and thus minimized the overall power consumption of the multiplier.
This MP multiplier can achieve 14.55% reduction in power consumption and 9.67% of delay reduction. The
proposed MP multiplier based on LUT and DVFM provided a solution to achieve reduced delay, low power
consumption and full computational flexibility for various general purpose low power DSP applications.
Acknowledgements
First of all, we offer thanks to our parents for their blessings. We are indebted to God Almighty for blessing us
with His grace and taking our endeavour to a successful culmination. Our sincere thanks to the experts who have
contributed towards the development of the paper. We, finally, thank all our friends and well-wishers who had
supported us directly and indirectly during the work.
References
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
Xiaoxiao Zhang, and Amine Bermak,”32 Bit×32 Bit Multiprecision Razor-Based DynamicVoltage Scaling
Multiplier With Operands Scheduler”,IEEE Trans. Very Large ScaleIntegr. (VLSI) Syst., vol. 22, no. 4,Apr
2014.
M. Bhardwaj, R. Min, and A. Chandrakasan, “Quantifying and enhancing power awareness of VLSI
systems,” IEEE Trans. Very Large ScaleIntegr. (VLSI) Syst., vol. 9, no. 6, pp. 757–772, Dec. 2001.
A.Wang and A. Chandrakasan,“Energy-aware architectures for a realvalued FFT implementation,” in Proc.
IEEE Int. Symp. Low Power Electron. Design, Aug. 2003.
H. Lee, “A power-aware scalable pipelined booth multiplier,” in Proc. IEEE Int. SOC Conf., Sep. 2004, pp.
123–126.
S.-R. Kuang and J.-P. Wang, “Design of power-efficient configurable booth multiplier,” IEEE Trans. Circuits
Syst. I, Reg. Papers, vol. 57,no. 3, pp. 568–580, Mar. 2010.
T. Yamanaka and V. G. Moshnyaga, “Reducing multiplier energy by data-driven voltage variation,” in Proc.
IEEE Int. Symp. Circuits Syst., May 2004, pp. 285–288.
Nakai, S. Akui, K. Seno, T. Meguro, T. Seki, T. Kondo,A. Hashiguchi, H. Kawahara, K. Kumano, and M.
Shimura, “Dynamic voltage and frequency management for a low-power embedded microprocessor,”IEEE J.
Solid-State Circuits, vol. 40, no. 1, pp. 28–35, Jan. 2005.
S. D. Haynes, A. Ferrari, and P. Y. K. Cheung, “Flexible reconfigurable multiplier blocks suitable for
enhancing the architecture of FPGAs,” in Proc. IEEE Custom Integr. Circuits, May 1999, pp. 191–194.