CHAPTER - 4
IMPLEMENTATION OF 128 POINT FFT PROCESSOR
USING R2SDF ARCHITECTURE
Table of Contents
4.0 IMPLEMENTATION OF 128 POINT FFT PROCESSOR USING R2SDF ARCHITECTURE
4.1 Introduction
4.2 Architecture of Pipelined FFT
4.2.1 Architecture of Radix-2 FFT
4.2.2 Architecture of Radix-2/4/8 FFT
4.2.3 Architecture of Proposed FFT of variable length
4.3 Considerations of Architecture
4.3.1 Architectures of FFT – A comparison
4.3.2 CORDIC Vs Complex Multiplier
4.4 Circuit Design
4.4.1 Minimization of Word-length
4.4.2 Delay Line based on RAM
4.4.3 Current-mode technique of SRAM
4.4.4 Twiddle-factor ROM and Complex Multiplications
4.5 Results and Discussion
4.6 Summary
4.1 INTRODUCTION
This chapter describes the implementation of a speed- and power-efficient 128-point
Fast Fourier Transform (FFT) processor for OFDM applications. The design combines a
reduced twiddle-factor ROM with variable-length operation from 128 to 2048 points, so
that it can serve applications such as digital transmission of video or audio signals
and high-speed or asymmetric digital subscriber loops. The 128-point architecture is
built from Radix-2 FFT stages in the single delay feedback (SDF) style. Current-mode
SRAMs replace the shift registers in the delay lines to reduce power consumption and
area. The chip consumes 176 mW at 2.3 V and a frequency of 17.8 MHz.
There is a great demand today for low-power, high-speed FFT processors in OFDM
applications. FFT processor architectures are divided into three types. The first is
the single-memory architecture, which has a single butterfly and a single memory
element. The second is the dual-memory architecture, which has two memories and
achieves a higher throughput than the single-memory architecture. The third is the
pipelined architecture, which is discussed in Section 4.2. Recent improvements in
semiconductor technology have paved the way for FFT implementations in areas such as
wideband communications, signal processing and biomedical instrumentation. OFDM has
become popular in various communication systems owing to its capability of combating
spectrum and channel effects, so an appropriate FFT implementation is essential for
OFDM applications. The FFT sizes required by the standards of different communication
systems are shown in Table 4.1. The table shows that a variable-length FFT processor
is an indispensable component in providing a common solution for these communication
systems.
Table 4.1 Various Communication Systems Vs Size of FFT

S No   System   Size of FFT
1      ADSL     512
2      DAB      2048, 1024, 512, 256
3      VDSL     8192, 4096, 2048, 1024, 512
4      DVB      8192, 2048

The hardware implementation of an N-point FFT algorithm is intensive in both
arithmetic operations and data swapping. In general, O(log N) arithmetic operations
are needed per clock cycle to process the FFT. High-speed FFT processing can be
obtained in two ways. In a general-purpose processor, the computation is done by a
single processor at a clock frequency greater than the sample frequency of the data.
In a special-purpose processor, the computation is done at the sample frequency itself
and therefore consumes less power. In this chapter, various optimization methods are
applied to the architecture to obtain a power- and area-efficient FFT processor.
4.2 ARCHITECTURE OF PIPELINED FFT
4.2.1 Architecture of Radix-2 FFT
The Radix-2 multiple delay commutator (R2MDC) is a pipelined architecture for the
Radix-2 FFT, as shown in Figure 4.1. A commutator decomposes the input data sequence
into two streams. The butterfly operation and the twiddle-factor multiplication are
then executed with appropriate scheduling of both data streams. This architecture
needs log2 N Radix-2 butterfly elements, (log2 N − 2) multipliers and (3N/2 − 2) delay
elements. The Radix-2 single delay feedback (R2SDF) architecture, shown in Figure 4.2,
uses the delay elements more efficiently by sharing the same storage between the
inputs and outputs of the butterfly. The multiplier sees a single data stream at each
stage. This scheme has the same number of butterfly elements and multipliers as the
Radix-2 multiple delay commutator but requires only (N − 1) delay elements. The
multipliers and butterfly elements are utilized at 50 % because they are bypassed
half of the time.
Fig. 4.1 Radix-2 Multiple Delay Commutator
Fig. 4.2 Radix-2 Single Delay Feedback
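The resource counts quoted above for the two Radix-2 pipelines can be tabulated for a given N. The following is a minimal sketch that simply evaluates those formulas; the function name `pipeline_resources` and the dictionary layout are illustrative assumptions, not part of the original design.

```python
def pipeline_resources(n: int) -> dict:
    """Evaluate the element counts quoted in the text for an N-point Radix-2 pipeline.

    Hypothetical helper: butterflies = log2 N, multipliers = log2 N - 2,
    delays = 3N/2 - 2 (R2MDC) or N - 1 (R2SDF).
    """
    if n < 4 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 4")
    stages = n.bit_length() - 1          # exact log2 for a power of two
    return {
        "R2MDC": {"butterflies": stages, "multipliers": stages - 2,
                  "delays": 3 * n // 2 - 2},
        "R2SDF": {"butterflies": stages, "multipliers": stages - 2,
                  "delays": n - 1},
    }


if __name__ == "__main__":
    for size in (128, 512, 2048):
        print(size, pipeline_resources(size))
```

For N = 128 this gives 7 butterflies and 5 multipliers for both styles, with 190 delay elements for R2MDC against 127 for R2SDF, which is the saving that motivates the feedback style.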
4.2.2 Architecture of Radix-2 / 4 / 8 FFT
The Discrete Fourier Transform of an N-point data stream is defined by

$X(z) = \sum_{n=0}^{N-1} x(n)\, W^{nz}, \quad z = 0, 1, 2, \ldots, N-1$    (4.1)

where $W^{nz} = e^{-j 2\pi nz / N}$. The basic idea of the Radix-2 FFT is to exploit the
symmetry between the twiddle factors $W^{nz}$ and $W^{nz+N/2}$, which differ only in
sign, for simplification. Multiplication by the twiddle factors $W^{N/8}$, $W^{3N/8}$,
$W^{5N/8}$ and $W^{7N/8}$ can also be simplified because their real and imaginary parts
have equal magnitude. The multiplication by these twiddle factors is given by
equations (4.2) to (4.4).
$(a + jb)\, W^{N/8} = -(a + jb)\, W^{5N/8}$    (4.2)

$\qquad = \frac{\sqrt{2}}{2} \left[ (a + b) + j(b - a) \right]$    (4.3)

$(a + jb)\, W^{3N/8} = -(a + jb)\, W^{7N/8} = \frac{\sqrt{2}}{2} \left[ (b - a) - j(a + b) \right]$    (4.4)
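The simplification behind equations (4.2) to (4.4) can be checked numerically. The sketch below assumes the twiddle-factor definition W^k = exp(-j2πk/N) from equation (4.1) and compares a full complex multiplication with the form that needs only additions and one scaling by √2/2. The helper names `W`, `times_w_n8` and `times_w_3n8` are illustrative.

```python
import cmath
import math

N = 128  # any multiple of 8 works; 128 matches the FFT size of this chapter


def W(k: int) -> complex:
    """Twiddle factor W^k = exp(-j*2*pi*k/N)."""
    return cmath.exp(-2j * math.pi * k / N)


def times_w_n8(a: float, b: float) -> complex:
    """(a + jb) * W^(N/8) computed as sqrt(2)/2 * [(a + b) + j(b - a)]."""
    c = math.sqrt(2) / 2
    return complex(c * (a + b), c * (b - a))


def times_w_3n8(a: float, b: float) -> complex:
    """(a + jb) * W^(3N/8) computed as sqrt(2)/2 * [(b - a) - j(a + b)]."""
    c = math.sqrt(2) / 2
    return complex(c * (b - a), -c * (a + b))


a, b = 3.0, -5.0
x = complex(a, b)

# Direct multiplication agrees with the simplified forms ...
assert cmath.isclose(x * W(N // 8), times_w_n8(a, b), abs_tol=1e-12)
assert cmath.isclose(x * W(3 * N // 8), times_w_3n8(a, b), abs_tol=1e-12)
# ... and the remaining two twiddle factors are just the negated results.
assert cmath.isclose(x * W(5 * N // 8), -times_w_n8(a, b), abs_tol=1e-12)
assert cmath.isclose(x * W(7 * N // 8), -times_w_3n8(a, b), abs_tol=1e-12)
```

Each of the four products therefore costs two real additions and two real multiplications by the same constant, instead of a general complex multiplication.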
Fig. 4.3 Signal flow graph of Radix-2 / 4 / 8 FFT
The signal flow graph of the Radix-2/4/8 FFT is shown in Figure 4.3 [5]. This
algorithm implements the Radix-8 butterfly as three Radix-2 stages instead of a single
butterfly element. Hence its signal flow graph is the same as that of the Radix-2^3
FFT [3]. A Radix-2/4/8 architecture can be obtained by suitably modifying the Radix-2
single delay feedback architecture. The processing elements PE1, PE2 and PE3 are
employed to process the successive stages of the FFT. The architecture consists of
repeated groups, each made up of the processing elements and a multiplier for the
twiddle-factor multiplication. The required number of delay elements is reduced by
50 % in each stage. The block diagrams of the processing elements are shown in
Figure 4.4.
Fig. 4.4 Processing Elements in Radix-2/4/8 FFT
4.2.3 Architecture of proposed FFT of variable length
To decrease the power consumption and area of the chip, it is important to select an
FFT algorithm with low computational complexity and an architecture with low hardware
complexity. The block diagram of the Radix-2/4/8 SDF-based variable-length FFT
processor is shown in Figure 4.5. This processor can perform FFT operations of
various lengths. To accommodate different numbers of FFT stages, the first two stages
are Radix-2 processing elements with a structure similar to that of processing
element PE3. Each block is composed of a group of processing elements (PE1, PE2 and
PE3) and a multiplier, as shown in Figure 4.5. For instance, if a 512-point FFT is to
be performed, the input signals bypass the first two stages through the multiplexer
MUX2. If a 128-point FFT is to be executed, the first stage is skipped by means of
the multiplexer MUX1.
Fig. 4.5 Block diagram of FFT processor of variable length
4.3 CONSIDERATIONS OF ARCHITECTURE
4.3.1 Architectures of FFT – a comparison
In the past decade, various FFT architectures have been proposed with the objective
of achieving high-speed FFT implementation. The important features of the
Radix-2/4/8 FFT and of several other architectures are listed in Table 4.2 [2], which
compares their memory requirements and computational complexity [3]. It is evident
that the number of multiplications decreases as the radix increases. Further, in
bit-parallel operation, higher-radix algorithms make better use of the multiplier
hardware.
Table 4.2 Comparison of various architectures of FFT

Proposed Chip / Parameters           Bit-Parallel Architectures      Digit-Serial Architectures
Radix number                         Radix-2/4/8    Radix-4          Radix-2/4/8       Radix-4
Type of style                        Feedback       Feedback         Feed Forward      Feed Forward
Complex Adder (count)                2 log2 N       4 log4 N         12 log4 N         8 log4 (N+1)
Utilization of Complex Adder         50 %           50 %             100 %             100 %
Complex Multiplier (count)           log8 N − 1     log4 N − 1       3 (log4 N − 1)    3 (log4 N − 1)
Utilization of Complex Multiplier    87.5 %         75 %             100 %             100 %
Size of Memory                       N − 1          N − 1            2.5 N             1.18 N
ROM for Twiddle Factor               0.25 N         N                N                 0.5 N

In digit-serial architectures, the data word length of the adders and multipliers is
reduced to one digit, so that only small adders are required. The word length of both
architectures should match the throughput of the Radix-4 commutator to obtain full
utilization of the adders and multipliers. A complex multiplier occupies far more
area than a complex adder, so a considerable hardware saving is obtained by using a
small number of multipliers. The feedback architecture requires fewer memory
elements, whereas the feed-forward architecture needs many more. A large number of
ROMs is not needed if the number of multiplications is reduced appropriately. The ROM
of the first multiplier stores the twiddle values with a spacing of 2π/N, and the
spacing increases in the later stages. If the symmetry of the sine and cosine
functions is exploited efficiently, the ROM size can be reduced considerably.
In the suggested architecture, the ROMs store only 1/8 cycle of the sine and cosine
waveforms, and the symmetry of the twiddle factors is exploited within each group,
resulting in a smaller ROM table. Hence this architecture can be employed to
implement a variable-length FFT processor with low hardware complexity. It provides a
trade-off between hardware complexity and performance when compared with other
architectures.
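The 1/8-cycle storage idea can be illustrated in software. The sketch below is a floating-point model under assumed names (`COS_ROM`, `SIN_ROM`, `twiddle`); it is not the chip's ROM organisation, which the text does not detail. It stores sine and cosine only for angles from 0 to π/4 and rebuilds every twiddle factor W^k = exp(-j2πk/N) from octant and quadrant symmetry.

```python
import cmath
import math

N = 128                      # FFT length; must be a multiple of 8
EIGHTH = N // 8

# "ROM": cosine and sine sampled over one eighth of a cycle (angles 0 .. pi/4).
COS_ROM = [math.cos(2 * math.pi * i / N) for i in range(EIGHTH + 1)]
SIN_ROM = [math.sin(2 * math.pi * i / N) for i in range(EIGHTH + 1)]


def twiddle(k: int) -> complex:
    """Return W^k = exp(-j*2*pi*k/N) using only the 1/8-cycle tables."""
    k %= N
    quadrant, r = divmod(k, N // 4)      # angle = quadrant*pi/2 + 2*pi*r/N
    if r <= EIGHTH:                      # first octant of the quadrant
        c, s = COS_ROM[r], SIN_ROM[r]
    else:                                # second octant: sine and cosine swap roles
        c, s = SIN_ROM[N // 4 - r], COS_ROM[N // 4 - r]
    # Rotate (cos, sin) by quadrant * 90 degrees.
    if quadrant == 1:
        c, s = -s, c
    elif quadrant == 2:
        c, s = -c, -s
    elif quadrant == 3:
        c, s = s, -c
    return complex(c, -s)


# Every twiddle factor matches the direct formula.
assert all(
    cmath.isclose(twiddle(k), cmath.exp(-2j * math.pi * k / N), abs_tol=1e-12)
    for k in range(N)
)
```

In this model each table holds N/8 + 1 entries, compared with N entries each for a full-cycle sine and cosine table.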
4.3.2 CORDIC Vs Complex multiplier
Since the Coordinate Rotation Digital Computer (CORDIC) algorithm is efficient at
vector rotation, it can be employed for the twiddle-factor multiplication in FFT
processors [4]. The performance of a complex multiplier and of the CORDIC algorithm
are evaluated and compared in this sub-section. The precision is set to 16 bits in
all algorithms, with 19 bits allocated in the data path of the CORDIC design to limit
the propagation of rounding errors. Generally, a ROM table in this algorithm stores
the rotation sequences as N/4 words of 16 bits each. In each stage, two 19-bit adders
are employed for the micro-rotation, so the architecture requires 2 × 16 × 19 full
adders for the 16 stages of micro-rotation. The scaling factor of 0.100110110111010
(binary) is used for the additional multiplications in the final stage, which require
2 × 9 × 19 adders. Without pipelining, the critical-path delay of every
micro-rotation stage is 19 times the delay of a full adder. The size of the ROM table
is decreased to 8 words (1 word = 23 bits) due to the use of a higher radix in the
subsequent stages. The CORDIC outputs are represented in the arithmetic format and
are converted to binary format by carry-look-ahead adders once the butterfly
operation has been performed [7].
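To make the micro-rotation and scaling steps concrete, here is a minimal floating-point sketch of CORDIC rotation with 16 iterations and the binary scaling constant quoted above. It models only the arithmetic, not the 19-bit fixed-point data path or the adder structure of the design; the names `cordic_rotate`, `ANGLES` and `SCALE` are illustrative assumptions.

```python
import math

ITERATIONS = 16
# Elementary rotation angles atan(2^-i), one per micro-rotation stage.
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
# Scaling factor 0.100110110111010 (binary) as quoted in the text, about 0.60724.
SCALE = int("100110110111010", 2) / 2 ** 15


def cordic_rotate(x: float, y: float, angle: float) -> tuple:
    """Rotate the vector (x, y) by `angle` radians (|angle| below about 1.74).

    Each micro-rotation uses only a shift (multiplication by 2^-i) and an
    add or subtract; the accumulated gain is removed by SCALE at the end.
    """
    z = angle
    for i, step in enumerate(ANGLES):
        d = 1.0 if z >= 0 else -1.0          # rotation direction for this stage
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * step
    return x * SCALE, y * SCALE


# A twiddle multiplication by W^k is a rotation by -2*pi*k/N.
N, k = 128, 5
re, im = cordic_rotate(1.0, 0.0, -2 * math.pi * k / N)
assert abs(re - math.cos(2 * math.pi * k / N)) < 1e-3
assert abs(im + math.sin(2 * math.pi * k / N)) < 1e-3
```

The quoted constant has nine non-zero bits, which is consistent with the factor of 9 in the adder count given for the scaling step.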
In the suggested chip, a complex multiplication involves five additions and three
multiplications. The additions are performed by carry-select adders with a maximum
delay of about 8 TFA, where TFA is the delay of one full adder. Each carry-select
adder employs 30 full adders for the initial 16-bit addition and 63 full adders for
the final 33-bit addition. The critical-path delay is decreased to 7 TFA by using
Wallace tree multipliers. A Wallace tree multiplier requires nearly 280 full adders,
and two pipeline stages are included, one at the start and one at the end of the
multiplication. The CORDIC algorithm is slower without pipelining, whereas Wallace
tree multiplication shortens the critical path of the complex multiplier. Considering
the trade-off between speed and area, the FFT processor can consume less power if a
complex multiplier is employed for the twiddle-factor multiplication.
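A complex product with three real multiplications and five real additions can be arranged in several ways. The sketch below shows one well-known arrangement; the text states only the operation counts, so the exact form used on the chip is an assumption here, and the function name `cmul_3m5a` is illustrative.

```python
def cmul_3m5a(a, b, c, d):
    """Return (real, imag) of (a + jb) * (c + jd) with 3 multiplications, 5 additions."""
    k1 = c * (a + b)        # addition 1, multiplication 1
    k2 = a * (d - c)        # addition 2, multiplication 2
    k3 = b * (c + d)        # addition 3, multiplication 3
    real = k1 - k3          # addition 4: ac - bd
    imag = k1 + k2          # addition 5: ad + bc
    return real, imag


# (3 - 5j) * (2 + 7j) = 41 + 11j
assert cmul_3m5a(3, -5, 2, 7) == (41, 11)
```

When (c + jd) is a stored twiddle factor, the terms (d − c) and (c + d) depend only on the twiddle factor, which is one reason this arrangement is attractive for FFT hardware.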
4.4 CIRCUIT DESIGN
The variable-length FFT processor must be designed for low power consumption if it is
to serve as an indispensable component in OFDM systems.
4.4.1 Minimization of Word-length
In the architecture of this FFT processor, the word lengths of the different signals
are reduced according to their signal-to-noise ratio (SNR) requirements. To determine
the optimum word length, input signals with Gaussian noise are fed into the FFT in
fixed-point arithmetic. The frequency-domain output signals are collected and the SNR
is calculated. Figure 4.6 (a) shows the output SNR with respect to the input word
length of the FFT under various input SNR conditions. From these results, the input
word length is chosen to be 9 bits. The output SNR with respect to the word length of
the twiddle factors, which determines the accuracy of the sine and cosine tables, is
shown in Figure 4.6 (b) for an input SNR of 30 dB. Accordingly, a 9-bit word length
is selected for the twiddle factors. The word-length reduction proceeds component by
component until all word lengths inside the processor have been determined. To limit
power consumption, true-single-phase-clock (TSPC) flip-flops are employed in the ring
counters.
Fig. 4.6 (a) Output SNR with respect to input word length of FFT
Fig. 4.6 (b) Output SNR with respect to word length of Twiddle factors
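The word-length study described in Section 4.4.1 can be approximated in software. The sketch below is a much-simplified stand-in for the simulation behind Figure 4.6: it quantizes only the input samples, uses a direct floating-point DFT in place of the fixed-point pipeline, and draws uniform random inputs rather than signals with Gaussian noise. All names (`quantize`, `dft`, `output_snr_db`) and parameter choices are assumptions for illustration.

```python
import cmath
import math
import random


def quantize(v: float, bits: int) -> float:
    """Round v (expected in [-1, 1)) to signed fixed point with `bits` bits."""
    scale = 2 ** (bits - 1)
    q = max(-scale, min(scale - 1, round(v * scale)))
    return q / scale


def dft(x: list) -> list:
    """Direct N-point DFT, used here as a reference transform."""
    n = len(x)
    w = [cmath.exp(-2j * math.pi * i / n) for i in range(n)]
    return [sum(x[t] * w[(t * k) % n] for t in range(n)) for k in range(n)]


def output_snr_db(bits: int, n: int = 128, seed: int = 0) -> float:
    """Output SNR (dB) when only the input samples are quantized to `bits` bits."""
    rng = random.Random(seed)
    x = [complex(rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)) for _ in range(n)]
    xq = [complex(quantize(v.real, bits), quantize(v.imag, bits)) for v in x]
    ref, out = dft(x), dft(xq)
    signal = sum(abs(r) ** 2 for r in ref)
    error = sum(abs(r - o) ** 2 for r, o in zip(ref, out))
    return 10 * math.log10(signal / error)


if __name__ == "__main__":
    for bits in (6, 8, 9, 10, 12):
        print(f"{bits:2d} bits -> output SNR {output_snr_db(bits):.1f} dB")
```

In this model the output SNR rises by roughly 6 dB per additional bit, so a word length is chosen by finding the point at which quantization noise no longer limits the output SNR for the expected input SNR.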
4.4.2 Delay Line based on RAM
The FFT processor based on the single delay feedback (SDF) architecture requires a
large number of delay lines, which are conventionally implemented as shift registers,
as shown in Figure 4.7 (a). The data moves forward at every clock edge and 50 % of
the registers change state, wasting a large amount of power. SRAM is therefore
employed in place of the shift registers to reduce power consumption and area. A
delay line must perform both a read and a write in a single clock cycle, which would
normally call for a dual-port memory. Instead, two single-port SRAMs are implemented,
since a single-port memory saves 33 % of the area of a dual-port memory. The use of a
single-port SRAM is illustrated in Figure 4.7 (b). The read and write operations are
performed in the first and second halves of the clock cycle respectively. Two
registers are inserted, one before and one after the processing element, to prevent
the SRAM data access from becoming the critical path.
Fig. 4.7 (a) Delay line based on shift-register (b) Delay line based on SRAM
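The difference between the two delay-line styles can be sketched behaviourally. In the model below, which is an illustration and not the chip's circuit, both classes delay the input by `depth` samples; the shift register rewrites every word on each clock, while the RAM version reads and then writes a single address selected by a wrapping counter. The class names and the `word_writes` counter, used here as a crude proxy for switching activity, are assumptions.

```python
class ShiftRegisterDelay:
    """Delay line in which every stored word moves one place per clock."""

    def __init__(self, depth: int):
        self.regs = [0] * depth
        self.word_writes = 0

    def step(self, x):
        out = self.regs[-1]
        self.regs = [x] + self.regs[:-1]      # all registers are rewritten
        self.word_writes += len(self.regs)
        return out


class SramDelay:
    """Delay line built on a single-port RAM used as a circular buffer."""

    def __init__(self, depth: int):
        self.mem = [0] * depth
        self.addr = 0                          # plays the role of the address counter
        self.word_writes = 0

    def step(self, x):
        out = self.mem[self.addr]              # first half of the cycle: read
        self.mem[self.addr] = x                # second half of the cycle: write
        self.word_writes += 1
        self.addr = (self.addr + 1) % len(self.mem)
        return out


depth = 4
samples = list(range(1, 11))
sr, ram = ShiftRegisterDelay(depth), SramDelay(depth)
sr_out = [sr.step(s) for s in samples]
ram_out = [ram.step(s) for s in samples]
assert sr_out == ram_out == [0, 0, 0, 0, 1, 2, 3, 4, 5, 6]
assert sr.word_writes == depth * ram.word_writes
```

Because the read and the write target the same address within one cycle, a single-port memory is sufficient in this model, which mirrors the read-then-write scheme described for Figure 4.7 (b).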
4.4.3 Current-mode technique of SRAM
This technique is employed for reading and writing the contents of the SRAM cell in
order to decrease power consumption [8]. The voltage levels on the bit lines and data
lines of the SRAM are kept very low, which significantly reduces the dynamic power
dissipation. The current-mode SRAM cell comprises seven transistors, one more than
the traditional six-transistor cell, as shown in Figure 4.8 (a). This additional
transistor, Meq, equalizes the output voltages of the two inverters so that a small
current difference can be detected through the access transistors activated by the
word enable signal.
Fig. 4.8 (a) 7 T SRAM cell based on current-mode (b) SRAM write circuit
When the additional transistor is off, the SRAM cell functions as a traditional
six-transistor memory cell. During a write access, the current difference ΔI is
available on the write data lines ‘wdp’ and ‘wdn’. The N-type current conveyor shown
in Figure 4.8 (b) is activated by the signal ‘WY’. The bit lines ‘blp’ and ‘bln’ then
carry the currents without any attenuation. Since the control signal ‘WY’ is
activated, the circuit between ‘wdp’ and ‘wdn’ is closed. The voltages at the data
lines ‘wdp’ and ‘wdn’ are equal to VDD − (V1 + V2). Hence the voltage levels on these
data lines are kept very low. The read operation is performed by a sense amplifier
and a column decoder in this SRAM.
4.4.4 Twiddle-factor ROM and Complex Multiplications
In this FFT processor, each multiplication by the twiddle factors $W^{N/8}$,
$W^{3N/8}$, $W^{5N/8}$ and $W^{7N/8}$ is reduced to two real multiplications by
$\sqrt{2}/2$, which can be further reduced to shift-and-add operations. The block
diagram of the twiddle-factor ROM is shown in Figure 4.9.
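A multiplication by √2/2 can be carried out with shifts and adds once the constant is written in binary. The sketch below assumes an approximation with 8 fractional bits, √2/2 ≈ 181/256 (binary 0.10110101), chosen here for illustration; the text does not state which coefficient precision or adder arrangement the chip uses. The function names are hypothetical.

```python
def mul_sqrt2_over_2(x: int) -> int:
    """Approximate x * sqrt(2)/2 as x * 181 / 256 using shifts and adds only."""
    # 181 = 128 + 32 + 16 + 4 + 1
    acc = (x << 7) + (x << 5) + (x << 4) + (x << 2) + x
    return acc >> 8          # arithmetic shift, rounds toward minus infinity


def times_w_n8_fixed(a: int, b: int) -> tuple:
    """(a + jb) * W^(N/8) on integers: two additions, two constant multiplications."""
    return mul_sqrt2_over_2(a + b), mul_sqrt2_over_2(b - a)


assert mul_sqrt2_over_2(256) == 181
assert mul_sqrt2_over_2(1000) == 707       # exact value is about 707.1
```

With this constant each real multiplication costs four additions, and a twiddle multiplication by $W^{N/8}$ needs no general multiplier at all.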
4.5 RESULTS AND DISCUSSION
The complete architecture, excluding the SRAM components, is implemented in a
hardware description language (VHDL). The SRAM layout, comprising ring counters,
control units and SRAM cells, is designed separately. This FFT processor is
implemented in a 0.35 µm technology. The die photo of the chip is shown in
Figure 4.10. The multipliers and the processing elements are marked ‘MUT’ and ‘Ux’
respectively, along with their twiddle-factor ROMs. Taking the circuit overheads into
consideration, the longer delay lines are realized with SRAM and the shorter ones
with registers. The complete details of the chip are given in Table 4.3. The FFT
processor dissipates 176 mW at an operating frequency of up to 17.8 MHz at 2.3 V, and
640 mW at an operating frequency of up to 45 MHz at 3.3 V. It is difficult to make a
fair comparison between designs in different CMOS technologies, because they differ
in FFT size, type of algorithm, supply voltage, power consumption, clock rate and so
on [1]. Three parameters are therefore adopted for comparison, with the assumption
that a 128-point FFT is performed. They are specified by

Normalized area = Area of 128-point FFT / Technology    (4.5)
FFT/Energy = Technology / (Power of 128-point FFT × Execution Time × 10^-6)    (4.6)
Energy × Time = Execution Time / (FFT/Energy)    (4.7)
Fig. 4.9 Block diagram of Twiddle-factor ROM
Fig. 4.10 Die photo of suggested FFT processor
Table 4.3 Complete Summary of Chip

Process                               TSMC 0.35 µm 1P4M
Area                                  3.9 mm × 5.5 mm
Transistor count                      598,078
Maximum frequency                     45 MHz at 3.3 V
Power consumption (at high speed)     640 mW (at 45 MHz, 3.3 V)
Power consumption (at low voltage)    176 mW (at 17.8 MHz, 2.3 V)
Package                               68 PGA

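As a rough cross-check on the two operating points in Table 4.3, the energy drawn per clock cycle can be obtained by dividing power by clock frequency. This figure is not one of the comparison parameters used in the chapter; it is added here only as an illustrative calculation, and the names below are assumptions.

```python
# Operating points taken from Table 4.3.
OPERATING_POINTS = {
    "high speed (3.3 V)": {"power_mw": 640.0, "freq_mhz": 45.0},
    "low voltage (2.3 V)": {"power_mw": 176.0, "freq_mhz": 17.8},
}


def energy_per_cycle_nj(power_mw: float, freq_mhz: float) -> float:
    """Energy per clock cycle in nanojoules (mW divided by MHz gives nJ)."""
    return power_mw / freq_mhz


if __name__ == "__main__":
    for name, point in OPERATING_POINTS.items():
        print(f"{name}: {energy_per_cycle_nj(**point):.2f} nJ per cycle")
```

On these figures the low-voltage setting uses about 9.9 nJ per cycle against about 14.2 nJ at 3.3 V, so lowering the supply reduces the energy per cycle as well as the power.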
It is inferred from Table 4.3 that the suggested chip achieves a reduction in both
area and the energy-time product given by equation (4.7). The FFT processor executes
more slowly at 1.1 V, and this voltage rules it out for high-speed applications in
spite of its good energy efficiency.
4.6 SUMMARY
In this chapter, the architecture of an FFT processor has been discussed that
performs complex FFTs of variable length, 128 points or more, and is suitable for
communication systems such as digital transmission of audio or video signals and
digital subscriber loops. The suggested FFT processor achieves high hardware
utilization and low power consumption. The single delay feedback (SDF) FFT
architecture needs the least number of delay elements, and the Radix-2 FFT employs
shift-and-add operations in place of complex multipliers. The processor was realized
in a 0.35 µm technology. The experimental results indicate that the processor
consumes 640 mW at 3.3 V at up to 45 MHz, but only 176 mW at 17.8 MHz when the supply
voltage is reduced to 2.3 V.