CHAPTER - 4
IMPLEMENTATION OF 128 POINT FFT PROCESSOR USING R2SDF ARCHITECTURE

Table of Contents
4.0 IMPLEMENTATION OF 128 POINT FFT PROCESSOR USING R2SDF ARCHITECTURE
4.1 Introduction
4.2 Architecture of Pipelined FFT
    4.2.1 Architecture of Radix-2 FFT
    4.2.2 Architecture of Radix-2/4/8 FFT
    4.2.3 Architecture of Proposed FFT of Variable Length
4.3 Considerations of Architecture
    4.3.1 Architectures of FFT - A Comparison
    4.3.2 CORDIC Vs Complex Multiplier
4.4 Circuit Design
    4.4.1 Minimization of Word-length
    4.4.2 Delay Line based on RAM
    4.4.3 Current-mode Technique of SRAM
    4.4.4 Twiddle-factor ROM and Complex Multiplications
4.5 Results and Discussion
4.6 Summary

4.1 INTRODUCTION

This chapter describes the implementation of a 128-point Fast Fourier Transform (FFT) processor that is efficient in both speed and power for OFDM applications. The design incorporates a reduced twiddle-factor ROM and supports variable transform lengths from 128 to 2048 points, serving applications such as digital transmission of video or audio signals and high-speed or asymmetric digital subscriber loops. The 128-point architecture is built from Radix-2 FFT stages in the single delay feedback (SDF) style. Current-mode SRAMs replace the shift registers in the delay lines to reduce power consumption and area. The chip consumes 176 mW at 2.3 V and a frequency of 17.8 MHz.

There is a strong demand nowadays for low-power, high-speed FFT processors in OFDM applications. The architectures used for FFT processors are divided into three types. The first type is the single-memory architecture, which has a single butterfly and a single memory element. The second type is the dual-memory architecture, which has two memories and achieves a higher throughput than the single-memory architecture.
Recent improvements in semiconductor technology have paved the way for FFT implementations in areas such as wideband communications, signal processing and biomedical instrumentation. OFDM has become popular in various communication systems because of its ability to mitigate spectral and channel effects. An appropriate FFT implementation is therefore essential for OFDM applications. The FFT sizes required by the standards of different communication systems are shown in Table 4.1. As the table indicates, a variable-length FFT processor is an essential component in providing a single solution for these communication systems.

Table 4.1 Various Communication Systems Vs Size of FFT

S. No.   System   Size of FFT
1        ADSL     512
2        DAB      2048, 1024, 512, 256
3        VDSL     8192, 4096, 2048, 1024, 512
4        DVB      8192, 2048

Hardware implementation of an N-point FFT algorithm is intensive in both arithmetic operations and data swapping. In general, O(log N) arithmetic operations are needed per clock cycle to process the FFT at the sample rate. High-speed FFT processing can be obtained in two ways. In a general-purpose processor, a single processor performs the computation at a clock frequency greater than the sample frequency of the data. In a special-purpose processor, the computation is performed at the sample frequency itself, which consumes less power. In this chapter, various optimization methods are applied to the architecture to obtain a power-efficient and area-efficient FFT processor.

4.2 ARCHITECTURE OF PIPELINED FFT

4.2.1 Architecture of Radix-2 FFT

The Radix-2 multiple delay commutator (R2MDC) is a pipelined architecture for the Radix-2 FFT, as shown in Figure 4.1. A commutator splits the input data sequence into two streams. The twiddle-factor multiplication and the butterfly operation are then executed with appropriate scheduling of both data streams.
Hence this architecture needs $\log_2 N$ Radix-2 butterfly elements, $\log_2 N - 2$ multipliers and $3N/2 - 2$ delay elements. The Radix-2 single delay feedback (R2SDF) style is shown in Figure 4.2. It uses the delay elements efficiently by sharing the same storage between the inputs and the outputs of the butterfly. The multiplier at each stage handles a single data stream. This scheme has the same number of butterfly elements and multipliers as the Radix-2 multiple delay commutator, but requires only $N - 1$ delay elements. The multipliers and butterfly elements are utilized at a level of 50 % because they are bypassed during half of the time.

Fig. 4.1 Radix-2 Multiple Delay Commutator

Fig. 4.2 Radix-2 Single Delay Feedback

4.2.2 Architecture of Radix-2/4/8 FFT

The Discrete Fourier Transform of an N-point data stream is defined by

$$X(z) = \sum_{n=0}^{N-1} x(n)\, W_N^{nz}, \qquad z = 0, 1, 2, \ldots, N-1 \tag{4.1}$$

where $W_N^{nz} = e^{-j 2\pi nz / N}$. The basic idea of the Radix-2 FFT is to exploit the symmetry between the twiddle factors $W_N^{nz}$ and $W_N^{nz+N/2}$, namely $W_N^{nz+N/2} = -W_N^{nz}$, for simplification. Multiplication by the twiddle factors $W_N^{N/8}$, $W_N^{3N/8}$, $W_N^{5N/8}$ and $W_N^{7N/8}$ can also be simplified because their real and imaginary parts have the same magnitude. The multiplications by these twiddle factors are given by equations (4.2) to (4.4).

$$(a + jb)\, W_N^{N/8} = -(a + jb)\, W_N^{5N/8} = \frac{\sqrt{2}}{2}\left[(a + b) + j(b - a)\right] \tag{4.2}$$

$$(a + jb)\, W_N^{3N/8} = \frac{\sqrt{2}}{2}\left[(b - a) - j(a + b)\right] \tag{4.3}$$

$$(a + jb)\, W_N^{7N/8} = -(a + jb)\, W_N^{3N/8} = \frac{\sqrt{2}}{2}\left[(a - b) + j(a + b)\right] \tag{4.4}$$

Fig. 4.3 Signal flow graph of Radix-2/4/8 FFT

The signal flow graph of the Radix-2/4/8 FFT is shown in Figure 4.3 [5]. This algorithm implements the Radix-8 butterfly as three Radix-2 stages instead of a single butterfly element. Hence its signal flow graph is the same as that of the Radix-$2^3$ FFT [3]. A Radix-2/4/8 architecture can be obtained by suitably modifying the Radix-2 single delay feedback architecture.
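As a numerical check of equations (4.2) to (4.4), the following Python sketch multiplies a complex sample by the four twiddle factors using only additions, subtractions and two real multiplications by $\sqrt{2}/2$, and can be compared against a direct complex multiplication. The function name `mul_eighth_twiddle` and the floating-point arithmetic are illustrative assumptions; the chip itself works in fixed point.

```python
import cmath
import math

SQRT2_OVER_2 = math.sqrt(2.0) / 2.0  # the only multiplier constant needed

def mul_eighth_twiddle(x: complex, k: int) -> complex:
    """Return x * W_N^(k*N/8) for k in {1, 3, 5, 7} (hypothetical helper).

    Uses the identities of equations (4.2) to (4.4): one add, one subtract
    and two real multiplications by sqrt(2)/2, plus a sign change.
    """
    a, b = x.real, x.imag
    if k in (1, 5):
        re, im = a + b, b - a          # (a+jb) * W^(N/8), before scaling
    elif k in (3, 7):
        re, im = b - a, -(a + b)       # (a+jb) * W^(3N/8), before scaling
    else:
        raise ValueError("k must be 1, 3, 5 or 7")
    re *= SQRT2_OVER_2
    im *= SQRT2_OVER_2
    if k in (5, 7):                    # W^(5N/8) = -W^(N/8), W^(7N/8) = -W^(3N/8)
        re, im = -re, -im
    return complex(re, im)

if __name__ == "__main__":
    sample = complex(3.0, -2.0)
    for k in (1, 3, 5, 7):
        direct = sample * cmath.exp(-2j * math.pi * k / 8)
        print(k, mul_eighth_twiddle(sample, k), direct)
```

In hardware the multiplication by $\sqrt{2}/2$ is a constant multiplication, which is why it can be reduced further to shift-and-add operations.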
The processing elements PE1, PE2 and PE3 process the successive stages of the FFT. The architecture repeats a group made up of these processing elements and a multiplier for the twiddle-factor multiplication. The required number of delay elements reduces by 50 % at each stage. The block diagrams of the processing elements are shown in Figure 4.4.

Fig. 4.4 Processing Elements in Radix-2/4/8 FFT

4.2.3 Architecture of Proposed FFT of Variable Length

To decrease the power consumption and area of the chip, it is important to select an FFT algorithm with low computational complexity and an architecture with low hardware complexity. The block diagram of the variable-length FFT based on the Radix-2/4/8 SDF architecture is shown in Figure 4.5. This processor can perform FFT operations of various lengths. To accommodate different numbers of FFT stages, the first two stages are Radix-2 processing elements with a structure similar to that of the processing element PE3. Each block is composed of a group of processing elements PE1, PE2 and PE3 and a multiplier, as shown in Figure 4.5. For instance, if a 512-point FFT is to be performed, the input signals bypass the first two stages through the multiplexer MUX2. If a 128-point FFT is to be executed, the first stage is skipped by means of the multiplexer MUX1.

Fig. 4.5 Block diagram of FFT processor of variable length

4.3 CONSIDERATIONS OF ARCHITECTURE

4.3.1 Architectures of FFT - A Comparison

In the past decade, various FFT architectures have been suggested with the objective of implementing the FFT at high speed. The important features of the Radix-2/4/8 FFT and various other architectures are listed in Table 4.2 [2]. The table compares their memory requirements and computational complexity [3]. It is evident that the number of multiplications reduces as the radix increases.
Further, higher-radix algorithms make better use of the multiplier hardware as far as bit-parallel operations are concerned.

Table 4.2 Comparison of various architectures of FFT

                                      Bit-Parallel Architectures         Digit-Serial Architectures
Parameters                            Radix-2/4/8        Radix-4         Radix-2/4/8        Radix-4
                                      (proposed chip)
Type of style                         Feedback           Feedback        Feed forward       Feed forward
Complex adders                        2 log2 N           4 log4 N        12 log4 N          8 log4 (N+1)
Utilization of complex adders         50 %               50 %            100 %              100 %
Complex multipliers                   log8 N - 1         log4 N - 1      3 (log4 N - 1)     3 (log4 N - 1)
Utilization of complex multipliers    87.5 %             75 %            100 %              100 %
Size of memory                        N - 1              N - 1           2.5 N              1.18 N
ROM for twiddle factors               0.25 N             N               N                  0.5 N

In digit-serial architectures, the data word length of the adders and multipliers is reduced to one digit, so that each adder and multiplier is small. The word length of both architectures should match the throughput of the Radix-4 commutator to obtain full utilization of the adders and multipliers. Because a complex multiplier occupies far more area than a complex adder, using fewer multipliers yields a considerable saving in hardware. The feedback architecture requires fewer memory elements, whereas the feed-forward architecture needs a large number of them. A large number of ROMs is not needed if the number of multiplications is reduced appropriately. The ROM of the first multiplier stores the twiddle values with an angular spacing of $2\pi/N$, and the spacing increases in the later stages. If the symmetry of the sine and cosine functions is exploited efficiently, the ROM size can be reduced appropriately. In the suggested architecture, the ROMs store only 1/8 of a cycle of the sine and cosine waveforms, and the symmetry of the twiddle factors is exploited within each group, resulting in a smaller ROM table.
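The 1/8-cycle storage described above can be illustrated with a short Python sketch. It stores cosine and sine values only for angles from 0 to $\pi/4$ and rebuilds any twiddle factor $W_N^k$ by swapping and negating table entries according to the octant of the angle. The function names `build_octant_rom` and `twiddle_from_rom`, the floating-point table and the inclusive end point at $\pi/4$ are assumptions made for illustration; the addressing scheme of the actual ROM in Figure 4.9 is not reproduced here.

```python
import cmath
import math

def build_octant_rom(n: int):
    """Store (cos, sin) for angles 2*pi*i/n, i = 0 .. n/8 (one eighth of a cycle)."""
    if n % 8 != 0:
        raise ValueError("n must be a multiple of 8")
    m = n // 8
    return [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(m + 1)]

def twiddle_from_rom(rom, n: int, k: int) -> complex:
    """Rebuild W_n^k = cos(t) - j*sin(t), t = 2*pi*k/n, from the 1/8-cycle table."""
    m = n // 8
    octant, r = divmod(k % n, m)
    # Even octants read the table forwards, odd octants read it mirrored.
    c, s = rom[r] if octant % 2 == 0 else rom[m - r]
    # (cos t, sin t) expressed through the stored pair for each octant.
    cos_t, sin_t = (
        (c, s),     # 0:        t in [0, pi/4)
        (s, c),     # 1:        t in [pi/4, pi/2)
        (-s, c),    # 2:        t in [pi/2, 3pi/4)
        (-c, s),    # 3:        t in [3pi/4, pi)
        (-c, -s),   # 4:        t in [pi, 5pi/4)
        (-s, -c),   # 5:        t in [5pi/4, 3pi/2)
        (s, -c),    # 6:        t in [3pi/2, 7pi/4)
        (c, -s),    # 7:        t in [7pi/4, 2pi)
    )[octant]
    return complex(cos_t, -sin_t)

if __name__ == "__main__":
    n = 128
    rom = build_octant_rom(n)
    worst = max(abs(twiddle_from_rom(rom, n, k) - cmath.exp(-2j * math.pi * k / n))
                for k in range(n))
    print("table entries:", len(rom), "max error:", worst)
```

For a 128-point transform this sketch keeps 17 table entries instead of 128, at the cost of a small amount of address and sign logic.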
Hence this architecture can be employed to implement a variable-length FFT processor with low hardware complexity. It provides a trade-off between hardware complexity and performance when compared to other architectures.

4.3.2 CORDIC Vs Complex Multiplier

Since the Coordinate Rotation Digital Computer (CORDIC) algorithm is efficient at vector rotation, it can be employed for multiplying by the twiddle factors in FFT processors [4]. This sub-section evaluates and compares the performance of a complex multiplier and the CORDIC algorithm. The precision is set to 16 bits in both cases, with 19 bits allocated in the data path of the CORDIC design to limit the propagation of rounding errors. In the CORDIC algorithm, a ROM table of N/4 words, 16 bits per word, generally stores the rotation sequences. Each stage employs two 19-bit adders for micro-rotation, so the architecture requires 2 × 16 × 19 full adders for 16 stages of micro-rotation. The scaling factor of 0.100110110111010 (binary) calls for additional multiplications in the final stage, which require a further 2 × 9 × 19 full adders. Without pipelining, the critical-path delay is 19 times the delay of a full adder in every stage of micro-rotation. The size of the ROM table is decreased to 8 words (1 word = 23 bits) owing to the use of a higher radix in the subsequent stages. The outputs of CORDIC are represented in the arithmetic format and are converted to binary format by carry-lookahead adders once the butterfly operation has been performed [7].

In the suggested chip, the complex multiplication involves five additions and three multiplications. The additions are performed by carry-select adders with a maximum delay of about 8 $T_{FA}$, where $T_{FA}$ is the delay of one full adder. The carry-select adders employ 30 full adders for the initial 16-bit addition and 63 full adders for the final 33-bit addition. The critical-path delay is decreased to 7 $T_{FA}$ by using Wallace tree multipliers.
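A complex product with three real multiplications and five real additions, as counted above, can be sketched as follows. The text does not state which decomposition the chip uses, so the grouping below is one common three-multiplication form chosen as an assumption, and the function name `complex_mul_3m` is hypothetical.

```python
def complex_mul_3m(a, b, c, d):
    """Return (re, im) of (a + jb) * (c + jd) using 3 multiplications, 5 additions.

    One common three-multiplication form (an assumption, not necessarily
    the decomposition used on the chip):
        k1 = c * (a + b)
        k2 = a * (d - c)
        k3 = b * (c + d)
        re = k1 - k3 = a*c - b*d
        im = k1 + k2 = a*d + b*c
    """
    k1 = c * (a + b)   # addition 1, multiplication 1
    k2 = a * (d - c)   # addition 2, multiplication 2
    k3 = b * (c + d)   # addition 3, multiplication 3
    return k1 - k3, k1 + k2   # additions 4 and 5

if __name__ == "__main__":
    # (1 + 2j) * (3 + 4j) = -5 + 10j
    print(complex_mul_3m(1, 2, 3, 4))
```

When (c, d) is a twiddle factor read from ROM, the terms d - c and c + d depend only on the twiddle factor, so a design could store them instead of computing them; whether the chip does so is not stated in the text.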
A Wallace tree multiplier requires nearly 280 full adders, and two pipeline stages are included, one at the start and one at the end of the multiplication. Without pipelining the CORDIC algorithm is slower, whereas Wallace tree multiplication shortens the critical path of the complex multiplier. Considering the relationship between speed and area, the FFT processor can consume less power if a complex multiplier is employed for the twiddle-factor multiplication.

4.4 CIRCUIT DESIGN

The variable-length FFT processor should be designed for low power consumption if it is to serve as an essential component in OFDM systems.

4.4.1 Minimization of Word-length

In the architecture of this FFT processor, the word lengths of the different signals are reduced according to their signal-to-noise ratio (SNR) requirements. To determine the optimum word length, input signals with Gaussian noise are fed into the FFT in fixed-point arithmetic. The output signals in the frequency domain are collected and the SNR is calculated. Figure 4.6 (a) shows the output SNR with respect to the input word length of the FFT under various input SNR conditions. From these results the input word length is determined to be 9 bits. Figure 4.6 (b) shows the output SNR with respect to the word length of the twiddle factors, which sets the accuracy of the sine and cosine tables, when the SNR of the input signal is 30 dB. A 9-bit word length is therefore selected for the twiddle factors. The word-length reduction proceeds component by component until all word lengths inside the processor are determined. To limit power consumption, true-single-phase-clock (TSPC) flip-flops are employed in the ring counters.

Fig. 4.6 (a) Output SNR with respect to input word length of FFT

Fig. 4.6 (b) Output SNR with respect to word length of twiddle factors

4.4.2 Delay Line based on RAM

An FFT processor based on the single delay feedback (SDF) architecture requires a large number of delay lines, which are conventionally built from shift registers, as shown in Figure 4.7 (a). The data move forward at every clock edge and 50 % of the registers change state, wasting a large amount of power. SRAM is employed in place of the shift registers to reduce power consumption and area. A dual-port memory would be needed to perform both memory access operations in a single clock cycle. Two single-port SRAMs are implemented here, since a single-port memory achieves an area saving of 33 % compared with a dual-port memory. The use of a single-port SRAM is illustrated in Figure 4.7 (b). The read operation is performed in the first half of the clock cycle and the write operation in the second half. Two registers are inserted, one before and one after the processing element, so that SRAM data access does not become the critical path.

Fig. 4.7 (a) Delay line based on shift-register (b) Delay line based on SRAM

4.4.3 Current-mode Technique of SRAM

The current-mode technique can be employed for reading and writing the contents of an SRAM cell in order to decrease power consumption [8]. In this technique, the voltage levels of the bit lines and data lines of the SRAM are kept very small, which reduces the dynamic power dissipation significantly. The current-mode SRAM cell comprises seven transistors, one more than the traditional six-transistor cell, as shown in Figure 4.8 (a). This additional transistor, Meq, equalizes the output voltages of the two inverters, so that a small difference in current can be detected through the access transistors activated by the word enable signal.

Fig. 4.8 (a) 7T SRAM cell based on current-mode (b) SRAM write circuit

When the additional transistor is in the off state, the SRAM cell functions as a traditional six-transistor memory cell. During a write access, the difference in current $\Delta I$ is available on the write data lines 'wdp' and 'wdn'. The N-type current conveyor shown in Figure 4.8 (b) is activated by the signal 'WY'. The bit lines 'blp' and 'bln' then carry the currents without any attenuation. Since the control signal 'WY' is activated, the circuit is closed between 'wdp' and 'wdn'. The voltages on the data lines 'wdp' and 'wdn' are equal to $V_{DD} - (V_1 + V_2)$. Hence the voltage levels on these data lines are kept very low. The read operation in this SRAM is performed by a sense amplifier and a column decoder.

4.4.4 Twiddle-factor ROM and Complex Multiplications

In this FFT processor, each multiplication by the twiddle factors $W_N^{N/8}$, $W_N^{3N/8}$, $W_N^{5N/8}$ and $W_N^{7N/8}$ is reduced to two real multiplications by the constant $\sqrt{2}/2$, which can be reduced further to shift-and-add operations. The block diagram of the twiddle-factor ROM is shown in Figure 4.9.

4.5 RESULTS AND DISCUSSION

The complete architecture, excluding the SRAM components, is described in a hardware description language (VHDL). The SRAM layout includes appropriately designed ring counters, control units and SRAM cells. The FFT processor is implemented in a 0.35 µm technology. The die photo of the chip is shown in Figure 4.10. The multipliers and the processing elements are marked as 'MUT' and 'Ux' respectively, along with their twiddle-factor ROMs. Taking the circuit overheads into consideration, the longer delay lines are realized by SRAM and the shorter ones by registers. The complete details of the chip are given in Table 4.3. The FFT processor dissipates 176 mW at an operating frequency of up to 17.8 MHz at 2.3 V, and 640 mW at an operating frequency of up to 45 MHz at 3.3 V.
It is difficult to make a fair comparison across different CMOS technologies with respect to FFT size, type of algorithm, supply voltage, power consumption, clock rate and so on [1]. Three parameters are therefore adopted to compare and estimate the statistics, under the assumption that a 128-point FFT is performed. They are defined by

$$\text{Normalized area} = \frac{\text{Area of 128-point FFT}}{\text{Technology}} \tag{4.5}$$

$$\text{FFT/Energy} = \frac{\text{Technology}}{\text{Power of 128-point FFT} \times \text{Execution time} \times 10^{-6}} \tag{4.6}$$

$$\text{Energy} \times \text{Time} = \frac{\text{Execution time}}{\text{FFT/Energy}} \tag{4.7}$$

Fig. 4.9 Block diagram of Twiddle-factor ROM

Fig. 4.10 Die photo of suggested FFT processor

Table 4.3 Complete Summary of Chip

Process                                TSMC 0.35 µm 1P4M
Area                                   3.9 mm × 5.5 mm
Transistor count                       598,078
Maximum frequency                      45 MHz at 3.3 V
Power consumption (at high speed)      640 mW (at 45 MHz, 3.3 V)
Power consumption (at low voltage)     176 mW (at 17.8 MHz, 2.3 V)
Package                                68 PGA

It is inferred from Table 4.3 that the suggested chip achieves a reduction in both area and the energy-time product given by equation (4.7). The FFT processor executes more slowly at 1.1 V, and this voltage rules out high-speed applications in spite of its good energy efficiency.

4.6 SUMMARY

This chapter has discussed the architecture of an FFT processor suitable for various communication systems, such as digital transmission of audio or video signals and digital subscriber loops, which computes complex FFTs of variable length of 128 points or more. The suggested FFT processor achieves high hardware utilization and low power consumption. The single delay feedback (SDF) FFT architecture needs the least number of delay elements, and the Radix-2 FFT employs shift-and-add operations in place of complex multipliers. The processor was realized in a 0.35 µm technology. The experimental results indicate that the processor consumes 640 mW at 3.3 V at up to 45 MHz, but only 176 mW at 17.8 MHz when the voltage is reduced to 2.3 V.