VLSI Implementation of the binDCT

Sachin Gangaputra
Department of Electrical and Computer Engineering
The Johns Hopkins University
Baltimore, MD 21218

Abstract - The Discrete Cosine Transform (DCT) is the most widely used transform for image compression. The binDCT [1] has been shown to be a promising alternative to the DCT because of its implementation simplicity, its performance close to that of the DCT, and its compatibility with the DCT. This paper describes a hardware realization of the 2-D binDCT. The chip is to be implemented in the 1.2 µm MOSIS process.

I. INTRODUCTION

The DCT is of significant interest because of its use in image data compression. In particular, it has been widely recognized as the most effective technique among the various transform coding methods, and various DCT implementations have been built. A binary cosine transform, denoted binDCT [1], has been shown to be functionally compatible with the DCT and to perform as well as the DCT. The binDCT requires a modest amount of computation: 14 shifts and 31 additions per 8 input samples. The binDCT has been tested in software, and its speed and accuracy have been benchmarked against those of the DCT.

This paper describes a VLSI implementation of the binDCT to evaluate its hardware performance against other DCT implementations in terms of speed, power requirements and chip size. The design aims to minimize chip size, and distributed arithmetic is therefore employed in computing the DCT coefficients. The DCT is a separable transform, so the 2D DCT is computed by first performing the 1D DCT on the rows and then on the resulting columns. This technique, although it requires storage space for intermediate coefficients, provides a simpler and more efficient design.

Fig. 1. The forward binary DCT

The rest of the paper is organized as follows. Section II presents the architecture of the proposed chip. Section III describes the implementation of the individual blocks that build up the chip.
Section IV provides concluding remarks on this work.

II. ARCHITECTURE

The binDCT chip can be divided into the following functional blocks: the 1D forward binDCT, the transposition memory and the control circuitry. The 1D forward DCT is implemented using distributed arithmetic in a parallel manner. With distributed arithmetic, the lowest bit of all inputs is processed first, which produces the lowest bit of the output. Successively higher bits are then processed to generate the output bit stream. The architecture of the 1D DCT is similar to the flow diagram in Fig. 1.

Fig. 2. The binDCT chip architecture.

The 2D transform is implemented by first performing a 1D transform on all the rows, followed by a 1D transform on the resulting columns. This requires an on-chip memory to store the intermediate coefficients. The control circuitry performs the timing and routing of the bits to and from the memory. The input is assumed to be in a bit-parallel, word-serial format. The sequence of operations is as follows:

1. The first 8 words are converted from the bit-parallel, word-serial format to the word-parallel, bit-serial format (required for the distributed arithmetic) using serial-to-parallel converters.
2. The bit streams are fed through a shift register to the 1D forward DCT.
3. The output of the 1D DCT is converted back to a word-parallel format and written to the transposition memory.
4. The next 8 words are then read in, and this is repeated until all 64 input samples have been read. The 8 columns are then read from the memory in turn and converted back to the bit-serial, word-parallel format.
5. The columns are again passed through the 1D DCT. The resulting bit streams are converted to the word-serial, bit-parallel format and written onto the output bus.

The forward binDCT design incurs a delay of 19 before the least significant bit of the output is generated. Assuming that the memory read/write operations are performed in parallel with the processing of the other DCT coefficients, 480 clock cycles are needed to generate all 64 coefficients.

III. IMPLEMENTATION

This section describes the implementation of the chip. It outlines the design and simulation of the building blocks that make up the binDCT chip.

A] LATCH
The latch provides a delay of a single sample. The design is a master-slave flip-flop requiring two non-overlapping clocks.

B] FULL ADDER
The full adder serves as the basic block for all arithmetic computations.

C] 1 BIT ADDER & SUBTRACTOR
Because distributed arithmetic is used to compute the coefficients, the carry produced by one bit must be accounted for when the next bit enters. A 1-bit adder is therefore implemented by latching the carry-out of the full adder back to its carry-in. Subtraction is performed by 2's complement addition: the second input to the full adder is inverted and the initial carry is set to 1 by the corresponding latch.

Fig. 3. Simulation of 1 bit adder/subtracter

D] IMPLEMENTING FACTORS
Distributed arithmetic is also beneficial when implementing the rational factors of the transform. The factor 3/8 can be implemented as 3/8 = 1/4 + 1/8, where 1/4 implies a delay by 2 and 1/8 implies a delay by 3.

Fig. 4. Implementing factor 3/8

Similarly, 7/8 is implemented as 1 - 1/8 and 11/16 is implemented as 1/16 + 1/8 + 1/2. The 7/8 and 3/8 implementations have a delay of 3 samples with respect to the input, and the 11/16 implementation has a delay of 4 samples with respect to the input.

Fig. 5. Implementing the factors

Other factors can be derived from these basic factors by adding corresponding delays. For example, 7/16 can be implemented using the 7/8 block followed by another delay. To keep the delays equal on all parallel lines, corresponding delays have to be added to the other paths.

Fig. 6. 3/8 simulation. In = 16, Out = 6 (after a delay of 3 samples)
Fig. 7. 7/8 simulation. In = 16, Out = 14 (after a delay of 3 samples)
Fig. 8. 11/16 simulation. In = 16, Out = 11 (after a delay of 4)

All the building blocks are then put together to form the binDCT. The forward transform uses a total of 3604 transistors and the layout occupies an area of 2400 x 1400. Because the factors introduce delays, each output branch has a different delay. The lowest branch has the maximum delay of 19. The delays are attributed to the larger denominators of the factors. The simulation of the binDCT implementation matches the expected results.

Fig. 9. BinDCT simulation: outputs to be read after corresponding delays.

E] TRANSPOSITIONAL MEMORY
An SRAM is used as the on-chip memory for the 64 coefficients. The initial 8 bits expand to 11 bits, so the total size of the on-chip memory is 64 x 11 bits. The standard 6-transistor design is used for the single cell. The memory is stacked into 4 columns of 16 words each to use the on-chip real estate efficiently.

Fig. 10. Total SRAM: 1375 x 728

IV. CONCLUSION

The forward 1D binDCT has been implemented in hardware and simulations have been performed. Distributed arithmetic has been used to implement the forward transform and has been shown to be convenient for implementing the rational factors. The design minimizes size at the expense of speed. All the sub-blocks of the chip have been laid out and have working simulations. The results exactly match the corresponding software results. The SRAM has been simulated and works. Further work remains on the control and routing of the data from the input to the DCT, from the DCT to the memory, from the memory back to the DCT and then to the output. The specifications then have to be evaluated and compared with other existing DCT designs.

V. REFERENCES

[1] T. Tran, "The binDCT: fast multiplierless approximation of the DCT," IEEE Signal Processing Letters, vol. 7, no. 6, pp. 141-145, Jun. 2000.
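
The bit-serial adder/subtractor of Section III-C and the shift-and-add factors of Section III-D can be illustrated with a small software model. The Python sketch below is not the chip's netlist: the function names (`to_bits`, `from_bits`, `serial_add`, `scale` and the `times_*` helpers) and the choice of an 11-bit word are assumptions made for illustration. Scaling by a power of two is modeled as dropping low-order bits with sign extension, rather than as the latch chains and path delays of the actual layout. Under those assumptions it reproduces the input/output pairs quoted for Figs. 6-8.

```python
WIDTH = 11  # assumed word length for this sketch (coefficients grow to 11 bits)

def to_bits(value, width=WIDTH):
    """Two's complement bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    """Interpret an LSB-first bit list as a signed two's complement integer."""
    raw = sum(bit << i for i, bit in enumerate(bits))
    return raw - (1 << len(bits)) if bits[-1] else raw

def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum, carry_out)."""
    total = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return total, carry_out

def serial_add(a_bits, b_bits, subtract=False):
    """Bit-serial a + b (or a - b), one bit per step, LSB first.

    The variable `carry` plays the role of the latch that feeds carry-out
    back to carry-in. For subtraction the second operand is inverted and
    the latch is preset to 1 (two's complement addition).
    """
    carry = 1 if subtract else 0
    out = []
    for a, b in zip(a_bits, b_bits):
        if subtract:
            b ^= 1
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out

def scale(bits, k):
    """Multiply by 2**-k: drop the k lowest bits and sign-extend (truncates)."""
    return bits[k:] + [bits[-1]] * k

def times_3_8(bits):
    """3/8 = 1/4 + 1/8."""
    return serial_add(scale(bits, 2), scale(bits, 3))

def times_7_8(bits):
    """7/8 = 1 - 1/8."""
    return serial_add(bits, scale(bits, 3), subtract=True)

def times_11_16(bits):
    """11/16 = 1/2 + 1/8 + 1/16."""
    partial = serial_add(scale(bits, 1), scale(bits, 3))
    return serial_add(partial, scale(bits, 4))

if __name__ == "__main__":
    x = to_bits(16)
    for name, fn in [("3/8", times_3_8), ("7/8", times_7_8), ("11/16", times_11_16)]:
        print(name, "of 16 ->", from_bits(fn(x)))
```

For an input of 16 the three helpers return 6, 14 and 11, matching the simulation captions. Because each power-of-two term is truncated before the addition, inputs that are not multiples of 16 can differ slightly from the exact rational product.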