VLSI Implementation of the binDCT
Sachin Gangaputra
Department of Electrical and Computer Engineering
The Johns Hopkins University
Baltimore, MD 21218
Abstract - The Discrete Cosine Transform (DCT) is the most widely
used transform for image compression. The binDCT [1] has been shown
to be a promising alternative to the DCT because of its simple
implementation, comparable performance, and compatibility with the
DCT. This paper describes a hardware realization of the 2-D binDCT.
The chip is to be implemented in the 1.2 µm MOSIS process.
I. INTRODUCTION
The DCT is of significant interest because of its use in image data
compression. In particular, it has been widely recognized as the
most effective technique among the various transform coding methods,
and various DCT implementations have been developed. A binary cosine
transform, denoted binDCT [1], has been shown to be functionally
compatible with the DCT and to perform as well as the DCT. The
binDCT requires a modest amount of computation: 14 shifts and 31
additions per 8 input samples.
The binDCT has been tested in software, and its speed and accuracy
have been benchmarked against those of the DCT. This paper describes
a VLSI implementation of the binDCT in order to evaluate its
hardware performance relative to other DCT implementations in terms
of speed, power requirements and chip size.
The design considered here aims to minimize chip size, and
distributed arithmetic is therefore employed to compute the DCT
coefficients. The DCT is a separable transform, so the 2D DCT is
performed by first applying the 1D DCT to the rows and then to the
resulting columns. This technique, although it requires storage
space for the intermediate coefficients, yields a simpler and more
efficient design.
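The row-column decomposition can be sketched in a few lines. This is
only an illustration of separability: the 1D stage below is a
floating-point orthonormal DCT-II used as a stand-in, not the binDCT
lifting structure of Fig. 1, and the names dct_1d, transpose and
transform_2d are hypothetical.

```python
import math

def dct_1d(x):
    """Orthonormal 1-D DCT-II (floating-point stand-in for the 1-D binDCT)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n))
        out.append(scale * acc)
    return out

def transpose(block):
    """Swap rows and columns of a square block (the job of the transposition memory)."""
    return [list(col) for col in zip(*block)]

def transform_2d(block, transform_1d=dct_1d):
    """Row-column 2-D transform: 1-D pass on rows, transpose, 1-D pass again."""
    rows_done = [transform_1d(row) for row in block]                  # pass 1: rows
    cols_done = [transform_1d(col) for col in transpose(rows_done)]   # pass 2: columns
    return transpose(cols_done)                                       # back to [row][col]

# Example: a constant 8x8 block has energy only in the DC coefficient.
flat = [[1.0] * 8 for _ in range(8)]
flat_coeffs = transform_2d(flat)
```

The intermediate result rows_done is what has to be stored between
the two passes, which is why the hardware needs an on-chip memory.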
Fig. 1. The forward binary DCT
The rest of the paper is organized as follows. Section II presents
the architecture of the proposed chip. Section III describes the
implementation of the individual blocks that make up the chip.
Section IV provides concluding remarks on this work.
II. ARCHITECTURE
The binDCT chip can be divided into the following functional blocks:
the 1D forward binDCT, the transposition memory and the control
circuitry.
The 1D forward DCT is implemented using distributed arithmetic in a
parallel manner. Here, distributed arithmetic means that the least
significant bits of all inputs are processed first, producing the
least significant bit of the output; successively higher bits are
then processed to generate the output bit stream. The architecture
of the 1D DCT follows the flow diagram in Fig. 1.
The next 8 words are then read in, and this is repeated until all 64
input samples have been read. The 8 columns are then read from the
memory in turn and converted back to the bit-serial, word-parallel
format. Each column is again passed through the 1D DCT. The
resulting bit streams are converted to the word-serial, bit-parallel
format and written onto the output bus.
The forward binDCT design has a delay of 19 before the least
significant bit of the output is generated. Assuming that the memory
read/write operations are performed in parallel with the processing
of the other DCT coefficients, 480 clock cycles are needed to
generate all 64 coefficients.
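The text does not break the 480-cycle figure down. One accounting
that happens to reproduce it, offered purely as an assumption, is
that each group of 8 words costs one serial word length plus the
pipeline delay, and that there are 8 such groups in each of the two
1D passes. The 11-bit word length is taken from the transposition
memory description; all constant names are hypothetical.

```python
# Hypothetical cycle accounting; the source states only the total of 480.
WORD_BITS = 11        # assumed serial word length (memory words are 11 bits)
PIPELINE_DELAY = 19   # delay before the first output bit, from the text
GROUPS_PER_PASS = 8   # 8 rows (or 8 columns) of 8 words each
PASSES = 2            # row pass followed by column pass

cycles_per_group = WORD_BITS + PIPELINE_DELAY
total_cycles = PASSES * GROUPS_PER_PASS * cycles_per_group
```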
III. IMPLEMENTATION
Fig. 2. The binDCT chip architecture.
The 2D transform is implemented by first performing a 1D transform
on all the rows, followed by a 1D transform on the resulting
columns. This requires an on-chip memory to store the intermediate
coefficients. The control circuitry handles the timing and routing
of the bits to and from the memory. The input is assumed to be in a
bit-parallel, word-serial format.
The sequence of operations is as follows. The first 8 words are
converted from the bit-parallel, word-serial format to the
word-parallel, bit-serial format (this is required for the
distributed arithmetic) using serial-to-parallel converters. The bit
streams are fed through a shift register to the 1D forward DCT. The
output of the 1D DCT is then converted back to a word-parallel
format, which is subsequently written into the transposition memory.
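The format conversion described above amounts to a corner turn on
the bits of each group of 8 words. The sketch below models it in
software under the assumptions that bits are sent least significant
bit first and that words are two's complement; the function names
to_bit_serial and from_bit_serial are hypothetical.

```python
def to_bit_serial(words, width):
    """Bit-parallel, word-serial -> word-parallel, bit-serial.

    Returns `width` slices, LSB first; slice t holds bit t of every word.
    """
    mask = (1 << width) - 1
    return [[((w & mask) >> t) & 1 for w in words] for t in range(width)]

def from_bit_serial(slices, signed=True):
    """Inverse conversion: rebuild one integer per word from LSB-first slices."""
    width = len(slices)
    words = []
    for i in range(len(slices[0])):
        value = sum(slices[t][i] << t for t in range(width))
        if signed and slices[width - 1][i]:
            value -= 1 << width          # interpret the top bit as the sign
        words.append(value)
    return words

samples = [16, -3, 127, 0, 1, -128, 64, 77]
slices = to_bit_serial(samples, 8)       # 8 time steps, 8 parallel lines
restored = from_bit_serial(slices)
```

Each slice corresponds to one clock cycle on the 8 parallel input
lines of the 1D transform.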
This section describes the
implementation of the chip. It outlines
the design and the simulations of the
various building blocks that make up the
binDCT chip.
A] LATCH
The latch provides a delay of a single sample. It is designed as a
master-slave flip-flop requiring two non-overlapping clocks.
B] FULL ADDER
The full adder serves as the basic block
to all our arithmetic computations.
C] 1-BIT ADDER & SUBTRACTOR
Because distributed arithmetic is used to compute the coefficients,
any carry must be accounted for when the next bit enters. A 1-bit
adder is therefore implemented by latching the carry-out of the full
adder back to its carry-in.
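A behavioural sketch of this bit-serial adder is given below. It
assumes LSB-first bit streams of equal length and discards the final
carry (fixed word length); the helper names are hypothetical and the
code models behaviour only, not the transistor-level design.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return s, carry_out

def serial_add(a_bits, b_bits):
    """Add two LSB-first bit streams; the carry is latched between cycles."""
    carry = 0                       # carry latch cleared at the start of a word
    out = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out                      # final carry is dropped (fixed word length)

def to_bits(value, width):
    """Integer -> LSB-first list of bits."""
    return [(value >> t) & 1 for t in range(width)]

def from_bits(bits):
    """LSB-first list of bits -> unsigned integer."""
    return sum(b << t for t, b in enumerate(bits))

total = from_bits(serial_add(to_bits(16, 8), to_bits(6, 8)))
```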
With the decomposition 11/16 = 1/2 + 1/8 + 1/16, the 11/16
implementation has a delay of 4 samples with respect to the input,
while the 7/8 and 3/8 implementations each have a delay of 3
samples.
Fig. 5. Implementing the factors
Subtraction is performed by 2’s
complement addition. The second input
to the full adder is inverted and the
initial carry is set to 1 by the
corresponding latch.
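The subtraction scheme can be modelled the same way: invert the
second operand bit by bit and preset the carry latch to 1. As
before, this is a behavioural sketch assuming LSB-first streams of
equal length, with hypothetical function names.

```python
def serial_sub(a_bits, b_bits):
    """Compute a - b on LSB-first bit streams via two's complement addition."""
    carry = 1                       # carry latch preset to 1 (the "+1" of two's complement)
    out = []
    for a, b in zip(a_bits, b_bits):
        nb = b ^ 1                  # second input inverted
        s = a ^ nb ^ carry
        carry = (a & nb) | (carry & (a ^ nb))
        out.append(s)
    return out

def bits_of(value, width):
    """Integer -> LSB-first list of bits (two's complement for negatives)."""
    return [(value >> t) & 1 for t in range(width)]

def signed_value(bits):
    """LSB-first two's complement bits -> signed integer."""
    value = sum(b << t for t, b in enumerate(bits))
    return value - (1 << len(bits)) if bits[-1] else value

diff_pos = signed_value(serial_sub(bits_of(16, 8), bits_of(2, 8)))
diff_neg = signed_value(serial_sub(bits_of(2, 8), bits_of(16, 8)))
```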
Fig. 3. Simulation of the 1-bit adder/subtracter
Other factors can be derived from these basic factors by adding
corresponding delays. For example, 7/16 can be implemented using the
7/8 factor with one additional delay. To keep the delays equal on
all parallel lines, matching delays have to be added to the other
paths.
D] IMPLEMENTING FACTORS
Using distributed arithmetic is also beneficial when implementing
the rational factors of the transform. The factor 3/8 can be
implemented as 3/8 = 1/4 + 1/8, where 1/4 implies a delay by 2 and
1/8 implies a delay by 3.
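The arithmetic behind these factors can be checked with plain
integer shifts. The sketch below is word-level only and does not
model the bit-serial delays; the function names are hypothetical.
The decompositions used are 3/8 = 1/4 + 1/8, 7/8 = 1 - 1/8 and
11/16 = 1/2 + 1/8 + 1/16. Each right shift truncates, so inputs that
are not multiples of 16 can differ slightly from the exact product.

```python
def times_3_8(x):
    """3/8 * x as x/4 + x/8 (shift right by 2 and by 3)."""
    return (x >> 2) + (x >> 3)

def times_7_8(x):
    """7/8 * x as x - x/8."""
    return x - (x >> 3)

def times_11_16(x):
    """11/16 * x as x/2 + x/8 + x/16."""
    return (x >> 1) + (x >> 3) + (x >> 4)

# Same test input as the simulations in Figs. 6-8 (In=16).
results = (times_3_8(16), times_7_8(16), times_11_16(16))
```

For an input of 16 this gives 6, 14 and 11, which agrees with the
outputs reported in the simulation figures.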
Fig. 6. 3/8 simulation. In=16, Out=6 (after a delay of 3 samples)
Fig. 7. 7/8 simulation. In=16, Out=14 (after a delay of 3 samples)
Fig. 4. Implementing factor 3/8
Similarly, the factor 7/8 is implemented as 1 - 1/8, and the factor 11/16 is implemented as 1/2 + 1/8 + 1/16.
Fig. 8. 11/16 simulation. In=16, Out=11 (after a delay of 4 samples)
All the building blocks are now put together to form the binDCT. The
forward transform uses a total of 3604 transistors, and the layout
occupies an area of 2400x1400. Because the factors introduce delays,
the output branches have different delays. The lowest branch has the
maximum delay of 19. The longer delays are attributed to the factors
with larger denominators. The simulation of the binDCT
implementation matches the expected results.
E] TRANSPOSITIONAL MEMORY
An SRAM is used as the on-chip memory for the 64 coefficients. The
initial 8 bits expand to 11 bits, so the total size of the on-chip
memory is 64x11 bits. The standard 6-transistor design is used for
the single cell. The memory is arranged as 4 columns of 16 words
each to use the on-chip real estate efficiently.
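The role of this memory can be modelled as a small buffer that is
written row by row and read column by column. The sketch below is
behavioural only: it ignores the physical 4-by-16 arrangement, the
class and method names are hypothetical, and the range check assumes
signed two's complement 11-bit words, which the text does not state.

```python
class TranspositionMemory:
    """Behavioural model of a 64-word transposition buffer."""

    def __init__(self, size=8, word_bits=11):
        self.size = size
        self.word_bits = word_bits
        self.cells = [0] * (size * size)

    def write_row(self, row, words):
        """Store the 8 outputs of one row pass."""
        lo = -(1 << (self.word_bits - 1))        # assumed signed word range
        hi = (1 << (self.word_bits - 1)) - 1
        for col, word in enumerate(words):
            if not lo <= word <= hi:
                raise ValueError("word does not fit in the memory word length")
            self.cells[row * self.size + col] = word

    def read_column(self, col):
        """Fetch one column for the second (column) pass."""
        return [self.cells[row * self.size + col] for row in range(self.size)]

    def total_bits(self):
        return self.size * self.size * self.word_bits

mem = TranspositionMemory()
for r in range(8):
    mem.write_row(r, [8 * r + c for c in range(8)])
column_3 = mem.read_column(3)
```

With the default parameters, total_bits() gives 704, that is, the
64x11 bits quoted above.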
Fig. 10. Total SRAM – 1375 x 728
IV. CONCLUSION
Fig.9. BinDCT simulation – outputs to
be read after corresponding delays.
The forward 1D binDCT has been implemented in hardware and
simulations have been performed. Distributed arithmetic has been
used to implement the forward transform and has been shown to be
convenient for implementing the rational factors. The design
minimizes size at the expense of speed. All the subblocks of the
chip have been laid out and have working simulations. The results
exactly match the corresponding software results. The SRAM has been
simulated and works. Further work remains on the control and routing
of the data from the input to the DCT, from the DCT to the memory,
from the memory back to the DCT and then to the output. The
specifications then have to be evaluated and compared with other
existing DCT designs.
V. REFERENCES
[1] T. Tran, "The binDCT: fast multiplierless approximation of the
DCT", IEEE Signal Processing Letters, Vol. 7, No. 6, pp. 141-145,
Jun. 2000.