VLSI DESIGN OF CACHE COMPRESSION IN MICROPROCESSOR
USING BURROWS WHEELER TRANSFORM ALGORITHM
S. Gomathi 1, M. Nisha Angeline 2
1. PG Student, Velalar College of Engineering and Technology
[email protected]
2. Assistant Professor, Department of ECE, Velalar College of Engineering and Technology
[email protected]
Abstract
Microprocessor speeds have been increasing faster than off-chip memory speeds. As multiprocessors are used in system design, more processors require more accesses to memory, which raises a “wall” between processor and memory. Accessing off-chip memory takes an order of magnitude more time than accessing on-chip cache, and two orders of magnitude more time than executing an instruction. The Burrows-Wheeler transform (BWT) has received special attention due to its effectiveness in lossless data compression algorithms. The existing system is based on pattern matching and dictionary matching; if the pattern matches, the dictionary matching is bypassed, so compression is not uniform. The proposed system has a number of novel features tailored for the application: a novel architecture based on a parallel sorting block implements the BWT. Improvements of traditional merge sort and quick sort algorithms have been proposed for software implementations of the BWT. The BWT is followed by run-length encoding in order to achieve lossless data compression. Decompression is based on the inverse BWT algorithm to retrieve the original data with minimum error.
Key words: Burrows-Wheeler transform (BWT), move-to-front transform, run-length encoding (RLE), lossless data compression.
1. INTRODUCTION
Memory latencies have long been a performance bottleneck in modern computers [7]. In fact, with each technology generation, microprocessor execution rates outpace improvements in main memory latencies. Memory latencies thus have an increasing impact on overall processor performance. The rift between processor and memory speeds is alleviated primarily by using caches. Today’s microprocessors have on-chip cache hierarchies incorporating multiple megabytes of storage. Increasing the size of on-chip caches can greatly increase processor performance [7]. However, the amount of on-chip storage cannot be increased without bound. Caches already consume most of the die area in high-performance microprocessors. Practical cache sizes are constrained by the increased fabrication cost of larger dies and are ultimately limited by semiconductor manufacturing technology. Memory bandwidth is also a scarce resource in high-performance systems. Several researchers have used hardware-based compression to increase effective memory size, reduce memory address and bandwidth requirements, and increase effective cache size [7].
This paper addresses the increasingly important issue of controlling off-chip communication in computer systems in order to maintain good performance and energy efficiency. The ongoing move to chip-level multiprocessors (CMPs) is further increasing the problem: more processors require more accesses to memory, but the performance of the processor-memory bus is not keeping pace. Techniques that reduce off-chip communication without degrading performance have the potential to solve this problem. Cache compression is one such technique [7]. Cache compression presents several challenges. First, decompression and compression must be extremely fast: a significant increase in cache hit latency will overwhelm the advantages of a reduced cache miss rate. Second, the hardware should occupy little area compared to the corresponding decrease in the physical size of the cache. Finally, cache compression should not substantially increase total chip power consumption. These requirements prevent the use of high-overhead compression algorithms; a faster and lower-overhead technique is required.
2. EXISTING METHODS
2.1 Frequent Pattern Compression
This compression scheme builds on significance-based compression schemes. It is based on the observation that some data patterns are frequent and compressible to fewer bits [1]. For example, many small-value integers can be stored in 4, 8, or 16 bits, but are normally stored in a full 32-bit word (or 64 bits for 64-bit architectures). These values are frequent enough to merit special treatment, and storing them in a more compact form can increase the cache capacity. In addition, special treatment is also given to runs of zeros, since they are very frequent. The insight behind FPC is to obtain most of the benefits of dictionary-based schemes while keeping the per-line overhead at a minimum. FPC compresses [1] on a cache-line basis. Each cache line is divided into 32-bit words (e.g., 16 words for a 64-byte line), and each word is matched against a small set of frequent patterns. These patterns are: a zero run (one or more all-zero words), a 4-bit sign-extended word (including one-word zero runs), a one-byte sign-extended word, a one-halfword sign-extended word, a halfword padded with a zero halfword, two byte-sign-extended halfwords, and a word consisting of repeated bytes (e.g., “0x20202020”, or similar patterns that can be used for data initialization). These patterns are selected based on their high frequency in many integer and commercial benchmarks. A word that does not match any of these categories is stored in its original 32-bit format.
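To make the pattern classes concrete, the following Python sketch classifies a 32-bit word into one of these FPC-style categories. It is illustrative only: the helper names are ours, and the real FPC additionally assigns each class a 3-bit prefix and merges adjacent zero words into a single zero-run code.

```python
def _sign_extends(value, total_bits, low_bits):
    """True if `value` (total_bits wide) equals the sign extension of its low_bits bits."""
    low = value & ((1 << low_bits) - 1)
    signed = low - (1 << low_bits) if low & (1 << (low_bits - 1)) else low
    return (signed & ((1 << total_bits) - 1)) == value


def classify_fpc_word(word):
    """Classify a 32-bit word into an FPC-like pattern class (sketch)."""
    word &= 0xFFFFFFFF
    hi, lo = word >> 16, word & 0xFFFF
    b = [(word >> s) & 0xFF for s in (24, 16, 8, 0)]

    if word == 0:
        return "zero word (candidate for a zero run)"
    if _sign_extends(word, 32, 4):
        return "4-bit sign-extended"
    if _sign_extends(word, 32, 8):
        return "one-byte sign-extended"
    if _sign_extends(word, 32, 16):
        return "one-halfword sign-extended"
    if lo == 0:
        return "halfword padded with a zero halfword"
    if _sign_extends(hi, 16, 8) and _sign_extends(lo, 16, 8):
        return "two byte-sign-extended halfwords"
    if b[0] == b[1] == b[2] == b[3]:
        return "word of repeated bytes"      # e.g. 0x20202020
    return "uncompressed 32-bit word"
```

For instance, classify_fpc_word(0xFFFFFFFE) reports a 4-bit sign-extended word (the value -2), while classify_fpc_word(0x20202020) reports a word of repeated bytes.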
Frequent pattern compression compresses cache lines at the L2 level by storing common word patterns in a compressed format. Patterns are
differentiated by a 3-bit prefix. Cache lines are
compressed to predetermined sizes that never
exceed their original size to reduce decompression
overhead. The drawback of this method is that
there is no register-transfer-level hardware
implementation [1] or FPGA implementation of
FPC, and therefore its exact performance, power
consumption, and area overheads are unknown.
2.2 Restrictive Compression Technique
With CMOS scaling trends and the slow scaling of wires compared to transistors, cache access latencies [9] will increase in future microprocessors. To prevent the increasing latencies from affecting cache throughput, L1 caches are kept small and their accesses are pipelined. Small L1 data caches, however, can result in significant performance degradation due to increased miss rates. Compression techniques can be used to increase the L1 data cache capacity. However, to avoid any increase in cache access latency, these compression techniques must not alter the byte-offset of a memory reference. Restrictive cache compression techniques do not require updates to the byte-offset, and hence have minimal, if any, impact on cache access latency. The basic technique, AWN [9], compresses a cache block only if all the words in the cache block are of small size. The compressed cache blocks are then packed together in a single physical cache block. The AWN technique requires minimal additional storage in the cache and results in a 20% increase in the cache capacity [9].
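As a rough software illustration of the AWN idea, the sketch below compresses a cache block only when every word is narrow. The narrow-width threshold and the function names are our assumptions; they are not taken from [9].

```python
NARROW_BITS = 16   # assumed narrow-word width; the actual AWN width may differ


def fits_in_narrow(word, bits=NARROW_BITS):
    """True if the 32-bit word, read as a signed value, fits in `bits` bits."""
    signed = word - (1 << 32) if word & (1 << 31) else word
    return -(1 << (bits - 1)) <= signed < (1 << (bits - 1))


def awn_compress_block(words):
    """Return (compressed, payload): compress only if all words are narrow."""
    if all(fits_in_narrow(w) for w in words):
        # Narrow block: keep only the low NARROW_BITS of each word, so two
        # such blocks can be packed into one physical cache block.
        return True, [w & ((1 << NARROW_BITS) - 1) for w in words]
    return False, list(words)   # at least one normal-sized word: store uncompressed
```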
The AWN technique is extended by providing some additional space for the upper half-words (AHS) of a few normal-sized words in a cache block, with the aim of converting them into narrow words. Further, the AHS technique is extended to AAHS so that the number of upper half-words used by a cache block can vary depending on the requirement. The AHS and AAHS techniques increase the cache capacity by about 50%, while incurring a 38% increase in the storage space required compared to a conventional cache. However, these techniques still do not impact the cache access latency. To reduce the additional tag requirements (which are inevitable with any cache compression technique), it is proposed to reduce the number of additional tag bits provided for the additional cache blocks packed in a physical cache block, reducing the overhead of the AHS techniques to about 30%.
The byte-offset of each word in the cache block depends on the size of the words before it, which would require recalculating the byte-offset to read a word from the block [9]. Therefore, it is imperative that any compression technique applied to L1 caches not require updates to the byte-offset. Such compression techniques can be called restrictive compression techniques. The drawback of this technique is that, to avoid any increase in cache access latency, it cannot alter the byte-offset of the memory reference.
2.3 Indirect Index Cache with Compression
The indirect index cache allocates a variable amount of storage to different blocks, depending on their compressibility. The Indirect Index Cache with Compression is based on the IIC [3]. The basic IIC consists of a data array containing the cache blocks and a tag store containing the tags for these blocks. Each IIC tag entry holds a pointer to the data block with which it is currently associated. This indirection provides the ability to implement a fully associative cache.
Replacements in the IIC are managed by a software algorithm running on an embedded controller or as a thread on the main CPU. The primary algorithm, called Generational Replacement (GEN), maintains prioritized pools (queues) of blocks [3]; periodically, referenced blocks are moved to higher-priority pools and unreferenced blocks are moved to lower-priority pools. Replacements are selected from the unreferenced blocks in the lowest-priority pool. To reach the highest-priority pool, a block must be referenced regularly over an extended period of time; once there, it must remain unreferenced for a similarly long period to return to the lowest-priority pool. This algorithm thus combines reference recency and frequency information with a hysteresis effect, while relying only on block reference bits and periodic data structure updates. To ensure that adequate replacement candidates are available to deal with bursts of misses, the algorithm can identify multiple candidates per invocation and maintain a small pool of replacement blocks. These blocks are used by hardware to handle incoming responses, while GEN works in the background to keep this pool at a predetermined level. Figure 1 illustrates the pipelined data cache read with byte-offset adjustment.
The IIC [3] was originally designed to utilize large on-chip caches to achieve better performance than traditional LRU caches. The main drawback of this technique is that one cannot reliably determine whether such architectural schemes are beneficial unless a cache compression algorithm and hardware implementation are designed and evaluated for effective system-wide compression ratio, hardware overheads, and interaction with other portions of the cache compression system.
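The generational replacement policy described above can be sketched in software as a set of priority pools; the pool count and the one-level promotion/demotion rule below are illustrative assumptions rather than the exact algorithm of [3].

```python
from collections import deque


class GenerationalReplacement:
    """Sketch of a GEN-style policy with prioritized pools (illustrative only)."""

    def __init__(self, num_pools=4):
        # pools[0] is the lowest priority, pools[-1] the highest.
        self.pools = [deque() for _ in range(num_pools)]

    def insert(self, block):
        """Newly cached blocks start in the lowest-priority pool."""
        self.pools[0].append(block)

    def periodic_update(self, referenced):
        """Periodically move referenced blocks up one pool and others down one."""
        top = len(self.pools) - 1
        new_pools = [deque() for _ in self.pools]
        for level, pool in enumerate(self.pools):
            for block in pool:
                if block in referenced:
                    new_pools[min(level + 1, top)].append(block)
                else:
                    new_pools[max(level - 1, 0)].append(block)
        self.pools = new_pools

    def pick_victim(self):
        """Select a replacement candidate from the lowest non-empty pool."""
        for pool in self.pools:
            if pool:
                return pool.popleft()
        return None
```

Here `referenced` would be the set of blocks whose reference bits were set since the last update; the real policy in [3] also batches multiple candidates per invocation into a small replacement pool.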
2.4 Selective Compression Technique
Selective compression means that a data block is compressed only if its compression ratio is less than a specific threshold value. This technique can reduce the decompression overhead because the decompression process is required only for the compressed data blocks. In addition, the data expansion problem can be easily solved [4] by compressing only well-compressible blocks. As the threshold value becomes lower, the average compression ratio over all data blocks increases, so the compression benefit may be reduced. However, measuring the tradeoff between decompression time and compression ratio over a range of threshold values shows that the selective compression technique [4] can provide a performance advantage by reducing the impact of decompression overhead without a meaningful loss of compression ratio.
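A minimal sketch of this policy is shown below, assuming the compression ratio is measured as compressed size divided by original size (so lower is better); the function name and the example threshold are ours, not values from [4].

```python
import zlib


def selectively_compress(block: bytes, threshold: float = 0.5):
    """Store `block` compressed only if the achieved ratio beats `threshold`."""
    compressed = zlib.compress(block)          # any lossless compressor could be used here
    ratio = len(compressed) / len(block)
    if ratio < threshold:
        return "compressed", compressed        # must be decompressed on access
    return "uncompressed", block               # stored as-is: no decompression overhead
```

Lowering the threshold leaves more blocks uncompressed, trading away some compression benefit for a lower average decompression overhead.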
The main drawback of this technique is that
one cannot reliably determine whether the
architectural schemes are beneficial without, a
cache compression algorithm and hardware
implementation are designed and evaluated for
effective system-wide compression ratio [4],
hardware overheads, and interaction with other
portions of the cache compression system.
2.5 X-Match Compression Algorithm
X-Match is a dictionary-based compression
algorithm that has been implemented on an FPGA.
It matches 32-bit words using a content addressable
memory that allows partial matching with
dictionary entries and outputs variable-size
encoded data that depends on the type of match. To
improve coding efficiency, it also uses a move-to-front coding strategy [8] and represents smaller indexes with fewer bits. Although appropriate for compressing main memory, such hardware usually has a very large block size (1 KB for MXT and up to 32 KB for X-Match), which is inappropriate for compressing cache lines. It has been shown that for X-Match and two variants of the Lempel-Ziv algorithm, i.e., LZ1 and LZ2, the compression ratio for memory data deteriorates as the block size becomes smaller. For example, when the block size decreases from 1 KB to 256 B, the compression ratios for LZ1 and X-Match [4] increase by 11% and 3%, respectively. It can be inferred that the increase in compression ratio could be even larger when the block size decreases from 256 B to 64 B. In addition, such hardware has performance, area, or power consumption costs that conflict with its use in cache compression [8]. For example, if the MXT hardware were scaled to a 65 nm fabrication process and integrated within a 1 GHz processor, the decompression latency would be 16 processor cycles, about twice the normal L2 cache hit latency.
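The partial-match idea can be illustrated with a small software sketch. The scoring rule, the two-byte match threshold, and the returned tuple format are our assumptions; X-Match's actual code tables and CAM behaviour are more involved.

```python
def xmatch_lookup(word_bytes, dictionary):
    """Find the dictionary entry sharing the most bytes with a 4-byte word."""
    best_index, best_mask, best_hits = None, None, 0
    for index, entry in enumerate(dictionary):          # the CAM searches all entries in parallel
        mask = [a == b for a, b in zip(word_bytes, entry)]
        hits = sum(mask)
        if hits > best_hits:
            best_index, best_mask, best_hits = index, mask, hits
    if best_hits >= 2:                                   # assume a partial match needs >= 2 matching bytes
        literals = bytes(b for b, hit in zip(word_bytes, best_mask) if not hit)
        return "match", best_index, best_mask, literals  # emit index, match type, and unmatched bytes
    return "miss", bytes(word_bytes)                     # full literal; the word then enters the dictionary
```

The move-to-front step mentioned above would then renumber dictionary indexes so that recently matched entries receive the smaller, cheaper-to-encode indexes.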
2.6 C-Pack Compression Algorithm
C-Pack targets on-chip cache compression. It
permits a good compression ratio even when used
on small cache lines [8]. The performance, area, and power consumption overheads are low enough for practical use. This contrasts with schemes such as X-Match, which require complicated hardware to achieve an equivalent effective system-wide compression ratio. Prior work in cache compression does not adequately evaluate the overheads imposed by the assumed cache compression algorithms. C-Pack is twice as fast as the best existing hardware implementations potentially suitable for cache compression. For FPC to match this performance, it would require at least 8x the area of C-Pack [8]. The main drawback of this technique is that compression is not performed in all cases: when the total number of compressed bits exceeds the uncompressed line size, the content of the backup buffer is selected, and the backup buffer holds the actual 64-bit uncompressed word. Compression is performed at a ratio of 2:1, and this ratio cannot be expanded. Frequently used instructions are compressed and stored in a FIFO dictionary [8]; when a particular entry leaves the dictionary, it re-enters the dictionary in uncompressed form and has to be compressed again. Hence the memory is not used efficiently.
2.7 P-Match Compression Algorithm
The algorithm is based on pattern matching
and partial dictionary coding. Its hardware
implementation permits parallel compression of
multiple words without degradation of dictionary
match probability. Although the existing hardware implementation [7] mainly targets online cache compression, it can also be used in other high-performance lossless data compression [7] applications with few or no modifications. The hardware implementation targets compression ratios of 2:1 and 4:1 [7].
The P-Match algorithm achieves compression by two means (a simplified sketch follows this list):
1. It uses compact encodings for all data words (Dictionary Matching Unit).
2. It selects the output according to the priority of the bit patterns (Priority Selection Unit).
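The sketch below is a loose software analogue of these two units: several candidate encodings are generated for a word, and the cheapest available one is selected. The candidate set, bit costs, and priority order are illustrative assumptions, not the published P-Match code tables.

```python
def pmatch_encode_word(word, dictionary):
    """Pick the shortest available encoding for a 32-bit word (illustrative sketch)."""
    candidates = []                                        # (cost_in_bits, description)
    if word == 0:
        candidates.append((3, "zero word"))                # assumed 3-bit prefix only
    if word in dictionary:
        candidates.append((3 + 4, "dictionary index %d" % dictionary.index(word)))
    if word <= 0xFF:
        candidates.append((3 + 8, "one-byte literal"))
    candidates.append((3 + 32, "uncompressed word"))       # always-available fallback
    cost, choice = min(candidates)                         # priority selection: cheapest encoding wins
    return cost, choice
```

For example, with dictionary = [0xDEADBEEF], pmatch_encode_word(0xDEADBEEF, dictionary) selects the 7-bit dictionary encoding over the 35-bit uncompressed fallback.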
3. PROPOSED METHOD
Several data compression algorithms have been developed based on the BWT [5]. By using the BWT in the data compression process it is possible to obtain compression ratios close to the best statistical compressors, i.e., the prediction by partial matching (PPM) [5] family of algorithms. BWT-based compression algorithms have also proven to be more efficient in terms of computational resources and memory use than PPM-type algorithms; however, implementations of compression algorithms based on the BWT, either in hardware or software, still demand large amounts of memory resources and computational power. Hardware implementations of the BWT require a custom-built storage matrix capable of performing shifts and rotations of the input string. The matrix should also allow lexicographical sorting of its contents.
Improvements of traditional merge sort and quick sort algorithms have been proposed for software implementations of the BWT. A suffix list data structure proposed in [5] leads to antisequential and memory-efficient algorithms; the authors also describe a possible architecture for a BWT-based compression system in VLSI.
3.1 Comparison with the P-Match Compression Algorithm
The proposed method provides a lossless compression algorithm with a good compression ratio. We present a novel architecture based on a parallel sorting block to implement the BWT, and run-length encoding is used to encode the data (typically a stream of bytes), which improves the performance of the compression. The decompression performance of the proposed method is also more efficient than that of the P-Match compression algorithm.
3.2 Block Diagram of Proposed System
The proposed method uses a novel architecture based on a parallel sorting algorithm to implement the BWT in order to retrieve the data with minimum error. At first, the 64 input bits are split into 8-bit groups (8 x 8 = 64 bits) and given to the BWT block. The BWT block is based on a parallel sorting strategy that consists of comparison and swapping steps.
The wavesorter can be applied instead of the parallel sorting strategy; it consists of a group of slightly modified bidirectional shift registers that can perform shift-right and shift-left operations to move their current value to the adjacent register on the right or left. These registers are grouped into pairs by a comparator block that swaps them if a ‘less than’ condition is met. The architecture reads the input string, character by character, from memory and stores each character in the wavesorter. Every time a character is read, a shift-right operation is performed, followed by a comparison operation. When all the characters of the input string have been read, a partial sort of the string is held in the registers. To obtain a complete sort, the string is shifted out by changing the shift direction to the left. When the last character exits, the output string is fully sorted. To make a fair comparison of the parallel sorting strategy against the wavesorter strategy in terms of the total number of steps required to sort an array of data bits (8 x 8 = 64 bits), it is necessary to consider the steps required to store the sorted data back to memory. By using the BWT, the input data bits are segregated into eight arrays of 8-bit data, and the sorted data bits are stored in memory.
The output of the BWT is given to run-length encoding. Run-length encoding, like the move-to-front transform (MTF), is an encoding of data (typically a stream of bytes) designed to improve the performance of the entropy encoding stage of compression. The inverse BWT is used to retrieve the original data from the compressed data bits. Thus, decompression is achieved through the inverse BWT and inverse run-length encoding.
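A compact software model of this pipeline is sketched below: the forward BWT via rotation sorting, a simple run-length encoder, and the naive inverse BWT. It only shows what the hardware blocks compute; the sentinel byte, the (symbol, count) run format, and the function names are our choices, not part of the proposed architecture.

```python
def bwt_encode(data: bytes, sentinel: bytes = b"\x00") -> bytes:
    """Forward BWT: sort all rotations and keep the last column.

    Assumes the sentinel byte does not occur in `data`.
    """
    s = data + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return b"".join(rot[-1:] for rot in rotations)


def bwt_decode(last_column: bytes, sentinel: bytes = b"\x00") -> bytes:
    """Naive inverse BWT: rebuild the rotation table by repeated sorting."""
    table = [b""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i:i + 1] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


def rle_encode(data: bytes):
    """Run-length encode as (symbol, run length) pairs."""
    runs = []
    for symbol in data:
        if runs and runs[-1][0] == symbol:
            runs[-1][1] += 1
        else:
            runs.append([symbol, 1])
    return [(symbol, count) for symbol, count in runs]


def rle_decode(runs) -> bytes:
    return bytes(symbol for symbol, count in runs for _ in range(count))


# The 64-bit repeating pattern of Fig. 3.1, written here as ASCII characters.
data = b"0011" * 16
compressed = rle_encode(bwt_encode(data))
assert bwt_decode(rle_decode(compressed)) == data   # lossless round trip
```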
Input in 64 bits
(0011001100110011001100110011001100110011001100110011001100110011)
Split into 8 bits (8 x 8 bit = 64 bit)
Burrows-Wheeler transform
Run-length encoding
Compressed output based on input
Inverse BWT
Inverse run-length encoding
Decompressed output
Fig.3.1 Block Diagram of Proposed Method

Fig.3.2 Parallel sorting with two levels of comparators performed in one iteration.

By using the BWT in the data compression process it is possible to obtain compression ratios close to the best statistical compressors. Improvements of traditional merge sort and quick sort algorithms [5] have been proposed for software implementations of the BWT. For example, in Fig.3.2 the dotted lines indicate that all comparisons and swaps performed on the registers are carried out in parallel. In this way, it is possible to perform parallel comparisons and swaps by following the order in which the registers are compared: first the odd registers and then the even registers. Thus, for an array of n data, the number of steps required for a sorting iteration is n-1. This number of steps can be further reduced [5] by connecting the comparators used with the odd adjacent registers to the comparators used with the even adjacent registers, as shown in Fig.3.2. Then, the total number of steps for sorting [5] the array is at most n/2. This sorting strategy will be referred to as the parallel sorting strategy.
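In software, one iteration of this two-level strategy corresponds to an odd pass followed by an even pass of compare-and-swap operations (odd-even transposition sorting); at most about n/2 such iterations sort an n-element array. The sketch below is a sequential model of what the comparator levels of Fig.3.2 do in parallel.

```python
def parallel_sort_iteration(regs):
    """One iteration: a first-level pass over pairs (0,1),(2,3),... followed by
    a second-level pass over pairs (1,2),(3,4),... (sequential model of Fig.3.2)."""
    for start in (0, 1):
        for i in range(start, len(regs) - 1, 2):
            if regs[i] > regs[i + 1]:
                regs[i], regs[i + 1] = regs[i + 1], regs[i]   # comparator swap
    return regs


def parallel_sort(regs):
    """Repeating the iteration ceil(n/2) times fully sorts the array."""
    for _ in range((len(regs) + 1) // 2):
        parallel_sort_iteration(regs)
    return regs


# Example: sorting one of the eight 8-bit arrays (here, ASCII characters).
print("".join(parallel_sort(list("00110011"))))   # -> 00001111
```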
To make a fair comparison of the parallel sorting strategy against the wavesorter strategy [5] in terms of the total number of steps required to sort an array, it is necessary to consider the steps used to read data from memory and the steps required to store the sorted data back to memory. The proposed approach is based on the same register-array structure used in the wavesorter strategy. With this kind of array, data can be stored by sending a datum to the first register; later, when the second datum is sent to the first register, the value in the first register is shifted to the second register. Thus, for every datum sent to the array to be stored, the values in the registers are shifted to their respective adjacent registers. This process requires n steps [5]. The same number of steps is required to take data out of the array. This approach allows storing a new set of data in the array while the previous set is being sent back to memory.
Suffix sorting might require more than one sorting iteration. If k sorts are required, then the parallel sorting strategy requires (n + n/2)k + n steps to sort an array of n data [5]. Thus, the total number of steps required can be obtained from the following equation:
fps_steps(n, k) = n((3/2)k + 1) ----------------- (1)
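As a quick check of equation (1), the snippet below evaluates it for the 8-element arrays used in the proposed design; n = 8 and k = 1 are example values chosen here, not figures reported in the paper.

```python
def fps_steps(n, k):
    """Total steps of the parallel sorting strategy, equation (1): n((3/2)k + 1)."""
    return n * ((3 / 2) * k + 1)

# One sorting iteration (k = 1) over an 8-element array (n = 8):
# 8 * (1.5 + 1) = 20 steps, matching (n + n/2)*k + n = (8 + 4)*1 + 8 = 20.
print(fps_steps(8, 1))   # 20.0
```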
The parallel strategy leads to a significant reduction of about 30% compared to the wavesorter approach. Furthermore, in additional sorts the necessary number of sorting steps is equal to the number of characters in the biggest group of identical characters [5] divided by 2 (recall that an additional sort is implied when groups of identical adjacent characters appear in the array). This implies that, in practice, it is possible to reduce the number of steps needed to solve the suffix sorting problem by more than 30%.
3.3 Run Length Encoding
The move-to-front transform (or MTF) is an
encoding of data (typically a stream of bytes)
designed to improve the performance of entropy
encoding techniques of compression. When
efficiently implemented, it is fast enough that its
benefits usually justify including it as an extra step
in data compression algorithms.
The main idea is that each symbol in the
data is replaced by its index in the stack of
“recently used symbols”. For example, long
sequences of identical symbols are replaced by as
many zeroes, whereas when a symbol that has not
been used in a long time appears, it is replaced with
a large number. Thus at the end the data is
transformed into a sequence of integers; if the data
exhibits a lot of local correlations, then these
integers tend to be small.
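A minimal move-to-front encoder/decoder pair illustrating this idea is given below; the choice of an initial 0-255 byte ordering for the symbol stack is an assumption of the sketch.

```python
def mtf_encode(data: bytes) -> list:
    """Replace each byte by its index in the list of recently used symbols."""
    stack = list(range(256))              # assumed initial symbol order
    output = []
    for symbol in data:
        index = stack.index(symbol)
        output.append(index)
        stack.pop(index)
        stack.insert(0, symbol)           # move the symbol to the front
    return output


def mtf_decode(indices: list) -> bytes:
    stack = list(range(256))
    output = bytearray()
    for index in indices:
        symbol = stack.pop(index)
        output.append(symbol)
        stack.insert(0, symbol)
    return bytes(output)


# Long runs of identical symbols become runs of zeros, which compress well:
print(mtf_encode(b"aaabbb"))   # [97, 0, 0, 98, 0, 0]
```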
4. RESULTS AND DISCUSSIONS
We have simulated our design using ModelSim version 6.5. Two outputs were simulated: one shows the BWT output, as given in Fig.4.1, while the others are the compressed and decompressed outputs shown in Fig.4.2 and Fig.4.3.
Fig.4.1 Simulation Result for Burrows-Wheeler Transform
Fig.4.2 Simulation Result for Compression
Fig.4.3 Simulation Result of Decompression
5. CONCLUSION
We have presented an architecture that implements the BWT based on a parallel sorting block. The proposed architecture targets compression with a compression ratio of 4:1. This task includes the solution of the suffix sorting problem and the generation of the output string corresponding to the BWT. The proposed architecture can be scaled to more than 128 characters without significant changes to the control and sorting blocks.
A full implementation of a BWT
compression algorithm is under way; this involves
the integration of the proposed architecture with
Move to Front, Run Length Encoding and Entropy
Encoder blocks.
6. SCOPE FOR FURTHER RESEARCH
Future work includes the implementation of a new storing strategy that allows a reduction in the latency associated with sequential access of data. Also, pipeline registers can be inserted between the two levels of comparators to reduce the critical path delay and thus increase the maximum operating frequency.
7. ACKNOWLEDGMENT
The authors acknowledge the contributions of the students and faculty of Velalar College of Engineering and Technology for helping in the design of the compression circuitry and for tool support. The authors also thank the anonymous reviewers for their thoughtful comments and constructive critique, from which this paper greatly benefited.
8. REFERENCES
[1] A. Alameldeen and D. A. Wood, (2004) “Frequent pattern compression: A significance-based compression scheme for L2 caches,” Dept. Comp. Sci., Univ. of Wisconsin-Madison, Tech. Rep. 1500.
[2] A. Deepa, M. Nisha Angeline and C. N. Marimuthu, (2011) “P-Match: A Microprocessor Cache Compression Algorithm,” 2nd International Conference on Intelligent Information Systems and Management (IISM’11), July 14-16, p. 98.
[3] E. G. Hallnor and S. K. Reinhardt, (2004) “A compressed memory hierarchy using an indirect index cache,” in Proc. Workshop on Memory Performance Issues, pp. 9-15.
[4] J.-S. Lee et al., (1999) “Design and evaluation of a selective compressed memory system,” in Proc. Int. Conf. Computer Design, pp. 1.
[5] José Martínez, René Cumplido and Claudia Feregrino, (2005) “An FPGA-based Parallel Sorting Architecture for the Burrows Wheeler Transform,” Proceedings of the 2005 International Conference on Reconfigurable Computing and FPGAs (ReConFig 2005), IEEE.
[6] Ming-Bo Lin, Jang-Feng Lee and Gene Eu Jan, (September 2006) “A Lossless Data Compression and Decompression Algorithm and Its Hardware Architecture,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 9.
[7] M. Nisha Angeline, S. K. Manikandan and S. Valarmathy, (July-August 2012) “VLSI Design of Cache Compression in Microprocessor using Pattern Matching Technique,” IOSR Journal of Electronics and Communication Engineering, vol. 1, issue 6.
[8] P. Pujara and A. Aggarwal, “Restrictive compression techniques to increase level 1 cache capacity,” Proc. Int. Conf. Computer Design, pp. 327-333.
[9] Xi Chen, Lei Yang and Robert P. Dick, (August 2010) “C-Pack: A High-Performance Microprocessor Cache Compression Algorithm,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 8.
9. AUTHOR’S PROFILE
Gomathi S. received the B.E. degree in Electronics and Communication Engineering from Government College of Engineering, Bargur, under Anna University, Chennai, in 2005. She worked as a Lecturer from 2006 to 2012. She has delivered a lecture in an AICTE-sponsored Staff Development Programme and has participated in various national workshops and seminars. She is currently pursuing the Master of Engineering in VLSI Design at Velalar College of Engineering and Technology under Anna University, Chennai. Her areas of research interest are VLSI Signal Processing and VLSI Design Techniques.
Nisha Angeline M. received the B.E. degree in Electronics and Instrumentation Engineering from the Indian Engineering College, Thirunelveli District, in 2003 and the M.E. degree in VLSI Design from Kongu Engineering College, Perundurai, in 2006. She received the first rank in the PG course for her academic performance. She is working as a Professor (Sr. Gr.) in the Department of ECE, Velalar College of Engineering and Technology, Erode. She is currently pursuing a Ph.D. under Anna University in the area of VLSI Design. She has published three papers in international journals and presented five papers in national and international conferences. Her areas of interest are Digital Electronics, VLSI Architecture, and VLSI Signal Processing.