VLSI DESIGN OF CACHE COMPRESSION IN MICROPROCESSOR
USING BURROWS WHEELER TRANSFORM ALGORITHM
S. Gomathi 1, M. Nisha Angeline 2
1. PG Student, Velalar College of Engineering and Technology
[email protected]
2. Assistant Professor, Department of ECE, Velalar College of Engineering and Technology
[email protected]
Abstract
Microprocessor speeds have been increasing faster than off-chip memory speeds. As multiprocessors are used in system design, more processors require more accesses to memory, which raises a “wall” between processor and memory. Accessing off-chip memory takes an order of magnitude more time than accessing on-chip cache, and two orders of magnitude more time than executing an instruction. The Burrows-Wheeler transform (BWT) has received special attention due to its effectiveness in lossless data compression algorithms. The existing system is based on pattern matching and dictionary matching; if the pattern matches, the dictionary matching is bypassed, so compression is not uniform. The proposed system has a number of novel features tailored for the application: a novel architecture based on a parallel sorting block implements the BWT. Improvements of traditional merge sort and quick sort algorithms have been proposed for software implementations of the BWT. The BWT is followed by run-length encoding in order to achieve lossless data compression. Decompression is based on the inverse BWT algorithm to retrieve the original data with minimum error.
Key words: Burrows-Wheeler transform (BWT), move-to-front transform, run-length encoding (RLE), lossless data compression.
1. INTRODUCTION
Memory latencies have long been a performance bottleneck in modern computers [7]. In fact, with each technology generation, microprocessor execution rates outpace improvements in main memory latencies. Memory latencies thus have an increasing impact on overall processor performance. The rift between processor and memory speeds is alleviated primarily by using caches. Today’s microprocessors have on-chip cache hierarchies incorporating multiple megabytes of storage. Increasing the size of on-chip caches can greatly increase processor performance [7]. However, the amount of on-chip storage cannot be increased without bound. Caches already consume most of the die area in high-performance microprocessors. Practical cache sizes are constrained by the increased fabrication cost of larger dies and are ultimately limited by semiconductor manufacturing technology. Memory bandwidth is also a scarce resource in high-performance systems. Several researchers have used hardware-based compression to increase effective memory size, reduce memory address and bandwidth requirements, and increase effective cache size [7].
This paper addresses the increasingly important issue of controlling off-chip communication in computer systems in order to maintain good performance and energy efficiency. The ongoing move to chip-level multiprocessors (CMPs) is further increasing the problem: more processors require more accesses to memory, but the performance of the processor-memory bus is not keeping pace. Techniques that reduce off-chip communication without degrading performance have the potential to solve this problem. Cache compression is one such technique [7]. Cache compression presents several challenges. First, decompression and compression must be extremely fast: a significant increase in cache hit latency will overwhelm the advantages of a reduced cache miss rate. Second, the hardware should occupy little area compared to the corresponding decrease in the physical size of the cache. Finally, cache compression should not substantially increase total chip power consumption. These requirements prevent the use of high-overhead compression algorithms; a faster and lower-overhead technique is required.
2. EXISTING METHODS
2.1 Frequent Pattern Compression
This compression scheme builds on significance-based compression schemes. It is based on the observation that some data patterns are frequent and compressible to fewer bits [1]. For example, many small-value integers can be stored in 4, 8, or 16 bits, but are normally stored in a full 32-bit word (or 64 bits for 64-bit architectures). These values are frequent enough to merit special treatment, and storing them in a more compact form can increase the cache capacity. In addition, special treatment is also given to runs of zeros, since they are very frequent. The insight behind FPC is to obtain most of the benefits of dictionary-based schemes while keeping the per-line overhead at a minimum. FPC compresses [1] on a cache-line basis. Each cache line is divided into 32-bit words (e.g., 16 words for a 64-byte line), and each word is matched against a small set of frequent patterns. These patterns are: a zero run (one or more all-zero words), a 4-bit sign-extended word (including one-word zero runs), a one-byte sign-extended word, a one-halfword sign-extended word, a halfword padded with a zero halfword, two byte-sign-extended halfwords, and a word consisting of repeated bytes (e.g., “0x20202020”, or similar patterns that can be used for data initialization). These patterns are selected based on their high frequency in many integer and commercial benchmarks. A word that does not match any of these categories is stored in its original 32-bit format.
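To make the pattern classes concrete, the following Python sketch classifies a 32-bit word into one of these FPC-style categories. It is illustrative only: the helper names are ours, and the real FPC additionally assigns each class a 3-bit prefix and merges adjacent zero words into a single zero-run code.

```python
def _sign_extends(value, total_bits, low_bits):
    """True if `value` (total_bits wide) equals the sign extension of its low_bits bits."""
    low = value & ((1 << low_bits) - 1)
    signed = low - (1 << low_bits) if low & (1 << (low_bits - 1)) else low
    return (signed & ((1 << total_bits) - 1)) == value


def classify_fpc_word(word):
    """Classify a 32-bit word into an FPC-like pattern class (sketch)."""
    word &= 0xFFFFFFFF
    hi, lo = word >> 16, word & 0xFFFF
    b = [(word >> s) & 0xFF for s in (24, 16, 8, 0)]

    if word == 0:
        return "zero word (candidate for a zero run)"
    if _sign_extends(word, 32, 4):
        return "4-bit sign-extended"
    if _sign_extends(word, 32, 8):
        return "one-byte sign-extended"
    if _sign_extends(word, 32, 16):
        return "one-halfword sign-extended"
    if lo == 0:
        return "halfword padded with a zero halfword"
    if _sign_extends(hi, 16, 8) and _sign_extends(lo, 16, 8):
        return "two byte-sign-extended halfwords"
    if b[0] == b[1] == b[2] == b[3]:
        return "word of repeated bytes"      # e.g. 0x20202020
    return "uncompressed 32-bit word"
```

For instance, classify_fpc_word(0xFFFFFFFE) reports a 4-bit sign-extended word (the value -2), while classify_fpc_word(0x20202020) reports a word of repeated bytes.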
Frequent pattern compression compresses cache lines at the L2 level by storing common word patterns in a compressed format. Patterns are
differentiated by a 3-bit prefix. Cache lines are
compressed to predetermined sizes that never
exceed their original size to reduce decompression
overhead. The drawback of this method is that
there is no register-transfer-level hardware
implementation [1] or FPGA implementation of
FPC, and therefore its exact performance, power
consumption, and area overheads are unknown.
2.2 Restrictive Compression Technique
With CMOS scaling trends and the slow scaling of wires compared to transistors, cache access latencies [9] will increase in future microprocessors. To prevent the increasing latencies from affecting cache throughput, L1 caches are kept small and their accesses are pipelined. Small L1 data caches, however, can result in significant performance degradation due to increased miss rates. Compression techniques can be used to increase the L1 data cache capacity. However, to avoid any increase in cache access latency, these compression techniques must not alter the byte-offset of a memory reference. Restrictive cache compression techniques do not require updates to the byte-offset, and hence have minimal, if any, impact on cache access latency. The basic technique, AWN [9], compresses a cache block only if all the words in the cache block are of small size. The compressed cache blocks are then packed together in a single physical cache block. The AWN technique requires minimal additional storage in the cache and results in a 20% increase in the cache capacity [9].
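As a rough software illustration of the AWN idea, the sketch below compresses a cache block only when every word is narrow. The narrow-width threshold and the function names are our assumptions; they are not taken from [9].

```python
NARROW_BITS = 16   # assumed narrow-word width; the actual AWN width may differ


def fits_in_narrow(word, bits=NARROW_BITS):
    """True if the 32-bit word, read as a signed value, fits in `bits` bits."""
    signed = word - (1 << 32) if word & (1 << 31) else word
    return -(1 << (bits - 1)) <= signed < (1 << (bits - 1))


def awn_compress_block(words):
    """Return (compressed, payload): compress only if all words are narrow."""
    if all(fits_in_narrow(w) for w in words):
        # Narrow block: keep only the low NARROW_BITS of each word, so two
        # such blocks can be packed into one physical cache block.
        return True, [w & ((1 << NARROW_BITS) - 1) for w in words]
    return False, list(words)   # at least one normal-sized word: store uncompressed
```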
The AWN technique is extended by providing some additional space for the upper half-words (AHS) of a few normal-sized words in a cache block, with the aim of converting them into narrow words. Further, the AHS technique is extended to AAHS so that the number of upper half-words used by a cache block can vary depending on the requirement. The AHS and AAHS techniques increase the cache capacity by about 50%, while incurring a 38% increase in the storage space required compared to a conventional cache. However, these techniques still do not impact the cache access latency. To reduce the additional tag requirements (which are inevitable with any cache compression technique), it is proposed to reduce the number of additional tag bits provided for the additional cache blocks packed in a physical cache block, reducing the overhead of the AHS techniques to about 30%.
The byte-offset of each word in the cache block depends on the size of the words before it, which would require recalculating the byte-offset to read a word from the block [9]. Therefore, it is imperative that any compression technique applied to L1 caches not require updates to the byte-offset. Such compression techniques can be called restrictive compression techniques. The drawback of this technique is that, to avoid any increase in cache access latency, it cannot alter the byte-offset of the memory reference.
2.3 Indirect Index Cache with Compression
The indirect index cache allocates a variable amount of storage to different blocks, depending on their compressibility. The Indirect Index Cache with Compression is based on the IIC [3]. The basic IIC consists of a data array containing the cache blocks and a tag store containing the tags for these blocks. Each IIC tag entry holds a pointer to the data block with which it is currently associated. This indirection provides the ability to implement a fully associative cache.
Replacements in the IIC are managed by a software algorithm running on an embedded controller or as a thread on the main CPU. The primary algorithm, called Generational Replacement (GEN), maintains prioritized pools (queues) of blocks [3]; periodically, referenced blocks are moved to higher-priority pools and unreferenced blocks are moved to lower-priority pools. Replacements are selected from the unreferenced blocks in the lowest-priority pool. To reach the highest-priority pool, a block must be referenced regularly over an extended period of time; once there, it must remain unreferenced for a similarly long period to return to the lowest-priority pool. This algorithm thus combines reference recency and frequency information with a hysteresis effect, while relying only on block reference bits and periodic data structure updates. To ensure that adequate replacement candidates are available to deal with bursts of misses, the algorithm can identify multiple candidates per invocation and maintain a small pool of replacement blocks. These blocks are used by hardware to handle incoming responses, while GEN works in the background to keep this pool at a predetermined level. Figure 1 illustrates the pipelined data cache read with byte-offset adjustment.
The IIC [3] was originally designed to utilize large on-chip caches to achieve better performance than traditional LRU caches. The main drawback of this technique is that one cannot reliably determine whether such architectural schemes are beneficial unless a cache compression algorithm and hardware implementation are designed and evaluated for effective system-wide compression ratio, hardware overheads, and interaction with other portions of the cache compression system.
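The generational replacement policy described above can be sketched in software as a set of priority pools; the pool count and the one-level promotion/demotion rule below are illustrative assumptions rather than the exact algorithm of [3].

```python
from collections import deque


class GenerationalReplacement:
    """Sketch of a GEN-style policy with prioritized pools (illustrative only)."""

    def __init__(self, num_pools=4):
        # pools[0] is the lowest priority, pools[-1] the highest.
        self.pools = [deque() for _ in range(num_pools)]

    def insert(self, block):
        """Newly cached blocks start in the lowest-priority pool."""
        self.pools[0].append(block)

    def periodic_update(self, referenced):
        """Periodically move referenced blocks up one pool and others down one."""
        top = len(self.pools) - 1
        new_pools = [deque() for _ in self.pools]
        for level, pool in enumerate(self.pools):
            for block in pool:
                if block in referenced:
                    new_pools[min(level + 1, top)].append(block)
                else:
                    new_pools[max(level - 1, 0)].append(block)
        self.pools = new_pools

    def pick_victim(self):
        """Select a replacement candidate from the lowest non-empty pool."""
        for pool in self.pools:
            if pool:
                return pool.popleft()
        return None
```

Here `referenced` would be the set of blocks whose reference bits were set since the last update; the real policy in [3] also batches multiple candidates per invocation into a small replacement pool.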
2.4 Selective Compression Technique
Selective compression means that a data block is compressed only if its compression ratio is less than a specific threshold value. This technique can reduce the decompression overhead because the decompression process is required only for the compressed data blocks. In addition, the data expansion problem can be easily solved [4] by compressing only well-compressible blocks. As the threshold value becomes lower, the average compression ratio over all data blocks increases, so the compression benefit may be reduced. However, measuring the tradeoff between decompression time and compression ratio over a range of threshold values shows that the selective compression technique [4] can provide a performance advantage by reducing the impact of decompression overhead without a meaningful loss of compression ratio.
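A minimal sketch of this policy is shown below, assuming the compression ratio is measured as compressed size divided by original size (so lower is better); the function name and the example threshold are ours, not values from [4].

```python
import zlib


def selectively_compress(block: bytes, threshold: float = 0.5):
    """Store `block` compressed only if the achieved ratio beats `threshold`."""
    compressed = zlib.compress(block)          # any lossless compressor could be used here
    ratio = len(compressed) / len(block)
    if ratio < threshold:
        return "compressed", compressed        # must be decompressed on access
    return "uncompressed", block               # stored as-is: no decompression overhead
```

Lowering the threshold leaves more blocks uncompressed, trading away some compression benefit for a lower average decompression overhead.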
The main drawback of this technique is that
one cannot reliably determine whether the
architectural schemes are beneficial without, a
cache compression algorithm and hardware
implementation are designed and evaluated for
effective system-wide compression ratio [4],
hardware overheads, and interaction with other
portions of the cache compression system.
2.5 X-Match Compression Algorithm
X-Match is a dictionary-based compression
algorithm that has been implemented on an FPGA.
It matches 32-bit words using a content addressable
memory that allows partial matching with
dictionary entries and outputs variable-size
encoded data that depends on the type of match. To
improve coding efficiency, it also uses a move-to-front coding strategy [8] and represents smaller indexes with fewer bits. Although appropriate for compressing main memory, such hardware usually has a very large block size (1 KB for MXT and up to 32 KB for X-Match), which is inappropriate for compressing cache lines. It has been shown that for X-Match and two variants of the Lempel-Ziv algorithm, i.e., LZ1 and LZ2, the compression ratio for memory data deteriorates as the block size becomes smaller. For example, when the block size decreases from 1 KB to 256 B, the compression ratios for LZ1 and X-Match [4] increase by 11% and 3%, respectively. It can be inferred that the increase in compression ratio could be even larger when the block size decreases from 256 B to 64 B. In addition, such hardware has performance, area, or power consumption costs that conflict with its use in cache compression [8]. For example, if the MXT hardware were scaled to a 65 nm fabrication process and integrated within a 1 GHz processor, the decompression latency would be 16 processor cycles, about twice the normal L2 cache hit latency.
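The partial-match idea can be illustrated with a small software sketch. The scoring rule, the two-byte match threshold, and the returned tuple format are our assumptions; X-Match's actual code tables and CAM behaviour are more involved.

```python
def xmatch_lookup(word_bytes, dictionary):
    """Find the dictionary entry sharing the most bytes with a 4-byte word."""
    best_index, best_mask, best_hits = None, None, 0
    for index, entry in enumerate(dictionary):          # the CAM searches all entries in parallel
        mask = [a == b for a, b in zip(word_bytes, entry)]
        hits = sum(mask)
        if hits > best_hits:
            best_index, best_mask, best_hits = index, mask, hits
    if best_hits >= 2:                                   # assume a partial match needs >= 2 matching bytes
        literals = bytes(b for b, hit in zip(word_bytes, best_mask) if not hit)
        return "match", best_index, best_mask, literals  # emit index, match type, and unmatched bytes
    return "miss", bytes(word_bytes)                     # full literal; the word then enters the dictionary
```

The move-to-front step mentioned above would then renumber dictionary indexes so that recently matched entries receive the smaller, cheaper-to-encode indexes.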
2.6 C-Pack Compression Algorithm
C-Pack targets on-chip cache compression. It
permits a good compression ratio even when used
on small cache lines [8]. The performance, area, and power consumption overheads are low enough for practical use. This contrasts with schemes such as X-Match, which require complicated hardware to achieve an equivalent effective system-wide compression ratio. Prior work in cache compression does not adequately evaluate the overheads imposed by the assumed cache compression algorithms. C-Pack is twice as fast as the best existing hardware implementations potentially suitable for cache compression. For FPC to match this performance, it would require at least 8x the area of C-Pack [8]. The main drawback of this technique is that compression is not performed in all cases: when the total number of compressed bits exceeds the uncompressed line size, the content of the backup buffer is selected, and the backup buffer holds the actual 64-bit uncompressed word. Compression is performed at a ratio of 2:1, and this ratio cannot be expanded. Frequently used instructions are compressed and stored in a FIFO dictionary [8]; when a particular entry leaves the dictionary, it re-enters the dictionary in uncompressed form and has to be compressed again. Hence the memory is not used efficiently.
2.7 P-Match Compression Algorithm
The algorithm is based on pattern matching
and partial dictionary coding. Its hardware
implementation permits parallel compression of
multiple words without degradation of dictionary
match probability. Although the existing hardware implementation [7] mainly targets online cache compression, it can also be used in other high-performance lossless data compression [7] applications with few or no modifications. The hardware implementation targets compression ratios of 2:1 and 4:1 [7].
The P-Match algorithm achieves compression by two means (a simplified sketch follows this list):
1. It uses compact encodings for all data words (Dictionary Matching Unit).
2. It selects the output according to the priority of the bit patterns (Priority Selection Unit).
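The sketch below is a loose software analogue of these two units: several candidate encodings are generated for a word, and the cheapest available one is selected. The candidate set, bit costs, and priority order are illustrative assumptions, not the published P-Match code tables.

```python
def pmatch_encode_word(word, dictionary):
    """Pick the shortest available encoding for a 32-bit word (illustrative sketch)."""
    candidates = []                                        # (cost_in_bits, description)
    if word == 0:
        candidates.append((3, "zero word"))                # assumed 3-bit prefix only
    if word in dictionary:
        candidates.append((3 + 4, "dictionary index %d" % dictionary.index(word)))
    if word <= 0xFF:
        candidates.append((3 + 8, "one-byte literal"))
    candidates.append((3 + 32, "uncompressed word"))       # always-available fallback
    cost, choice = min(candidates)                         # priority selection: cheapest encoding wins
    return cost, choice
```

For example, with dictionary = [0xDEADBEEF], pmatch_encode_word(0xDEADBEEF, dictionary) selects the 7-bit dictionary encoding over the 35-bit uncompressed fallback.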
3. PROPOSED METHOD
Several data compression algorithms have been developed based on the BWT [5]. By using the BWT in the data compression process it is possible to obtain compression ratios close to the best statistical compressors, i.e., the prediction by partial matching (PPM) [5] family of algorithms. BWT-based compression algorithms have also proven to be more efficient in terms of computational resources and memory use than PPM-type algorithms; however, implementations of compression algorithms based on the BWT, either in hardware or software, still demand large amounts of memory resources and computational power. Hardware implementations of the BWT require a custom-built storage matrix capable of performing shifts and rotations of the input string. The matrix should also allow lexicographical sorting of its contents.
Improvements of traditional merge sort and quick sort algorithms have been proposed for software implementations of the BWT. A suffix list data structure proposed in [5] leads to antisequential and memory-efficient algorithms; the authors also describe a possible architecture for a BWT-based compression system in VLSI.
3.1 Comparison with the P-Match Compression Algorithm
The proposed method provides a lossless compression algorithm with a good compression ratio. We present a novel architecture based on a parallel sorting block to implement the BWT, and run-length encoding is used to encode the data (typically a stream of bytes), which improves the performance of the compression. The decompression performance of the proposed method is also more efficient than that of the P-Match compression algorithm.
3.2 Block Diagram of Proposed System
The proposed method uses a novel architecture based on a parallel sorting algorithm to implement the BWT in order to retrieve the data with minimum error. At first, the 64 input bits are split into 8-bit groups (8 x 8 = 64 bits) and given to the BWT block. The BWT block is based on a parallel sorting strategy that consists of comparison and swapping steps.
The wavesorter can be applied instead of the parallel sorting strategy; it consists of a group of slightly modified bidirectional shift registers that can perform shift-right and shift-left operations to move their current value to the adjacent register on the right or left. These registers are grouped into pairs by a comparator block that swaps them if a ‘less than’ condition is met. The architecture reads the input string, character by character, from memory and stores each character in the wavesorter. Every time a character is read, a shift-right operation is performed, followed by a comparison operation. When all the characters of the input string have been read, a partial sort of the string is held in the registers. To obtain a complete sort, the string is shifted out by changing the shift direction to the left. When the last character exits, the output string is fully sorted. To make a fair comparison of the parallel sorting strategy against the wavesorter strategy in terms of the total number of steps required to sort an array of data bits (8 x 8 = 64 bits), it is necessary to consider the steps required to store the sorted data back to memory. By using the BWT, the input data bits are segregated into eight arrays of 8-bit data, and the sorted data bits are stored in memory.
The output of the BWT is given to run-length encoding. Run-length encoding, like the move-to-front transform (MTF), is an encoding of data (typically a stream of bytes) designed to improve the performance of the entropy encoding stage of compression. The inverse BWT is used to retrieve the original data from the compressed data bits. Thus, decompression is achieved through the inverse BWT and inverse run-length encoding.
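A compact software model of this pipeline is sketched below: the forward BWT via rotation sorting, a simple run-length encoder, and the naive inverse BWT. It only shows what the hardware blocks compute; the sentinel byte, the (symbol, count) run format, and the function names are our choices, not part of the proposed architecture.

```python
def bwt_encode(data: bytes, sentinel: bytes = b"\x00") -> bytes:
    """Forward BWT: sort all rotations and keep the last column.

    Assumes the sentinel byte does not occur in `data`.
    """
    s = data + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return b"".join(rot[-1:] for rot in rotations)


def bwt_decode(last_column: bytes, sentinel: bytes = b"\x00") -> bytes:
    """Naive inverse BWT: rebuild the rotation table by repeated sorting."""
    table = [b""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i:i + 1] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


def rle_encode(data: bytes):
    """Run-length encode as (symbol, run length) pairs."""
    runs = []
    for symbol in data:
        if runs and runs[-1][0] == symbol:
            runs[-1][1] += 1
        else:
            runs.append([symbol, 1])
    return [(symbol, count) for symbol, count in runs]


def rle_decode(runs) -> bytes:
    return bytes(symbol for symbol, count in runs for _ in range(count))


# The 64-bit repeating pattern of Fig. 3.1, written here as ASCII characters.
data = b"0011" * 16
compressed = rle_encode(bwt_encode(data))
assert bwt_decode(rle_decode(compressed)) == data   # lossless round trip
```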
Input in 64 bits
(0011001100110011001100110011001100110011001100110011001100110011)
Split into 8 bits (8 x 8 bit = 64 bit)
Burrows-Wheeler transform
Run-length encoding
Compressed output based on input
Inverse BWT
Inverse run-length encoding
Decompressed output
Fig.3.1 Block Diagram of Proposed Method

Fig.3.2 Parallel sorting with two levels of comparators performed in one iteration.

By using the BWT in the data compression process it is possible to obtain compression ratios close to the best statistical compressors. Improvements of traditional merge sort and quick sort algorithms [5] have been proposed for software implementations of the BWT. For example, in Fig.3.2 the dotted lines indicate that all comparisons and swaps performed on the registers are carried out in parallel. In this way, it is possible to perform parallel comparisons and swaps by following the order in which the registers are compared: first the odd registers and then the even registers. Thus, for an array of n data, the number of steps required for a sorting iteration is n-1. This number of steps can be further reduced [5] by connecting the comparators used with the odd adjacent registers to the comparators used with the even adjacent registers, as shown in Fig.3.2. Then, the total number of steps for sorting [5] the array is at most n/2. This sorting strategy will be referred to as the parallel sorting strategy.
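In software, one iteration of this two-level strategy corresponds to an odd pass followed by an even pass of compare-and-swap operations (odd-even transposition sorting); at most about n/2 such iterations sort an n-element array. The sketch below is a sequential model of what the comparator levels of Fig.3.2 do in parallel.

```python
def parallel_sort_iteration(regs):
    """One iteration: a first-level pass over pairs (0,1),(2,3),... followed by
    a second-level pass over pairs (1,2),(3,4),... (sequential model of Fig.3.2)."""
    for start in (0, 1):
        for i in range(start, len(regs) - 1, 2):
            if regs[i] > regs[i + 1]:
                regs[i], regs[i + 1] = regs[i + 1], regs[i]   # comparator swap
    return regs


def parallel_sort(regs):
    """Repeating the iteration ceil(n/2) times fully sorts the array."""
    for _ in range((len(regs) + 1) // 2):
        parallel_sort_iteration(regs)
    return regs


# Example: sorting one of the eight 8-bit arrays (here, ASCII characters).
print("".join(parallel_sort(list("00110011"))))   # -> 00001111
```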
To make a fair comparison of the parallel sorting strategy against the wavesorter strategy [5] in terms of the total number of steps required to sort an array, it is necessary to consider the steps used to read data from memory and the steps required to store the sorted data back to memory. The proposed approach is based on the same register-array structure used in the wavesorter strategy. With this kind of array, data can be stored by sending a datum to the first register; later, when the second datum is sent to the first register, the value in the first register is shifted to the second register. Thus, for every datum sent to the array to be stored, the values in the registers are shifted to their respective adjacent registers. This process requires n steps [5]. The same number of steps is required to take data out of the array. This approach allows storing a new set of data in the array while the previous set is being sent back to memory.
Suffix sorting might require more than one sorting iteration. If k sorts are required, then the parallel sorting strategy requires (n + n/2)k + n steps to sort an array of n data [5]. Thus, the total number of steps required can be obtained from the following equation:
fps_steps(n, k) = n((3/2)k + 1) ----------------- (1)
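As a quick check of equation (1), the snippet below evaluates it for the 8-element arrays used in the proposed design; n = 8 and k = 1 are example values chosen here, not figures reported in the paper.

```python
def fps_steps(n, k):
    """Total steps of the parallel sorting strategy, equation (1): n((3/2)k + 1)."""
    return n * ((3 / 2) * k + 1)

# One sorting iteration (k = 1) over an 8-element array (n = 8):
# 8 * (1.5 + 1) = 20 steps, matching (n + n/2)*k + n = (8 + 4)*1 + 8 = 20.
print(fps_steps(8, 1))   # 20.0
```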
The parallel strategy leads to a significant reduction of about 30% compared to the wavesorter approach. Furthermore, in additional sorts the necessary number of sorting steps is equal to the number of characters in the biggest group of identical characters [5] divided by 2 (recall that an additional sort is implied when groups of identical adjacent characters appear in the array). This implies that, in practice, it is possible to reduce the number of steps needed to solve the suffix sorting problem by more than 30%.
3.3 Run Length Encoding
The move-to-front transform (or MTF) is an
encoding of data (typically a stream of bytes)
designed to improve the performance of entropy
encoding techniques of compression. When
efficiently implemented, it is fast enough that its
benefits usually justify including it as an extra step
in data compression algorithms.
The main idea is that each symbol in the
data is replaced by its index in the stack of
“recently used symbols”. For example, long
sequences of identical symbols are replaced by as
many zeroes, whereas when a symbol that has not
been used in a long time appears, it is replaced with
a large number. Thus at the end the data is
transformed into a sequence of integers; if the data
exhibits a lot of local correlations, then these
integers tend to be small.
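A minimal move-to-front encoder/decoder pair illustrating this idea is given below; the choice of an initial 0-255 byte ordering for the symbol stack is an assumption of the sketch.

```python
def mtf_encode(data: bytes) -> list:
    """Replace each byte by its index in the list of recently used symbols."""
    stack = list(range(256))              # assumed initial symbol order
    output = []
    for symbol in data:
        index = stack.index(symbol)
        output.append(index)
        stack.pop(index)
        stack.insert(0, symbol)           # move the symbol to the front
    return output


def mtf_decode(indices: list) -> bytes:
    stack = list(range(256))
    output = bytearray()
    for index in indices:
        symbol = stack.pop(index)
        output.append(symbol)
        stack.insert(0, symbol)
    return bytes(output)


# Long runs of identical symbols become runs of zeros, which compress well:
print(mtf_encode(b"aaabbb"))   # [97, 0, 0, 98, 0, 0]
```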
4. RESULTS AND DISCUSSIONS
We have simulated our design using ModelSim version 6.5. Two outputs were simulated: one shows the BWT output, as given in Fig.4.1, while the others are the compressed and decompressed outputs shown in Fig.4.2 and Fig.4.3.
Fig.4.1 Simulation Result for Burrows-Wheeler Transform
Fig.4.2 Simulation Result for Compression
Fig.4.3 Simulation Result of Decompression
5. CONCLUSION
We have presented an architecture that implements the BWT based on a parallel sorting block. The proposed architecture targets compression with a compression ratio of 4:1. This task includes the solution of the suffix sorting problem and the generation of the output string corresponding to the BWT. The proposed architecture can be scaled to more than 128 characters without significant changes to the control and sorting blocks.
A full implementation of a BWT
compression algorithm is under way; this involves
the integration of the proposed architecture with
Move to Front, Run Length Encoding and Entropy
Encoder blocks.
6. SCOPE FOR FURTHER RESEARCH
Future work includes the implementation of a new storing strategy that allows a reduction in the latency associated with sequential access of data. Also, pipeline registers can be inserted between the two levels of comparators to reduce the critical path delay and thus increase the maximum operating frequency.
7. ACKNOWLEDGMENT
The authors acknowledge the contributions of the students and faculty of Velalar College of Engineering and Technology for helping in the design of the compression circuitry and for tool support. The authors also thank the anonymous reviewers for their thoughtful comments and constructive critique, from which this paper greatly benefited.
8. REFERENCES
[1] A. Alameldeen and D. A. Wood, (2004) “Frequent pattern compression: A significance-based compression scheme for L2 caches,” Dept. Comp. Sci., Univ. of Wisconsin-Madison, Tech. Rep. 1500.
[2] A. Deepa, M. Nisha Angeline and C. N. Marimuthu, (2011) “P-Match: A Microprocessor Cache Compression Algorithm,” 2nd International Conference on Intelligent Information Systems and Management (IISM’11), July 14-16, p. 98.
[3] E. G. Hallnor and S. K. Reinhardt, (2004) “A compressed memory hierarchy using an indirect index cache,” in Proc. Workshop on Memory Performance Issues, pp. 9-15.
[4] J.-S. Lee et al., (1999) “Design and evaluation of a selective compressed memory system,” in Proc. Int. Conf. Computer Design, pp. 1.
[5] José Martínez, René Cumplido and Claudia Feregrino, (2005) “An FPGA-based Parallel Sorting Architecture for the Burrows Wheeler Transform,” Proceedings of the 2005 International Conference on Reconfigurable Computing and FPGAs (ReConFig 2005), IEEE.
[6] Ming-Bo Lin, Jang-Feng Lee and Gene Eu Jan, (September 2006) “A Lossless Data Compression and Decompression Algorithm and Its Hardware Architecture,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 9.
[7] M. Nisha Angeline, S. K. Manikandan and S. Valarmathy, (July-August 2012) “VLSI Design of Cache Compression in Microprocessor using Pattern Matching Technique,” IOSR Journal of Electronics and Communication Engineering, vol. 1, issue 6.
[8] P. Pujara and A. Aggarwal, “Restrictive compression techniques to increase level 1 cache capacity,” Proc. Int. Conf. Computer Design, pp. 327-333.
[9] Xi Chen, Lei Yang and Robert P. Dick, (August 2010) “C-Pack: A High-Performance Microprocessor Cache Compression Algorithm,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 8.
9. AUTHOR’S PROFILE
Gomathi S. received the B.E. degree in Electronics and Communication Engineering from Government College of Engineering, Bargur, under Anna University, Chennai, in 2005. She worked as a Lecturer from 2006 to 2012. She has delivered a lecture in an AICTE-sponsored Staff Development Programme and has participated in various national workshops and seminars. She is currently pursuing the Master of Engineering in VLSI Design at Velalar College of Engineering and Technology under Anna University, Chennai. Her areas of research interest are VLSI Signal Processing and VLSI Design Techniques.
Nisha Angeline M. received the B.E. degree in Electronics and Instrumentation Engineering from the Indian Engineering College, Thirunelveli District, in 2003 and the M.E. degree in VLSI Design from Kongu Engineering College, Perundurai, in 2006. She received the first rank in the PG course for her academic performance. She is working as a Professor (Sr. Gr.) in the Department of ECE, Velalar College of Engineering and Technology, Erode. She is currently pursuing a Ph.D. under Anna University in the area of VLSI Design. She has published three papers in international journals and presented five papers in national and international conferences. Her areas of interest are Digital Electronics, VLSI Architecture, and VLSI Signal Processing.