Single-Pass Cache Optimization
Clive Butler and Ruofan Yang

Clive Butler

Introduction of Problem
• Embedded systems execute a single application, or a class of applications, repeatedly.
• An emerging methodology for designing embedded systems utilizes configurable processors.
• The configurable cache parameters are size, associativity, and line size.
• An energy model and an execution-time model are developed to find the best cache configuration for the given embedded application.
• Current processor design methodologies rely on reserving a large enough chip area for caches while conforming to area, performance, and energy cost constraints.
• A customized cache allows designers to meet tighter energy-consumption, performance, and cost constraints.

Introduction of Problem
• In existing low-power processors, cache memory is known to consume a large portion of the on-chip energy.
• The cache consumes up to 43% to 50% of the total system power of a processor.
• In embedded systems where a single application or a class of applications is repeatedly executed on a processor, the memory hierarchy can be customized so that an optimal configuration is achieved.
• The right choice of cache configuration for a given application can have a significant impact on overall performance and energy consumption.

Introduction of Problem
• Estimating hit and miss rates is fairly easy using tools such as Dinero.
• It can, however, be enormously time consuming to do so for the various cache sizes, associativities, and line sizes.
• Using Dinero to estimate the cache miss rate for a number of cache configurations means that a large program trace must be repeatedly read and evaluated, which is very time consuming.

Dinero
• Dinero is a trace-driven cache simulator.
• Simulations are repeatable.
• One can simulate either a unified cache (mixed: data and instructions cached together) or separate instruction and data caches.
• Cheaper than evaluating in hardware.

Dinero
• A din record is a two-tuple: label, address.
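The din record format just described can be read with a few lines of code. A minimal sketch, assuming the conventional Dinero label meanings (0 = read, 1 = write, 2 = instruction fetch) and hexadecimal addresses; the helper name `parse_din` and the sample trace are ours:

```python
# Minimal sketch of reading a din trace: each record is a two-tuple
# "label address". Labels follow the Dinero convention (0 = read data,
# 1 = write data, 2 = instruction fetch); addresses are read as hex.

def parse_din(text):
    """Parse din records into a list of (label, address) tuples."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue                     # skip blank/short lines
        records.append((int(fields[0]), int(fields[1], 16)))
    return records

# A tiny hypothetical trace: three instruction fetches, one read, one write.
trace = "2 0\n0 1000\n2 4\n1 1008\n2 8"
records = parse_din(trace)
fetches = [addr for (label, addr) in records if label == 2]
```

Here `records` contains five tuples and `fetches` the three instruction-fetch addresses; a cache simulator then replays such records against a given configuration.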
• Cache parameters are set by command-line options.
• Record labels: 0 = read data, 1 = write data, 2 = instruction fetch, 3 = escape record, 4 = escape record (causes a cache flush).
• Dinero uses the priority-stack method of memory-hierarchy simulation to increase flexibility and to improve simulator performance for highly associative caches.

Method 1: Tree-Based Method
• Presents a methodology to rapidly and accurately explore the cache design space.
• This is done by estimating cache miss rates for many different cache configurations simultaneously, and by investigating the effect of different cache configurations on the energy and performance of a system.
• The simultaneous evaluation can be performed rapidly by taking advantage of the high correlation between the cache behavior of different cache configurations.

ASP-DAC Paper: General Simulation Process
• Step 1: use the index bits m(max)…m(min) of the cache address to index an array that stores tree addresses.
• Step 2: go to the tree address and traverse the tree.
• Step 3: find the node and follow it to the linked list.
• Step 4: look for a match in the linked list.
• Misses are accumulated in a cache-miss table keyed by (L, N, A).

ASP-DAC Paper: Tree Example
• [Figure: binary forest for the cache address 1010. The low-order bits index the forest and the remaining tag bits (k) select the path through the tree: 101(0) for cache size 2, 10(10) for cache size 4, 1(010) for cache size 8. Each forest assumes a fixed line size.]

ASP-DAC Paper: Linked List, Set-Associative Lookup
• [Figure: one linked list per set, ordered from the most recently used element to the least recently used element. The position at which a match is found determines hit or miss for each associativity (Assoc. = 1, 2, 4); the rest of the address is used as the tag. Results go to the miss-count table (L, N, A → number of cache misses).]

ASP-DAC Paper: Linked List, LRU Update
• [Figure: on each access the referenced element is moved to the most-recently-used end of the list, shown for Assoc. = 1, 2, and 4, together with the miss-count table (L, N, A → number of cache misses).]
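The linked-list lookup and LRU update sketched above can be captured in a few lines. The point is that one most-recently-used-first list per set serves every associativity at once: a match found at 0-based position p is a hit for every configuration whose associativity exceeds p, since a w-way LRU set holds exactly the w most recently used blocks. This is a behavioral sketch assuming LRU replacement; the function name and example trace are ours:

```python
# One MRU-first list per set evaluates several associativities in a
# single pass: a match at 0-based position p hits in every w-way
# configuration with w > p, because an LRU set of w ways holds the
# w most recently used blocks.

def simulate_set(blocks, assocs):
    """Count hits per associativity for one set's block stream."""
    mru = []                          # list head = most recently used
    hits = {w: 0 for w in assocs}
    for b in blocks:
        if b in mru:
            p = mru.index(b)          # position of the match
            for w in assocs:
                if w > p:             # within the w most recent blocks
                    hits[w] += 1
            mru.remove(b)
        mru.insert(0, b)              # LRU update: move to the front
    return hits
```

For the block stream 0, 1, 2, 0, 1, 0, 2 this returns 0, 1, and 4 hits for associativities 1, 2, and 4, i.e., 7, 6, and 3 misses, all from one traversal of the trace.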
Detailed Trace Example
• Example specifications:
• Cache size (N) varies from 32 (max) down to 2 (min).
• Associativity (A) varies from 4 (max) down to 1 (min).
• Number of cache sets (M) varies from 8 (max) down to 1 (min).
• A fixed line size (L) is assumed.

Detailed Trace Example
• Instruction trace (k | m split; treating the low three bits as the block offset, addresses 0, 8, and 16 fall into blocks 0, 1, and 2):
1. 000000 => 0
2. 001000 => 8
3. 010000 => 16
4. 000000 => 0
5. 001000 => 8
6. 000000 => 0
7. 010000 => 16
• Results for Assoc. = 1:
L N M Miss count
1 8 8 3
1 4 4 3
1 2 2 5
1 1 1 7
• Results for Assoc. = 2:
L N M Miss count
1 16 8 3
1 8 4 3
1 4 2 3
1 2 1 6
• [Figure: the trace addresses mapped onto the sets for M = 1, 2, 4, and 8.]

ASP-DAC Results
• Using benchmarks from MediaBench, this method is on average 45 times faster at exploring the design space than Dinero IV, while retaining 100% accuracy.

Method 2: Table-Based Method
• Two cache evaluation techniques used to evaluate the design space are analytical modeling and execution-based evaluation.
• SPCE presents a simplified yet efficient way to extract the locality properties of an entire cache-configuration design space in a single pass.
• The discussion covers related work, an overview of SPCE, the addressing-behavior properties used to estimate the cache miss rate, and the experiments and results.

Related Work
• Much research in this area needs multiple passes to explore all configurable parameters, or employs large and complex data structures, which restricts its applicability.
• Algorithms for single-pass cache simulation examine a set of caches concurrently: Mattson; Hill and Smith; Sugumar and Abraham; Cascaval and Padua.
• Janapsatya et al.
present a technique to evaluate all the different cache parameters simultaneously, but it was not designed with a hardware implementation in mind.
• This paper's methodology uses simple array structures, which are more amenable to a lightweight hardware implementation.

SPCE Overview: Definitions
• T[t] is a time-ordered sequence of referenced addresses (t a positive integer) of length |T|, such that T[t] is the t-th address referenced.
• If T[ti]_b = T[ti+d]_b, then the addresses T[ti] and T[ti+d] are references to the same cache block of 2^b words.
• d is defined as the delay: the number of unique cache references occurring between any two references for which T[ti]_b = T[ti+d]_b.

SPCE Overview: Definitions
• The locality in the address sequence T[ti] of a running application ai is evaluated by counting the occurrences where T[ti]_b = T[ti+d]_b and registering each one in cell L(b, d) of the locality table (2^b is the block size, d is the delay).

Fully-Associative
• A fully-associative cache configuration is defined by the notation cj(b, n), where b defines the line size in words and n the total number of lines in the cache.
• The locality table L(b, d) provides an efficient way to estimate the cache miss rate of fully-associative caches.

Fully-Associative Example
• A sequence of addresses T and the corresponding block addresses:
t0: 0 → 0
t1: 8 → 1
t2: 16 → 2
t3: 0 → 0
t4: 8 → 1
t5: 0 → 0
t6: 16 → 2
• The reuse of block 0 at t3 has delay d = 3, and its reuse at t5 has delay d = 2.
• Locality table for the trace T: L(1) = 0, L(2) = 1, L(3) = 3, L(4) = 0.

Set-Associative
• Most real-world cache devices are built as direct-mapped or set-associative structures.
• Because of mapping conflicts, L alone cannot be used to estimate their misses; define s as the number of sets, independent of the associativity (for a direct-mapped cache the set size is 1 and s = n).
• To analyze the cache conflicts, a conflict table Kα(b, s) is built (b is the block size, s the number of sets), composed of α layers, one for each associativity explored.

Set-Associative
• The value stored in each element of the table Kα(b, s) indicates how many times the same block (of size 2^b) is repeatedly
referenced and results in a hit.
• A given cache configuration with associativity w is capable of overcoming no more than w − 1 mapping conflicts.
• The number of cache hits is determined by summing the cache hits from layer α = 1 up to layer α = w, where w is the associativity.

Algorithm Implementation

Experiment Setup
• SPCE is implemented as a standalone C++ program that processes an instruction-address trace file; instruction-address traces were gathered with SimpleScalar for 9 benchmarks arbitrarily chosen from Motorola's PowerStone benchmark suite.
• Since 64 bytes is the largest block size in the design space utilized, bmax = 3; smax is defined by the configuration with the maximum number of sets in the design space.
• Performance for the benchmark suite is examined with SPCE and also with a very popular trace-driven cache simulator (DineroIV).

Results
• The performance of SPCE and DineroIV is compared over the 45 cache configurations.

Conclusion
• Both the tree-based method and the table-based method (SPCE) ease cache miss-rate estimation and reduce simulation time.
• Compared to DineroIV, the average speedup is around 30 times.
• Future work includes extending the design-space exploration by considering a second level of cache.
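The two SPCE tables described above lend themselves to the same single-pass treatment. A minimal sketch for one fixed block size, assuming LRU stacks: the delay d of a reuse is taken as its 1-based stack distance, the conflict layer as the within-set stack distance, and the function names and example trace (the block sequence 0, 1, 2, 0, 1, 0, 2 from the fully-associative example) are ours:

```python
# One-pass sketches of SPCE's two tables for a fixed block size b.
# locality_table fills L(d); an n-line fully-associative LRU cache
# then scores sum(L[d] for d <= n) hits. conflict_layers fills K[a]
# for s sets; a w-way cache scores sum(K[a] for a <= w) hits.

def locality_table(blocks):
    """L(d): number of reuses at each delay (LRU stack distance) d."""
    stack, L = [], {}
    for b in blocks:
        if b in stack:
            d = stack.index(b) + 1    # 1-based stack distance
            L[d] = L.get(d, 0) + 1
            stack.remove(b)
        stack.insert(0, b)            # most recent block on top
    return L

def conflict_layers(blocks, s):
    """K[a]: hits whose within-set stack distance is exactly a."""
    sets, K = {}, {}
    for b in blocks:
        lst = sets.setdefault(b % s, [])
        if b in lst:
            a = lst.index(b) + 1      # layer = within-set distance
            K[a] = K.get(a, 0) + 1
            lst.remove(b)
        lst.insert(0, b)
    return K

trace = [0, 1, 2, 0, 1, 0, 2]         # blocks of the earlier example
```

For this trace, `locality_table` gives L(2) = 1 and L(3) = 3 (so a 4-line fully-associative cache scores 4 hits and 3 misses), and `conflict_layers` with s = 2 gives K[1] = 2 and K[2] = 2 (so a 2-way, 2-set cache scores 4 hits and 3 misses), matching the detailed trace example.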