Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Configurable Cache Subsetting for Fast Cache Tuning Pablo Viana1, Ann Gordon-Ross2, Eamonn Keogh2, Edna Barros1, Frank Vahid2 Science Institute – CIN Federal University of Pernambuco - UFPE, Brazil 1Computer 2Department of Computer Science and Engineering University of California, Riverside This work was supported in part by the Capes Foundation BEX 1366/04-1 Outline Introduction Configurable cache tuning Cache configuration space Configuration space subsetting problem An exhaustive analysis A heuristic to subset the configuration space Applying the subsetting method to a 2-level cache Future work on other configurable platforms 1/25 Introduction Caches can improve system performance. However, caches are power hungry. Thus, the cache is a good candidate for optimization 44% Power analysis of ARM970T S. Segars (ISSCC, 2001) 2/25 Motivation Tuning cache parameters (Total size, Line size, Associativity) to an application can reduce energy by 60% on average. Choose lowest energy configuration Microprocessor Cache Tuning Energy Main Memory Possible Cache Configurations 3/25 Cache Configuration Tuning Each application has different cache requirements 4kB, 2-way Application 1 2kB, 1-way 16B/line 16B/line Application 2 4kB, 4-way 32B/line Application 3 Application 4 2kB, 1-way 8kB, 2-way Application 5 64B/line 16B/line ... 8kB, 4-way 32B/line Application N Cache Configuration Space Configurable cache optimization 4/25 Related Work Configurable caches Soft-core processors ARM; MIPS; Tensillica etc. Hard-core processors Motorola M*Core (Malik, ISLPED’00); Albonesi (MICRO’00); Zhang (ISCA’03) Configurable cache tuning Mostly done manually in practice Sub-optimal Time-consuming 5/25 Related Work Automated Methods for Cache Tuning Methods and heuristics for Design Space Exploration Single-level caches (Tens of configurations) Platune (Givargis TCAD’02, Palesi CODES’02) Zhang (RSP’03) Two-level caches (Hundreds to thousands of configurations) Tcat (Gordon-Ross, DATE’04) Level 2 Level 1 Total size Line size Associativity Say 50 configs. * Total size Line size Associativity = 2500 configs Say 50 configs. 6/25 Problem Context Do we really need such a large number of cache configurations? Could just few carefully-chosen configurations adequately cover the large configuration space? A smaller configuration space would be easier to explore (via simulation-based or dynamic tuning) 7/25 Problem Context Potential scenario: A configurable microprocessor vendor pre-selects a subset of configurations for each particular application domain. The user selects the most appropriate domain for a target application and examine only the pre-selected subset. Automotive control Image Filtering Network protocol Configuration space Application domains Configuration subset 8/25 Cache Configuration Tuning Energy consumption Energy to run a given application on different configurations Not that bad. Energy increase Lower energy Possible cache configurations 9/25 Cache Configuration Subsetting Application 1 Application 2 subset Application 3 Application 4 Application 5 ... Application N Cache Configuration Space Near optimal cache tuning: Average energy increases 10/25 Cache Configuration Subsetting Problem Definition: Identify the subset of configurations that adequately covers the configuration space for the application domain. We state that p configurations are necessary to cover a space of m points. Exhaustive Approach: Select the subset of p configurations from the space m; Criterion for selection: Choosing the subset which keep the average energy of the tuned cache nearest to the optimal. 11/25 Experimental Setup Configurable cache architecture for initial experiments: Our target configurable cache architecture is based on Zhang/Vahid/Najjar’s “Highly-Configurable Cache Architecture for Embedded Systems,” ISCA 2003 We set the base cache to suport the following 18 configurations: Instruction cache Memory Processor Data cache Energy model for estimation: e(cj, ai ) eCache eMemory eCPUstall 12/25 Exhaustive Approach y Subsets of size p=18 down to 1 were chosen through the exhaustive evaluation of the average energy increase. 50% Instruction cache 45% Energy Increase 40% Data cache 35% 30% 25% 20% 15% Energy increase is still under 5%. 10% 5% 0% 0 1 2 3 4 5 6 7 8 9 10 11 Size of the subset 12 13 14 15 16 17 18 13/25 Exhaustive Approach Good results, but exhaustively determining the subsets requires too much computation: For m=18 and 1 < p < 18 it gives us 262,143 possible combinations. That’s too expensive! We need a more efficient way to find the subset of configurations... 14/25 Looking for a Heuristic First attempt: Choose the p configurations that offered optimal energy for the largest number of applications. 50% Energy increase 45% 1st trial I$ 1st trial D$ Exhaustive I$ Exhaustive D$ 40% 35% 30% 25% 20% 15% 10% 5% 0% 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 Size of the subset 15/25 Looking for a Heuristic Second attempt: Hierarchical clustering according to the similarity between configurations on their average energy savings. 60% 2nd trial I$ 2nd trial D$ Exhaustive I$ Exhaustive D$ 55% Energy increase 50% 45% 40% 35% 30% 25% 20% 15% 10% 5% 0% 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 Size of the subset 16/25 Similar Problem We found a problem similar to our “subsetting puzzle”. The problem of segmenting time series. Time series data set Data set representation Error (%) depends on average accuracy of the data set representation 17/25 Similar Problem The method proposed by Keogh’s can be applied, to reduce the number of primary color for displaying an image (IEEE Int. Conf. On Data Mining, 2001) Color/nuances of the image are merged according to the associated error/difference in the final image. 36 colors Merging and measuring error Ex.: Lower average error? by or 8 colors by Error of replacing the neighboring colors by merging them 18/25 Adapting to the Subsetting Problem Similarly, we may iteratively discard configurations by merging them. By merging two configs cj and ck into ck means that all applications which were tuned by cj, now use ck. Application ai cj ck merging e(cj,ai) e(ck,ai) energy increase 19/25 Keogh’s Heuristic All the possible merges of two adjacent (neighboring) configurations in the space are evaluated to find the best merging. c21 c71 cc13 7 20/25 Keogh’s Heuristic Accuracy of the results 60% Keogh's I$ Keogh's D$ Exhaustive I$ Exhaustive D$ 55% Energy increase 50% 45% 40% 35% 30% 25% 20% 15% 10% 5% 0% 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 Size of the subset 21/25 Comparing the Results We went on merging the configuration space while the energy increase remains under 5% (4 configs). Keogh’s heuristic Instruction cache 2kB 32B/line 1-way 2kB 32B/line 1-way 4kB 32B/line 2-way 4kB 32B/line 1-way 4kB 64B/line 1-way 4kB 32B/line 2-way 8kB 32B/line 1-way 8kB 64B/line 1-way Average Error: 0.67% Average Error: 0.69% Exhaustive Data cache 2kB 16B/line 1-way 2kB 16B/line 1-way 2kB 64B/line 1-way 2kB 64B/line 1-way 4kB 16B/line 2-way 4kB 16B/line 2-way 8kB 16B/line 2-way 8kB 16B/line 2-way Average Error: 4.75% Average Error: 4.75% Runtime: 835,000 ms Runtime: 991 ms Speed up: 800x 22/25 Two-level Configurable Cache Instruction cache Processor Data cache 2nd Level cache Memory Optimal cache configuration 1.00 0.90 0.80 0.70 Tuning with 4 configs Average energy increase: 3.36% 0.60 0.50 0.40 0.30 0.20 0.10 0.00 23/25 Where Else this Method Could be Applied? Other parameterized IP cores Buses, I/O devices (Word size, bandwidth, etc). Parameterized Platforms, Processor’s functional units, cache, bus, multiplier, IPs. Optimization for low energy, area, performance and others. 24/25 Conclusions The configurable cache subsetting problem was presented; A Data Mining algorithm for segmenting time series was adapted to tackle the cache subsetting problem; Subsetting the Two-level cache configuration space, keeping the average energy increase under 5% takes: Keogh’s heuristic: less than 1 minute Exhaustively: around 53 hours. Thanks to the versatility of the proposed method, our next step is to apply it for Platform customization. Thank you ! 25/25