Download ppt - Ann Gordon-Ross

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Configurable Cache Subsetting for
Fast Cache Tuning
Pablo Viana1, Ann Gordon-Ross2, Eamonn Keogh2,
Edna Barros1, Frank Vahid2
Science Institute – CIN
Federal University of Pernambuco - UFPE, Brazil
1Computer
2Department
of Computer Science and Engineering
University of California, Riverside
This work was supported in part by the Capes Foundation BEX 1366/04-1
Outline
 Introduction


Configurable cache tuning
Cache configuration space
 Configuration space subsetting problem

An exhaustive analysis
 A heuristic to subset the configuration space
 Applying the subsetting method to a 2-level cache
 Future work on other configurable platforms
1/25
Introduction
 Caches can improve system performance. However,
caches are power hungry.
 Thus, the cache is a good candidate for optimization
44%
Power analysis of ARM970T
S. Segars (ISSCC, 2001)
2/25
Motivation
Tuning cache parameters (Total size, Line size, Associativity) to an
application can reduce energy by 60% on average.
Choose lowest energy
configuration
Microprocessor
Cache
Tuning
Energy

Main Memory
Possible Cache Configurations
3/25
Cache Configuration Tuning
Each application has different cache requirements
4kB, 2-way
Application 1
2kB, 1-way
16B/line
16B/line
Application 2
4kB, 4-way
32B/line
Application 3
Application 4
2kB, 1-way
8kB, 2-way
Application 5
64B/line
16B/line
...
8kB, 4-way
32B/line
Application N
Cache Configuration Space
Configurable cache optimization
4/25
Related Work
 Configurable caches

Soft-core processors
ARM;
 MIPS;
 Tensillica
 etc.


Hard-core processors
Motorola M*Core (Malik, ISLPED’00);
 Albonesi (MICRO’00);
 Zhang (ISCA’03)

 Configurable cache tuning



Mostly done manually in practice
Sub-optimal
Time-consuming
5/25
Related Work
 Automated Methods for Cache Tuning

Methods and heuristics for Design Space Exploration

Single-level caches (Tens of configurations)
Platune (Givargis TCAD’02, Palesi CODES’02)
 Zhang (RSP’03)


Two-level caches (Hundreds to thousands of configurations)

Tcat (Gordon-Ross, DATE’04)
Level 2
Level 1
Total size
Line size
Associativity
Say 50 configs.
*
Total size
Line size
Associativity
= 2500 configs
Say 50 configs.
6/25
Problem Context
 Do we really need such a large number of cache
configurations?
 Could just few carefully-chosen configurations adequately
cover the large configuration space?
 A smaller configuration space would be easier to explore (via
simulation-based or dynamic tuning)
7/25
Problem Context
Potential scenario:
 A configurable microprocessor vendor pre-selects a subset of
configurations for each particular application domain.
 The user selects the most appropriate domain for a target
application and examine only the pre-selected subset.
Automotive
control
Image
Filtering
Network
protocol
Configuration space
Application domains
Configuration subset
8/25
Cache Configuration Tuning
Energy consumption
Energy to run a given application on different configurations
Not that bad.
Energy increase
Lower energy
Possible cache configurations
9/25
Cache Configuration Subsetting
Application 1
Application 2
subset
Application 3
Application 4
Application 5
...
Application N
Cache Configuration Space
Near optimal cache tuning: Average energy increases
10/25
Cache Configuration Subsetting
Problem Definition:
 Identify the subset of configurations that adequately
covers the configuration space for the application
domain.
 We state that p configurations are necessary to cover a
space of m points.
Exhaustive Approach:

Select the subset of p configurations from the space m;
Criterion for selection:
Choosing the subset which keep the average energy of the
tuned cache nearest to the optimal.
11/25
Experimental Setup
 Configurable cache architecture for initial experiments:

Our target configurable cache architecture is based on
Zhang/Vahid/Najjar’s “Highly-Configurable Cache Architecture
for Embedded Systems,” ISCA 2003

We set the base cache to suport the following 18 configurations:
Instruction
cache
Memory
Processor
Data
cache
 Energy model for estimation:
e(cj, ai )  eCache eMemory  eCPUstall
12/25
Exhaustive Approach
y
 Subsets of size p=18 down to 1 were chosen through the
exhaustive evaluation of the average energy increase.
50%
Instruction cache
45%
Energy Increase
40%
Data cache
35%
30%
25%
20%
15%
Energy increase is still under 5%.
10%
5%
0%
0
1
2
3
4
5
6
7
8
9
10
11
Size of the subset
12
13
14
15
16
17
18
13/25
Exhaustive Approach
 Good results, but exhaustively determining the subsets requires too
much computation:
 For m=18 and 1 < p < 18 it gives us 262,143 possible combinations.
That’s too expensive!
 We need a more efficient way to find the subset of configurations...
14/25
Looking for a Heuristic
 First attempt:

Choose the p configurations that offered optimal energy for the
largest number of applications.
50%
Energy increase
45%
1st trial I$
1st trial D$
Exhaustive I$
Exhaustive D$
40%
35%
30%
25%
20%
15%
10%
5%
0%
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Size of the subset
15/25
Looking for a Heuristic
 Second attempt:

Hierarchical clustering according to the similarity between
configurations on their average energy savings.
60%
2nd trial I$
2nd trial D$
Exhaustive I$
Exhaustive D$
55%
Energy increase
50%
45%
40%
35%
30%
25%
20%
15%
10%
5%
0%
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Size of the subset
16/25
Similar Problem
 We found a problem similar to our “subsetting puzzle”.
 The problem of segmenting time series.
Time series data set
Data set representation
Error (%) depends on average accuracy of the data set representation
17/25
Similar Problem
 The method proposed by Keogh’s can be applied, to reduce the
number of primary color for displaying an image

(IEEE Int. Conf. On Data Mining, 2001)
 Color/nuances of the image are merged according to the associated
error/difference in the final image.
36 colors
Merging and measuring error
Ex.: Lower average error?
by
or
8 colors
by
Error of replacing the neighboring colors by merging them
18/25
Adapting to the Subsetting Problem
 Similarly, we may iteratively discard configurations by merging them.
 By merging two configs cj and ck into ck means that all applications
which were tuned by cj, now use ck.
Application
ai
cj
ck
merging
e(cj,ai)
e(ck,ai)
energy increase
19/25
Keogh’s Heuristic
 All the possible merges of two adjacent (neighboring) configurations in
the space are evaluated to find the best merging.
c21
c71
cc13
7
20/25
Keogh’s Heuristic
 Accuracy of the results
60%
Keogh's I$
Keogh's D$
Exhaustive I$
Exhaustive D$
55%
Energy increase
50%
45%
40%
35%
30%
25%
20%
15%
10%
5%
0%
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Size of the subset
21/25
Comparing the Results
 We went on merging the configuration space while the energy
increase remains under 5% (4 configs).
Keogh’s heuristic
Instruction cache
2kB 32B/line 1-way
2kB 32B/line 1-way
4kB 32B/line 2-way
4kB 32B/line 1-way
4kB 64B/line 1-way
4kB 32B/line 2-way
8kB 32B/line 1-way
8kB 64B/line 1-way
Average Error: 0.67%
Average Error: 0.69%
Exhaustive
Data cache
2kB 16B/line 1-way
2kB 16B/line 1-way
2kB 64B/line 1-way
2kB 64B/line 1-way
4kB 16B/line 2-way
4kB 16B/line 2-way
8kB 16B/line 2-way
8kB 16B/line 2-way
Average Error: 4.75%
Average Error: 4.75%
Runtime: 835,000 ms
Runtime: 991 ms
Speed up: 800x
22/25
Two-level Configurable Cache
Instruction
cache
Processor
Data
cache
2nd Level
cache
Memory
Optimal cache
configuration
1.00
0.90
0.80
0.70
Tuning with 4 configs
Average energy increase: 3.36%
0.60
0.50
0.40
0.30
0.20
0.10
0.00
23/25
Where Else this Method Could be Applied?
 Other parameterized IP cores

Buses, I/O devices (Word size, bandwidth, etc).
 Parameterized Platforms,

Processor’s functional units, cache, bus, multiplier, IPs.
 Optimization for low energy, area, performance and others.
24/25
Conclusions
 The configurable cache subsetting problem was presented;
 A Data Mining algorithm for segmenting time series was
adapted to tackle the cache subsetting problem;
 Subsetting the Two-level cache configuration space, keeping
the average energy increase under 5% takes:


Keogh’s heuristic: less than 1 minute
Exhaustively: around 53 hours.
 Thanks to the versatility of the proposed method, our next
step is to apply it for Platform customization.
Thank you !
25/25