Single-pass Cache Optimization
Clive Butler and Ruofan Yang
Clive Butler
Introduction of Problem
• Embedded systems execute a single application or a class of applications repeatedly.
• An emerging methodology for designing embedded systems utilizes configurable processors.
• Configurable cache parameters include size, associativity, and line size.
• An energy model and an execution time model are developed to find the best cache configuration for the given embedded application.
• Current processor design methodologies rely on reserving a large enough chip area for caches while conforming to area, performance, and energy cost constraints.
• A customized cache allows designers to meet tighter energy consumption, performance, and cost constraints.
Introduction of Problem
• In existing low-power processors, cache memory is known to consume a large portion of the on-chip energy.
• The cache consumes up to 43% to 50% of the total system power of a processor.
• In embedded systems where a single application or a class of applications is repeatedly executed on a processor, the memory hierarchy can be customized so that an optimal configuration is achieved.
• The right choice of cache configuration for a given application can have a significant impact on overall performance and energy consumption.
Introduction of Problem
• Estimating the hit and miss rates of a single configuration is fairly easy using tools such as Dinero.
• Doing so for many cache sizes, associativities, and line sizes, however, is enormously time consuming: using Dinero to estimate the miss rate of each configuration means a large program trace must be repeatedly read and evaluated.
Dinero
• Dinero is a trace-driven cache simulator.
• Simulations are repeatable.
• One can simulate either a unified cache (mixed, data and instructions cached together) or separate instruction and data caches.
• Cheaper than evaluating in hardware.
Dinero
• A din record is a two-tuple: label address.
• Cache parameters are set by command-line options.
• Record labels: 0 = read data, 1 = write data, 2 = instruction fetch, 3 = escape record, 4 = escape record (causes a cache flush).
• Dinero uses the priority stack method of memory hierarchy simulation to increase flexibility and improve simulator performance for highly associative caches.
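The two-tuple record format above can be sketched as a small parser. This is a hedged illustration, not Dinero's actual reader; the sample records and hex address format are assumptions for the example.

```python
# Minimal sketch of reading din-style records: each record is a
# "label address" pair, with the label values from the slide
# (0 read, 1 write, 2 ifetch, 3/4 escape). The sample records
# below are illustrative, not taken from a real trace.
LABELS = {0: "read", 1: "write", 2: "ifetch", 3: "escape", 4: "flush"}

def parse_din(lines):
    """Yield (kind, address) tuples from din-style records."""
    for line in lines:
        label, addr = line.split()
        yield LABELS[int(label)], int(addr, 16)

trace = ["2 100", "0 2a0", "1 2a4"]
print(list(parse_din(trace)))  # → [('ifetch', 256), ('read', 672), ('write', 676)]
```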
Method 1: Tree-based Method
Introduction
• Presents a methodology to rapidly and accurately explore the cache design space.
• This is done by estimating cache miss rates for many different cache configurations simultaneously, and by investigating the effect of different cache configurations on the energy and performance of a system.
• Simultaneous evaluation can be performed rapidly by taking advantage of the high correlation between the cache behavior of different cache configurations.
ASP-DAC paper
General Simulation Process
[Figure: the cache address is split into tag and index bits (m(max)…m(min)…0). Step 1: use the index bits to look up an array that stores tree addresses. Step 2: go to that tree address and traverse the tree. Step 3: find the node and follow its linked list. Step 4: look for a matching tag. Misses are recorded in a cache miss table indexed by (L, N, A).]
Tree example
[Figure: a forest of binary trees over the cache address bits; assume each forest has a fixed line size. Successive tree levels correspond to cache sizes 2, 4, and 8: each level moves one more low-order bit of the example address 1010 into the index, e.g. 101(0), 10(10), 1(010), with the index bits shown in parentheses. The index bits are used to find the path (k); the rest of the address is used as the tag.]
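The nesting of tag/index splits that the tree encodes can be sketched directly. This is an illustrative helper (the name `tag_index` is mine, not from the paper); it reproduces the splits of the 4-bit address 1010 for caches of 2, 4, and 8 sets.

```python
# Sketch of the tree idea: for a fixed line size, doubling the cache
# size moves one more low-order address bit from the tag into the
# index, so the splits for all cache sizes nest along one tree path.
def tag_index(addr, index_bits):
    tag = addr >> index_bits              # high-order bits: tag
    idx = addr & ((1 << index_bits) - 1)  # low-order bits: path k
    return tag, idx

addr = 0b1010
for bits in (1, 2, 3):  # caches of 2, 4, and 8 sets
    tag, idx = tag_index(addr, bits)
    print(f"{1 << bits} sets: tag={tag:b} index={idx:0{bits}b}")
```

Running this prints tag=101/index=0, tag=10/index=10, and tag=1/index=010, matching the node labels in the figure.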
Linked list: set-associative lookup
[Figure: each node's linked list is ordered from the most recently used element to the least recently used element. A tag matching the head is a hit for associativity 1; a match at the second position is a hit only for associativity ≥ 2; a match at the third or fourth position only for associativity 4; anything beyond is a miss. Miss counts are tallied in a table indexed by (L, N, A), e.g. L = 1, N = 4, A = 1. The rest of the address is used as the tag.]
Linked list: LRU update
[Figure: after a reference, the matched element is moved to the head of the linked list (most recently used element) and the remaining elements shift toward the tail (least recently used element); the miss-count table indexed by (L, N, A) is updated accordingly.]
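The two linked-list figures can be summarized in one sketch. This is a hypothetical illustration of the trick, not the paper's implementation: each set keeps a single MRU-to-LRU list sized for the largest associativity, and the depth at which a tag is found tells every smaller associativity whether it would have hit. The tag sequence is invented for the example.

```python
# A match at 0-based depth p is a hit for every associativity w > p,
# so recording hits per depth lets one pass serve all associativities.
def access(lru, tag, max_assoc, depth_hits):
    if tag in lru:
        p = lru.index(tag)
        lru.remove(tag)
        depth_hits[p] += 1          # hit observed at depth p
    elif len(lru) == max_assoc:
        lru.pop()                   # evict the least recently used
    lru.insert(0, tag)              # matched/new tag becomes MRU

depth_hits = [0] * 4                # depths 0..3 → assoc 1..4
lru = []
for tag in [0, 1, 0, 2, 1, 0]:      # illustrative tag sequence
    access(lru, tag, 4, depth_hits)

# hits for a w-way set = sum of the first w depth counters
print([sum(depth_hits[:w]) for w in (1, 2, 4)])  # → [0, 1, 3]
```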
Detailed Trace Example
Example specifications:
• Cache size (N) will vary from 32 (max) to 2 (min)
• Associativity (A) will vary from 4 (max) to 1 (min)
• Cache set count (M) will vary from 8 (max) to 1 (min)
• Assume a fixed line size (L)
Detailed Trace Example
Instruction trace (k | m):
1. 000000 => 0
2. 001000 => 8
3. 010000 => 16
4. 000000 => 0
5. 001000 => 8
6. 000000 => 0
7. 010000 => 16

Assoc. = 1:
L  N  M  Miss count
1  8  8  3
1  4  4  3
1  2  2  5
1  1  1  7

[Figure: the trace addresses 0, 8, and 16 placed in the trees for M = 1, 2, 4, and 8 sets; index bits such as 00, 01, 10 and 000, 001, 010 select the set at each level, with the per-reference misses annotated along each row.]
Detailed Trace Example
Instruction trace (k | m): same trace as above (0, 8, 16, 0, 8, 0, 16)

Assoc. = 2:
L  N   M  Miss count
1  16  8  3
1  8   4  3
1  4   2  3
1  2   1  6

[Figure: the same trace placed in the trees for M = 1, 2, 4, and 8 two-way sets.]
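As a cross-check on the two tables, the example trace can be run through a conventional per-configuration LRU simulation. This is my own sketch, not the paper's tree algorithm; block addresses follow the k | m split shown in the trace (the low three bits are the offset, so block = addr >> 3).

```python
# Per-configuration set-associative LRU simulation of the trace above.
def misses(blocks, num_sets, assoc):
    sets = [[] for _ in range(num_sets)]  # one MRU→LRU list per set
    count = 0
    for b in blocks:
        s = sets[b % num_sets]
        if b in s:
            s.remove(b)                   # hit: refresh to MRU
        else:
            count += 1
            if len(s) == assoc:
                s.pop()                   # evict the LRU block
        s.insert(0, b)
    return count

addrs = [0, 8, 16, 0, 8, 0, 16]
blocks = [a >> 3 for a in addrs]
print([misses(blocks, m, 1) for m in (8, 4, 2, 1)])  # → [3, 3, 5, 7]
print([misses(blocks, m, 2) for m in (8, 4, 2, 1)])  # → [3, 3, 3, 6]
```

The outputs match the miss-count columns of the Assoc. = 1 and Assoc. = 2 tables.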
ASP-DAC Results
• Uses benchmarks from MediaBench.
• This method is on average 45 times faster at exploring the design space compared to Dinero IV.
• Still maintains 100% accuracy.
Introduction: Table-based Method
• Two cache evaluation techniques, analytical modeling and execution-based evaluation, are used to evaluate the design space.
• SPCE presents a simplified yet efficient way to extract locality properties for an entire cache configuration design space in just one single pass.
• Includes related work, an overview of SPCE, properties of addressing-behavior analysis used to estimate the cache miss rate, the experiment, and the results.
Related Work
• Much research exists in this area, but it either needs multiple passes to explore all configurable parameters or employs large and complex data structures, which restricts its applicability.
• Algorithms for single-pass cache simulation examine a set of caches concurrently: Mattson; Hill and Smith; Sugumar and Abraham; Cascaval and Padua.
• Janapsatya et al. present a technique to evaluate all the different cache parameters simultaneously, but it was not designed with a hardware implementation in mind.
• This paper's methodology uses simple array structures, which are more amenable to a lightweight hardware implementation.
SPCE Overview
Definitions
• A time-ordered sequence of referenced addresses T[t] (t a positive integer) of length |T|, such that T[t] is the t-th address referenced.
• If T[ti] ≫ b = T[ti + d] ≫ b, then the addresses T[ti] and T[ti + d] are references to the same cache block of 2^b words.
• Define d as the delay: the number of unique cache references occurring between any two references where T[ti] ≫ b = T[ti + d] ≫ b.
Definitions
• Evaluate the locality in the sequence of addresses T[ti] of a running application by counting the occurrences where T[ti] ≫ b = T[ti + d] ≫ b and registering each in cell L(b, d) of the locality table (2^b is the block size, d is the delay).
Fully-Associative
• A fully-associative cache configuration is defined by the notation cj(b, n), where b defines the line size in terms of words (2^b) and n the total number of lines in the cache.
• The locality table L(b, d) provides an efficient way to estimate the cache miss rate of fully-associative caches.
Fully-Associative Example
[Figure: a sequence of addresses T = (0, 8, 16, 0, 8, 0, 16) at times t0…t6, shown alongside the block IDs for b = 0: (0, 1, 2, 0, 1, 0, 2). Arrows mark reuse delays such as d = 3 and d = 2. The resulting locality table for the trace, at b = 0, is: d = 1 → 0, d = 2 → 1, d = 3 → 3, d = 4 → 0.]
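The example's locality table can be reproduced with a short sketch. One assumption I am making explicit: to match the table in the figure, d is computed as the LRU stack distance, i.e. the number of distinct blocks referenced since the previous access to the same block, counting that block itself.

```python
# Sketch of filling one block-size slice of the locality table L(b, d).
from collections import Counter

def locality_table(blocks):
    stack, table = [], Counter()      # MRU element at the front
    for blk in blocks:
        if blk in stack:
            d = stack.index(blk) + 1  # 1-based stack distance
            table[d] += 1
            stack.remove(blk)
        stack.insert(0, blk)          # referenced block becomes MRU
    return table

blocks = [0, 1, 2, 0, 1, 0, 2]        # the trace from the example (b = 0)
t = locality_table(blocks)
print([t[d] for d in (1, 2, 3, 4)])   # → [0, 1, 3, 0]
```

A fully-associative cache of n lines hits whenever d ≤ n, so e.g. n = 4 gives 0 + 1 + 3 + 0 = 4 hits, i.e. 3 misses on the 7-reference trace.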
Set-Associative
• Most real-world cache devices are built as direct-mapped or set-associative structures.
• Because of conflicts, L cannot be used directly to estimate misses, so define s as the number of sets, independent of the associativity; for a direct-mapped cache, the set size is 1 and s = n.
• To analyze the cache conflicts, we build a conflict table Kα(b, s) (b is the block size, s the number of sets), which is composed of α layers, one for each associativity explored.
Set-Associative
• The value stored in each element of the table Kα(b, s) indicates how many times the same block (of size 2^b) is repeatedly referenced and results in a hit.
• A given cache configuration with associativity level w is capable of overcoming no more than w − 1 mapping conflicts.
• The number of cache hits is determined by summing the cache hits from layer α = 1 up to the respective layer α = w, where w refers to the associativity.
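The layer summation above can be sketched in a few lines. The layer counts used here are illustrative (reusing the numbers from the fully-associative example), not results from the paper.

```python
# Each layer α of the conflict table holds the hits observed at LRU
# depth α; a w-way cache collects the hits of layers 1..w.
def hits_for_assoc(layer_hits, w):
    return sum(layer_hits.get(a, 0) for a in range(1, w + 1))

layer_hits = {1: 0, 2: 1, 3: 3, 4: 0}  # illustrative layer counts
print([hits_for_assoc(layer_hits, w) for w in (1, 2, 4)])  # → [0, 1, 4]
```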
Algorithm Implementation
Experiment Setup
• Implemented SPCE as a standalone C++ program that processes an instruction address trace file; instruction address traces were gathered for 9 arbitrarily chosen benchmarks from Motorola's PowerStone benchmark suite using SimpleScalar.
• Since 64 bytes is the largest block size in the design space utilized, bmax = 3; smax is defined by the configuration with the maximum number of sets in the design space.
• Examined performance for our suite of benchmarks with SPCE and also with a very popular trace-driven cache simulator (DineroIV).
Results
• Compared the performance of SPCE and DineroIV across the 45 cache configurations.
Conclusion
• Both the tree-based method and the table-based method (SPCE) ease cache miss rate estimation and reduce simulation time.
• Compared to the DineroIV method, the average speedup is around 30 times.
• Our future work includes extending the design space exploration by considering a second level of cache.