Transcript
Compilation Techniques for the Management and Optimization of the Energy Consumption of VLIW Architectures
Doctoral Thesis
Gilles POKAM*
July 15, 2004
*CIFRE funding from STMicroelectronics
1
Low Power Compilation Techniques
on VLIW Architectures
Ph.D. Thesis
Gilles POKAM*
July 15, 2004
*Thesis funded by STMicroelectronics
2
Motivation
• root causes of performance increase:
  – higher clock frequency
    · a growth rate of ~30% every two years
    · makes programs run faster
  – higher integration density
    · process scaling following Moore's law
    · increases architecture complexity
• power consumption is quickly becoming a limiting factor
3
Illustration of power density growth for general purpose systems
[Figure: power density (W/cm², log scale from 1 to 10000) versus year (1970 to 2010) for Intel processors from the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® and P6; by 2004 power density has passed the 'hot plate' level and is trending toward that of a nuclear reactor]
4
Power as a design cost constraint in embedded systems
• embedded system examples:
  – PDAs, cell phones, set-top boxes, etc.
• key factors affecting design cost include:
  – average energy (battery autonomy)
  – heat dissipation (packaging cost)
  – peak power (component reliability)
• in this thesis we are concerned with the total power consumption
5
Agenda
• Motivation
• Thesis objectives
• Program analysis
• Power consumption
• ILP compilation analysis
• Adaptive cache strategy
• Adaptive processor data-path
• Conclusions
6
The goals of this thesis
• to understand the energy issues involved when compiling for performance on VLIW architectures
• to come up with hardware/software solutions that improve energy efficiency
7
Why VLIW architectures?
• popular in embedded systems
  – Philips TriMedia processor
  – Texas Instruments TMS320C62xx
  – Lx processor (HP/STMicroelectronics)
• provide a power/performance alternative to general purpose systems
  – statically scheduled processor
  – the compiler is responsible for extracting instruction-level parallelism (ILP)
8
Research methodology
• our analysis standpoint lies in the compiler
  – we therefore consider program analysis as a basis for exploring energy reduction techniques
• power consumption also depends on the underlying micro-architecture
  – we also consider matching the hardware and the software to reduce energy consumption
9
Thesis contributions
1. Program analysis
   – a methodology for characterizing the dynamic behavior of programs at compile time
2. VLIW energy issues
   – a heuristic for understanding the energy issues involved when compiling for ILP
3. Hardware/software matching
   – adaptive compilation schemes targeting
     1. the cache subsystem
     2. the processor data-path
10
Thesis experimental environment
Lx VLIW processor:
• 4-issue width
• 64 GPRs, 8 CBRs
• 4 ALUs, 2 MULs, 1 LSU, 1 BU
• 32KB 4-way data cache, 32B data cache block size
• 32KB 1-way instruction cache, 64B instruction cache line size
• power model provided by STMicroelectronics
Benchmarks:
• MiBench suite (e.g. fft, gsm, susan …)
• MediaBench suite (e.g. mpeg, epic …)
• PowerStone suite (e.g. summin, whetstone, v42bis …)
11
Agenda
• Motivation
• Thesis objectives
• Program analysis
• Power consumption
• ILP compilation analysis
• Adaptive cache strategy
• Adaptive processor data-path
• Conclusions
12
Why do we need to analyze programs?
• knowledge of the dynamic behavior of a program is essential to determine which program regions may benefit most from an optimization
• programs tend to execute as a series of phases, each phase having a varying dynamic behavior [Sherwood and Calder, 1999]
• a phase can be viewed as a program path which occurs repeatedly
• exposing the most frequently executed program paths, i.e. hot paths, to the compiler may help discriminate among power/performance optimizations
13
Our approach to program path analysis
• whole-program level instrumentation ([Larus, PLDI 2000]) with a main focus on basic block regions
• a signature to differentiate among dynamic instances of the same region
• program paths are processed with a suffix array to detect all occurrences of repeated sub-paths
• heuristics to select hot paths among the sub-paths that appear repeatedly in the trace
14
Approach overview: detecting occurrences of repeated sub-paths
[Figure: dynamic region signatures extracted from the trace are fed into a suffix array]
Suffix sorting algorithm based on KMR to detect all occurrences of repeated sub-paths [Karp, Miller and Rosenberg, 1972]
15
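To make the detection step concrete, here is a minimal sketch (not the thesis tool, which is KMR-based and operates on whole-program traces of region signatures): it sorts the suffixes of a basic-block trace and reports every sub-path that is a common prefix of adjacent suffixes, i.e. every sub-path occurring at least twice. The trace contents and the minimum length are illustrative assumptions.

```python
# Minimal sketch: detect repeated sub-paths in a trace of basic-block ids
# with a suffix array. Illustrative only.

def suffix_array(trace):
    """Indices of the suffixes of `trace`, sorted lexicographically."""
    return sorted(range(len(trace)), key=lambda i: trace[i:])

def lcp(a, b):
    """Length of the longest common prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def repeated_subpaths(trace, min_len=2):
    """Sub-paths of length >= min_len that occur at least twice: every such
    path is a common prefix of two adjacent suffixes in suffix-array order."""
    sa = suffix_array(trace)
    found = set()
    for i in range(len(sa) - 1):
        k = lcp(trace[sa[i]:], trace[sa[i + 1]:])
        for length in range(min_len, k + 1):
            found.add(tuple(trace[sa[i]:sa[i] + length]))
    return found

# Example trace of basic-block ids (hypothetical hot loop plus a side exit).
trace = ["B1", "B2", "B3", "B1", "B2", "B3", "B4", "B1", "B2", "B3"]
for path in sorted(repeated_subpaths(trace), key=len, reverse=True):
    print(" -> ".join(path))
```

A production implementation would replace the naive sort above with an O(n log n) suffix-sorting algorithm such as KMR.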
Hot path selection
• not all repeated sub-paths are of interest:
  – local coverage: captures the local behavior of a region
  – global coverage: captures the weight of a region in the program
  – reuse distance: average distance between consecutive accesses to a region
16
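As a rough illustration of how these three criteria could be computed from a trace of region occurrences, here is a hedged sketch; the exact definitions used in the thesis are not reproduced here, so the formulas below (one-occurrence weight for local coverage, total weight for global coverage, average gap for reuse distance) and the example numbers are assumptions.

```python
# Sketch: local coverage, global coverage and reuse distance of a candidate
# hot region, computed over a flat trace of executed regions.

def coverage_and_reuse(trace, instr_per_region, region):
    """trace: region ids in execution order.
    instr_per_region: dict id -> instructions executed per occurrence.
    region: the candidate region id."""
    total_instr = sum(instr_per_region[r] for r in trace)
    hits = [i for i, r in enumerate(trace) if r == region]

    # Global coverage: weight of all occurrences of the region in the program.
    global_cov = 100.0 * len(hits) * instr_per_region[region] / total_instr
    # Local coverage: weight of a single occurrence (local behavior).
    local_cov = 100.0 * instr_per_region[region] / total_instr
    # Reuse distance: average number of other regions executed between
    # two consecutive occurrences of the candidate region.
    gaps = [b - a - 1 for a, b in zip(hits, hits[1:])]
    reuse = sum(gaps) / len(gaps) if gaps else 0.0
    return local_cov, global_cov, reuse

trace = ["R1", "R2", "R1", "R3", "R1", "R2", "R1"]
instr = {"R1": 120, "R2": 40, "R3": 300}
print(coverage_and_reuse(trace, instr, "R1"))
```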
Results summary
Bench      Percentage of   Local coverage     Global coverage    Reuse distance
           hot paths       (% exec. instr.)   (% exec. instr.)   (# of BBs)
dijkstra       2.81            0.09                47                 1.74
adpcm          5.88          < 0.005               90                 0.00
blowfish      27.01            0.06                24                85.00
fft           11.7           < 0.005                7                 4.21
sha           20.0             0.06                72                 0.75
bmath         15.22            0.05                37                19.21
patricia       5.85            0.15                65                24.84
17
Agenda
• Motivation
• Thesis objectives
• Program analysis
• Power consumption
• ILP compilation analysis
• Adaptive cache strategy
• Adaptive processor data-path
• Conclusions
18
Back to basics …
Power = ½ · C_L · V_dd² · a · f + V_dd · I_leakage
(first term: dynamic power; second term: static power)
Technology trend [SIA, 1999]: in current technology dynamic power accounts for ~90% and static power for ~10% of the total; in future technologies the split approaches ~50% / ~50%.
19
Software opportunities for power reduction
P = ½ · C_L · V_dd² · a · f + V_dd · I_leak
Dynamic power, common techniques:
• clock gating for activity reduction
• power supply voltage scaling
• frequency scaling
Static power, common techniques:
• power supply voltage scaling
20
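As a quick numerical illustration of the two terms and of which knob each software technique touches, a tiny sketch follows; the capacitance, voltage, activity, frequency and leakage values are arbitrary placeholders, not values from the ST power model.

```python
# Sketch of the CMOS power split: dynamic = 1/2 * C_L * Vdd^2 * a * f,
# static = Vdd * I_leak. All values below are arbitrary placeholders.

def power(c_l, vdd, a, f, i_leak):
    dynamic = 0.5 * c_l * vdd**2 * a * f
    static = vdd * i_leak
    return dynamic, static

# Baseline.
dyn, sta = power(c_l=1e-9, vdd=1.2, a=0.3, f=400e6, i_leak=20e-3)
print(f"baseline    : dynamic={dyn:.3f} W, static={sta:.3f} W")

# Clock gating mainly lowers the activity factor a ...
dyn_cg, _ = power(1e-9, 1.2, a=0.15, f=400e6, i_leak=20e-3)
# ... while voltage scaling lowers both terms (quadratically for dynamic).
dyn_vs, sta_vs = power(1e-9, vdd=1.0, a=0.3, f=400e6, i_leak=20e-3)
print(f"clock gating: dynamic={dyn_cg:.3f} W")
print(f"Vdd scaling : dynamic={dyn_vs:.3f} W, static={sta_vs:.3f} W")
```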
Agenda
• Motivation
• Thesis objectives
• Program analysis
• Power consumption
• ILP compilation analysis
• Adaptive cache strategy
• Adaptive processor data-path
• Conclusions
21
Problem summary
• we want to understand under which conditions compiling for ILP may degrade energy
• the main motivation comes from the relation between power growth and ILP:
    Power ~ IPC × complexity
  where the compiler drives IPC and the architecture determines complexity
• for the rest of this study, assume the micro-architecture cannot be modified (fixed micro-architecture); only the VLIW compiler side varies
22
Metric used
• energy and performance must be considered jointly [Horowitz] to balance program slowdown against energy reduction
• performance-to-energy ratio (PTE):
    PTE = IPC^N × (1 / E_BB),  with E_BB = Energy_BB × Cycle_BB
  (IPC^N is the performance term, 1/E_BB the energy term)
Goals
• compare two instances of the same program at the software level
• lay emphasis on the range of performance values (IPC) that may degrade energy
  – for a given ILP transformation, if the energy growth outweighs the obtained performance improvement, the resulting PTE is degraded
23
Energy Model
• the execution of a bundle w_n dissipates an energy EPB_wn:
    EPB_wn = E_c + IPC_wn × E_op + m × p × E_s + l × q × E_miss
  with E_c the energy base cost, IPC_wn × E_op the energy due to the execution of the bundle, m × p × E_s the energy due to D-cache misses, and l × q × E_miss the energy due to I-cache misses
• for loop-intensive kernels the same expression is considered:
    EPB_wn = E_c + IPC_wn × E_op + m × p × E_s + l × q × E_miss
24
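A minimal sketch of the bundle energy model as reconstructed above; the coefficient values are made up, and the reading of m × p as the D-cache miss term and l × q as the I-cache miss term follows the slide labels.

```python
# Sketch: energy per bundle EPB_wn = E_c + IPC_wn*E_op + m*p*E_s + l*q*E_miss
# (base cost + executed operations + D-cache miss term + I-cache miss term).
# All coefficients are illustrative placeholders, not the ST power model.

def epb(ipc_wn, m, p, l, q, e_c=1.0, e_op=0.4, e_s=2.5, e_miss=6.0):
    return e_c + ipc_wn * e_op + m * p * e_s + l * q * e_miss

# A 4-issue bundle with full ILP and low miss ratios ...
print(epb(ipc_wn=4.0, m=1, p=0.02, l=1, q=0.01))
# ... versus a half-empty bundle with a higher D-cache miss ratio.
print(epb(ipc_wn=2.0, m=1, p=0.10, l=1, q=0.01))
```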
We consider the hyperblock transformation
What is a hyperblock?
• a predicated BB constructed out of a region of BBs (e.g. a hammock region R with a branch br becomes a hyperblock H)
• the effect of eliminating the branch instructions is corrected by adding compensation code
Why the hyperblock?
• most optimizations do not generate extra work, so optimizing for performance = optimizing for power
• the hyperblock increases the instruction count; how does this affect energy?
25
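To illustrate the transformation, below is a toy, hypothetical if-conversion: a hammock (a branch over two small blocks) is folded into a single predicated block, and the dynamic operation count grows because both paths are now always issued, which is exactly the energy question raised above. The tiny IR and the predicate notation are inventions for the example, not the Lx compiler's IR.

```python
# Toy if-conversion of a hammock region R into a hyperblock H.
# Instructions are plain tuples; a predicated instruction is represented by
# wrapping the original instruction with the guard it executes under.

hammock = {
    "entry": [("cmp", "p", "x", "0"), ("br", "p", "then", "else")],
    "then":  [("add", "y", "y", "1"), ("shl", "y", "y", "2")],
    "else":  [("sub", "y", "y", "1"), ("shr", "y", "y", "2")],
}

def if_convert(region):
    """Merge then/else into one predicated block, dropping the branch."""
    guard = region["entry"][0][1]            # predicate produced by the cmp
    hyper = [ins for ins in region["entry"] if ins[0] != "br"]
    hyper += [("pred", guard, True, ins) for ins in region["then"]]
    hyper += [("pred", guard, False, ins) for ins in region["else"]]
    return hyper

H = if_convert(hammock)
dyn_before = len(hammock["entry"]) + len(hammock["then"])  # one path taken
dyn_after = len(H)                                         # both paths issued
print(f"dynamic ops per execution: {dyn_before} -> {dyn_after}")
# The branch disappears, but both paths are always issued, so the dynamic
# operation count grows: this is the energy trade-off analyzed next.
```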
Tradeoff analysis
• transformation heuristic: form the hyperblock H only if PTE_H ≥ PTE_R, which holds when
    IPC_H ≥ (a × IPC_R) / (b + c × IPC_R)
  where c captures the impact of the added instructions; it is proportional to E_op and is computed from the region parameters:
    – m: number of BBs in the hammock region R
    – N: number of operations in R or H
    – n: number of bundles in R or H
    – f: execution frequency of R or H
• influence of c on IPC_H:
    – c < 0: extra work due to compensation code (H must reach a much higher IPC)
    – c = 0: no degradation, no benefit
    – c > 0: optimal configuration
26
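A small sketch of how the break-even test could be applied by the transformation heuristic: given coefficients a, b and c of the condition reconstructed above, check whether the hyperblock's IPC clears the PTE break-even point. The coefficient values are arbitrary placeholders; in the thesis, c is derived from the energy model and the region parameters listed above.

```python
# Sketch: break-even test for hyperblock formation, using the condition
# IPC_H >= (a * IPC_R) / (b + c * IPC_R). The coefficients a, b, c below
# are placeholders, not values derived from the thesis energy model.

def hyperblock_profitable(ipc_r, ipc_h, a, b, c):
    denom = b + c * ipc_r
    if denom <= 0:
        # With a sufficiently negative c the required IPC_H is unbounded:
        # the extra compensation work cannot be paid back.
        return False
    return ipc_h >= (a * ipc_r) / denom

print(hyperblock_profitable(ipc_r=1.6, ipc_h=2.4, a=1.0, b=1.0, c=0.2))   # True
print(hyperblock_profitable(ipc_r=1.6, ipc_h=1.7, a=1.0, b=1.0, c=-0.5))  # False
```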
Conclusions
• the heuristic shows a 17% improvement on a small subset of the PowerStone benchmarks
• improvement on all benchmarks is restricted due to:
  – available ILP: for a given IPC value, the ILP transformation must result in a much higher IPC (e.g. case c < 0)
  – machine overhead: a small IPC improvement has no impact on energy whenever the machine overhead dominates (e.g. c <= 0)
• suggested research directions:
  – better usage of the available ILP via knowledge of phase execution behavior (hot program paths)
  – better management of the machine overhead by matching the architecture to the requirements of a program region
27
Agenda
• Motivation
• Thesis objectives
• Program analysis
• Power consumption
• ILP compilation analysis
• Adaptive cache strategy
• Adaptive processor data-path
• Conclusions
28
Why cache?
• a highly power-consuming component (both dynamic and static)
  – typically 80% of the total transistor count
  – occupies about 50% of the total chip area
• usually appears with a monolithic configuration in embedded systems (per-application configuration)
• varying program phase behavior suggests that no single best cache size exists for a given application
  – match the cache configuration to program behavior on a per-phase basis
  – reducing the number of active and passive transistors reduces both dynamic and static power
29
Two major proposals
• Albonesi [MICRO'99]: selective cache ways
  – disable/enable individual cache ways (e.g. a 32K 4-way cache reduced to a 16K 2-way cache)
  – problem: disabling cache ways causes loss of data; it is impossible to recover the previous state of the cache cells!
• Zhang et al. [ISCA'03]: way-concatenation
  – reduce cache associativity while still maintaining the full cache capacity (e.g. ways of a 32K 4-way cache concatenated into a 32K 2-way cache)
  – problem: data coherency problem across different cache configurations!
30
Program region analysis: summin (MiBench)
[Figure: per-region cache configurations for summin (configs 0 to 3), covering 32K 4-way, 32K 2-way, 32K 1-way, 16K 2-way, 16K 1-way and 8K 1-way]
• program regions are sensitive to cache size and associativity
• key idea: vary the associativity and the size according to the characteristics of the program regions
31
Solution for varying the cache size
• how to keep the data?
  – unaccessed cache ways are put in a low power mode (drowsy mode)
  – drowsy mode [Flautner ISCA'02] scales down V_dd to preserve the state of the memory cells
• advantage:
  – static power is reduced as a side effect of scaling down V_dd
• disadvantage:
  – 1 cycle delay to wake up a drowsy cache way!
32
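A hedged behavioral sketch of the drowsy-way idea: unaccessed ways keep their contents at reduced V_dd, and the first access to a drowsy way pays a one-cycle wake-up penalty. The cache geometry and the access sequence are illustrative, and tags/replacement are omitted to keep the sketch short.

```python
# Behavioral sketch of drowsy cache ways: data is preserved, but an access
# to a drowsy way costs one extra cycle to wake it up.

class DrowsyWays:
    def __init__(self, n_ways, active_ways):
        self.drowsy = [w >= active_ways for w in range(n_ways)]
        self.extra_cycles = 0

    def access(self, way):
        if self.drowsy[way]:
            self.extra_cycles += 1     # one-cycle wake-up penalty
            self.drowsy[way] = False   # way is awake; contents were preserved

    def shrink(self, active_ways):
        """Put the upper ways back into drowsy (low-Vdd) mode."""
        for w in range(active_ways, len(self.drowsy)):
            self.drowsy[w] = True

cache = DrowsyWays(n_ways=4, active_ways=2)    # phase needs a 2-way view
for way in [0, 1, 1, 3, 3, 0]:                 # way 3 is drowsy at first touch
    cache.access(way)
print("wake-up penalty cycles:", cache.extra_cycles)   # 1
```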
Solution for varying the degree of associativity
• maintain data coherency via cache line invalidation
  – the tag array is kept active to monitor write accesses
  – the cache controller invalidates cache lines holding a stale copy on a write access
• we save dynamic energy because
  – lower associativity caches access fewer memory cells than higher associativity ones (reduction of the switching activity a)
33
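A minimal sketch of the coherency rule: the tag array stays active for every way, and a write performed while running at reduced associativity invalidates any stale copy of the same line held in a currently inactive way. The indexing and line bookkeeping are simplifying assumptions.

```python
# Sketch: keep data coherent across associativity changes by invalidating
# stale copies in inactive ways on a write. Indexing is deliberately simple.

class WayConcatCache:
    def __init__(self, n_ways, n_sets):
        self.n_sets = n_sets
        # lines[way][set] = (tag, valid)
        self.lines = [[(None, False)] * n_sets for _ in range(n_ways)]
        self.active_ways = n_ways

    def write(self, addr):
        s, tag = addr % self.n_sets, addr // self.n_sets
        way = tag % self.active_ways           # placement among active ways
        self.lines[way][s] = (tag, True)
        # Tag arrays of inactive ways are still monitored: drop stale copies.
        for w in range(self.active_ways, len(self.lines)):
            if self.lines[w][s] == (tag, True):
                self.lines[w][s] = (tag, False)

cache = WayConcatCache(n_ways=4, n_sets=8)
cache.write(0x13)              # written while 4 ways are active (lands in way 2)
cache.active_ways = 2          # reconfigure to a 2-way view
cache.write(0x13)              # same line rewritten: stale copy is invalidated
print(sum(v for way in cache.lines for (_, v) in way), "valid line(s)")  # 1
```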
Results summary
• three cache designs are compared:
  1. no adaptive cache scheme
  2. adaptation on a per-application basis
  3. adaptation on a per-phase basis (our scheme)
• 6 out of 8 applications are sensitive to cache size and associativity, resulting in a dynamic power reduction of up to 12%
• static energy is reduced drastically, by 80% on average across all benchmarks
• performance can suffer from the one-cycle wake-up delay: two applications show ~30% degradation, of which 65% is due to the one-cycle delay needed to wake up a drowsy cache way
  – a better cache way allocation policy can improve this result
34
Agenda
• Motivation
• Thesis objectives
• Program analysis
• Power consumption
• ILP compilation analysis
• Adaptive cache strategy
• Adaptive processor data-path
• Conclusions
35
Motivation
• 32-bit embedded processors are becoming popular
• confluence of integer scalar programs and multimedia applications on modern embedded processors
• multimedia applications tend to operate on 8-bit (e.g. video) or 16-bit (e.g. audio) data
  – typically 50% of instructions in MediaBench [Brooks et al., HPCA'99]
• detecting the occurrence of these narrow-width operands on a per-region basis may allow
  – matching the processor data-path width to the bit-width of a program region
36
Techniques to detect narrow-width operands
• Dynamic approach
  – detection on a cycle-by-cycle basis by means of hardware (e.g. zero detection logic)
  – clock-gate the insignificant bytes to save energy
  – problem: efficient for general purpose systems, but the required hardware cost is often not affordable for embedded systems
  – related work includes Brooks et al., HPCA'99 and Canal et al., MICRO'00
• Compiler approach
  – use static data flow analysis to compute bit-width value ranges for program variables
  – re-encode program variables with a smaller bit-width to save energy
  – problem: static analysis limits the opportunity for detecting more narrow-width operands, and re-encoding must preserve program correctness (too conservative!)
  – related work includes Stephenson et al., PLDI 2000
37
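For intuition, a hedged sketch contrasting the two detection styles on the same variable: a dynamic measure of the width actually needed by each runtime value (what zero-detection logic approximates) versus a static bound derived from a declared range, which is necessarily more conservative. The values, the range and the simplified width function are illustrative.

```python
# Sketch: dynamic vs static narrow-width detection on a value stream.
# Values and the declared range are illustrative.

def needed_bits(value):
    """Bits needed to hold a runtime value (two's complement, simplified)."""
    return max(value.bit_length() + (1 if value < 0 else 0), 1)

samples = [3, 200, -7, 90, 15, 255]          # runtime values of a variable
dynamic_width = max(needed_bits(v) for v in samples)

declared_range = (-32000, 32000)             # what static analysis proves
static_width = max(needed_bits(b) for b in declared_range)

print("dynamic:", dynamic_width, "bits")     # narrow enough for a narrow mode
print("static :", static_width, "bits")      # conservative: 16-bit at best
```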
Program region analysis: adpcm (BB granularity)
• the occurrence of dynamic narrow-width operands at the basic block level can be high
• key idea:
  – adapt the underlying processor data-path width to the dynamic bit-width of the region
38
Our approach
• Dynamic approach side:
  – avoid relying on hardware support to detect the occurrences of narrow-width operands
  – take advantage of runtime information to expose dynamic narrow-width operands to the compiler
• Compiler approach side:
  – avoid relying on static data flow analysis to discover bit-width ranges (too conservative!)
  – use the compiler instead to decide when to switch from normal to narrow-width mode and vice-versa (reconfiguration instructions)
• result: a speculative narrow-width execution mode
39
Speculative narrow-width execution: micro-architecture
• Recovering scheme
  – simple comparison logic at the execute stage
  – upon a miss-speculation, the pipeline is flushed and the instruction is replayed with the correct mode
  – the recovery scheme may have an impact on both performance and energy
• Static energy saving
  – unused register file slices are put in a low-power mode (drowsy mode) to reduce static energy
• Dynamic energy saving
  – an adaptive register file width that can be viewed as either an 8-, 16- or 32-bit register file
  – data-path clock-gating when a narrow execution mode is encountered (pipeline latches, ALU)
[Figure: pipeline data-path with a slice-enable signal selecting the 8/16/32 mode for the register file slices (8 + 8 + 16 bits), the bypass network and write-back]
40
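A behavioral sketch of the speculation cost that the results slide quantifies: each instruction is issued in the width mode chosen for its region, and a miss-speculation (a result wider than the mode) triggers a pipeline flush and a replay with a penalty. The penalty values and the operand stream are illustrative.

```python
# Sketch: cost of speculative narrow-width execution. An instruction issued
# in 8/16-bit mode whose result needs more bits triggers a flush + replay.

def run(widths_needed, region_mode, flush_penalty):
    cycles = 0
    for w in widths_needed:
        cycles += 1                       # issue in the speculated mode
        if w > region_mode:               # comparison logic detects a miss
            cycles += flush_penalty       # flush the pipeline ...
            cycles += 1                   # ... and replay in full-width mode
    return cycles

stream = [8, 8, 16, 8, 32, 8, 8, 16]      # bits actually needed per result
print("16-bit mode, penalty 5 :", run(stream, 16, 5))
print("16-bit mode, penalty 25:", run(stream, 16, 25))
print("32-bit mode (no spec)  :", run(stream, 32, 5))
```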
Speculative narrow-width execution: compiler support
• regions are rarely composed of narrow-width operands only …
• address instructions (AI) usually require a larger bit-width; split each AI into
  – the address calculation
  – the memory access via an accumulator register
• schedule instructions within a region so that those having a 32-bit-wide operand are grouped together
• insert reconfiguration instructions at each frontier of a region
41
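A toy sketch of the last step: walking a region-annotated instruction list and inserting a reconfiguration instruction at every frontier where the selected width mode changes. The instruction strings and the set_mode pseudo-op are made up for illustration.

```python
# Sketch: insert reconfiguration instructions at region frontiers.
# `set_mode` is a made-up pseudo-op; each region carries the width chosen
# by the profile-driven analysis.

regions = [
    ("loop1", 8,  ["ld r1", "add r2, r1, 1", "st r2"]),
    ("addr",  32, ["mul r3, r4, 4", "add r3, r3, r5"]),   # address calculation
    ("loop2", 16, ["ld r6", "sub r6, r6, r7", "st r6"]),
]

def insert_reconfig(regions):
    out, current_mode = [], 32            # processor starts in full-width mode
    for name, mode, body in regions:
        if mode != current_mode:
            out.append(f"set_mode {mode}   ; frontier of region {name}")
            current_mode = mode
        out.extend(body)
    return out

for insn in insert_reconfig(regions):
    print(insn)
```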
Results summary
• the impact of the recovery scheme varies with the miss-speculation penalty and the availability of narrow-width operands
  – with a 5-cycle penalty and 80% narrow-width availability, programs show no performance degradation
  – with a 25-cycle penalty and 60% narrow-width availability, IPC degradation reaches 30%
• overall, on the 13 applications from PowerStone, the data-path dynamic energy is reduced by 17% on average
• we achieve a 22% reduction of the register file static energy
42
Agenda
• Motivation
• Thesis objectives
• Program analysis
• Power consumption
• ILP compilation analysis
• Adaptive cache strategy
• Adaptive processor data-path
• Conclusions
43
Conclusions
• power consumption is a matter of both software and hardware
  – software, because program execution causes switching transitions (dynamic power)
  – hardware, because power consumption grows with architecture complexity
• hardware/software techniques must be used jointly to provide an effective basis for reducing power consumption
• this thesis has provided arguments in favor of a profile-driven, compiler-architecture symbiosis approach to reduce power consumption by
  – detecting the occurrences of program phases/regions
  – discriminating the optimizations that best benefit a phase/region
  – adapting the micro-architecture w.r.t. the behavior of a phase/region
44
Future work
• analogy between ILP and DLP
  – investigate the energy issues involved with SIMD compilation
  – need for a SIMD energy model
  – measure the impact of overhead instructions (pack/unpack)
• catching different program behaviors with a hot path signature
  – will allow us to study the interplay of different reconfiguration techniques to save energy
• energy impact of SIMD compilation with an adaptive I-cache
• effectiveness of SIMD compilation for exploiting narrow-width operands (speculative vectorization techniques?)
45