Download Part II

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Islanding wikipedia , lookup

Stray voltage wikipedia , lookup

Switched-mode power supply wikipedia , lookup

Chirp spectrum wikipedia , lookup

Time-to-digital converter wikipedia , lookup

Buck converter wikipedia , lookup

Spark-gap transmitter wikipedia , lookup

Resistive opto-isolator wikipedia , lookup

Variable-frequency drive wikipedia , lookup

Ringing artifacts wikipedia , lookup

Microprocessor wikipedia , lookup

Voltage optimisation wikipedia , lookup

Utility frequency wikipedia , lookup

Alternating current wikipedia , lookup

Rectiverter wikipedia , lookup

Heterodyne wikipedia , lookup

Mathematics of radio engineering wikipedia , lookup

Mains electricity wikipedia , lookup

Transcript
MCD: A GALS Processor Microarchitecture
Dave Albonesi
in collaboration with
Greg Semeraro
Grigoris Magklis
Rajeev Balasubramonian
Steve Dropsho
Sandhya Dwarkadas
Michael Scott
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Outline
!
Motivation for MCD
!
MCD microarchitecture
!
Hiding synchronization delays
!
Fine-grain dynamic voltage and frequency scaling
" Off
line algorithm
algorithm
" Profiling
" Online
!
Potential performance gains with MCD
!
Future research
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Motivation for MCD
!
Increasing challenges of fully synchronous processors
!
Companies have a large investment in synchronous design
!
Designers know how to handle synchronizing signals between
clock domains
!
Gradual elimination of global signals creating more
autonomous units
" Example:
!
Replay Traps instead of pipeline holds
Single microprocessor-wide frequency constrains the
IPC/frequency tradeoffs that can be made in different units
" E.g.,
David H. Albonesi
floating point design decisions linked to front-end decisions
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Motivation for MCD
!
Multiple on-chip voltages today, progress in on-chip voltage
conversion
!
Global Dynamic Voltage Scaling (DVS) has limited
applicability
!
Application phases may be bottlenecked by a subset of the
major functions (fetch/dispatch, integer, floating point,
load/store) of a general-purpose processor, but still all run at
full speed in a synchronous processor
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
MCD at a high level
Front-end Domain
External Domain
L1 I-Cache
Main
Memory
Fetch Unit
Dispatch, Rename, ROB
Memory Domain
L2 Cache
Integer Domain
FP Domain
Issue Queue
Issue Queue
Ld/St Unit
ALUs & RF
ALUs & RF
L1 D-Cache
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
MCD microarchitecture details
!
Front-end, integer, floating point, memory clock domains
" Major
queues (issue queues, load/store queue, ROB) already in place
as buffers that can be used as synchronization points
" Synchronization can mostly be hidden if queues are partially full
" Much autonomy between these major functions
!
L1 Dcache separated from integer and floating point
" Allows
memory to be separately optimized
not adversely effected
" Performance
!
L2 cache placed in the memory domain
" No
L1-L2 synchronization penalty for loads/stores
with large L1 Icache miss rates may be impacted
" Applications
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Synchronization via queues
FIFO queue structure
!
Two types: FIFO and issue queue
!
Key insight: synchronization cost
can be hidden if instruction would
have waited in the queue anyways
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Hiding the synchronization delay
!
Instructions that end up waiting in queues after
synchronization
!
Group of instructions crossing a domain incur a single delay
!
Out-of-order execution
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Synchronization points
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Distribution of synchronization overhead
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Alpha 21264-like synchronization penalties
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
SA-1110-like synchronization sensitivity
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Fine-grain dynamic voltage scaling
!
Exploit imbalance of applications in their domain usage
" Scale
individual domain frequencies to match the demand
!
Effective over a variety of applications
!
Hardware approach: feedback and control system
" Appropriate
for legacy apps
" Hardware overhead
!
Software approach: profiling, insert special domain control
instructions
" Appropriate
for embedded and other applications which behave
consistently among different runs
" Recompilation or binary rewriting
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Voltage scaling hardware models
!
Voltage range of 1.2-0.65V, frequency range of 250MHz1GHz in each domain (same as baseline processor)
!
Independent jitter for each domain
" Calculate
next clock edge based on frequency, last clock edge and
jitter
" Synchronization penalties assessed based on clock edge relationships
!
“Transmeta-like” model
" Models
having to pause operation while increasing frequency and
voltage
" 32 voltage steps, 28.6mV intervals
" 20us per change
!
“Xscale-like” model
" Models
being able to operate through changes
steps, 2.86mV intervals
" 0.1718us to transition, but continue to execute
" 320
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Offline analysis
!
Provides target against which to compare more realistic
control algorithms
!
Can drive energy profiling tool, to help programmers
understand applications and hardware
!
Can drive re-writing tools for embedded applications
!
Summary of operation
" Run
application once at maximum speed
interval, collect dependences among primitive events
" Stretch events off the critical path, distribute slack as evenly as
possible
" Quantize to respect domain boundaries and reconfiguration overhead;
annotate application (simulator)
" Re-run application with chosen reconfiguration points, to measure real
energy savings and performance cost
" Every
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
“Shaker” Algorithm
!
Construct a dependence DAG from
simulator whose nodes are events, e.g.,
Enter instruction fetch queue
Enter an issue queue
" Start execution of an operation
"
"
!
Timestamp from simulator assigned to
each event
!
Arcs denote delay between events
!
Distribute any slack in the graph among
the arcs as evenly as possible
Goal: minimize the variance among
events in the same domain
" Alternately traverse the graph up and
down, gradually scaling events each time
" Continue until all slack is removed or all
events adjacent to slack edges are at
minimum frequency
"
!
O(cN), for N nodes and c frequency
steps
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Dilation thresholding
!
Creates schedule of domain frequencies based on previous
step and performance degradation threshold
!
For each domain do
"
For each interval do
#
#
"
Repeatedly merge neighboring intervals when profitable to do so
#
#
!
Construct a histogram of event frequencies from the DAG
Identify threshold of acceptable performance degradation
Merge histograms, calculate new frequency and energy savings, merge
intervals if improvement
Amortizes the cost of a voltage/frequency change over the time spent at
that voltage frequency for the “Transmeta” model
Output list of reconfiguration points
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Simulation Parameters
!
Resources similar to Alpha 21264
Voltage range: 0.65 – 1.2 V
Frequency range: 0.25 – 1 GHz
!
Representative benchmarks from:
!
!
"
"
"
!
Mediabench
Olden
SPEC 2000 (int and fp)
Three configurations:
"
"
"
MCD at maximum frequency (baseline MCD)
MCD with dynamic voltage scaling (dynamic MCD)
Single-clock with dynamic but global voltage scaling
!
No attempt to scale front-end domain
!
Transmeta-style model (freeze through change)
"
!
32 voltage steps: 20µs per step, 10-20µs for frequency change
XScale-style model (execute through change)
"
David H. Albonesi
320 voltage steps: 0.1718µs per step
MICRO-35 Partially Asynchronous Microprocessors Tutorial
“Transmeta” versus “Xscale” models
!
“Xscale” ability to operate through voltage/frequency
changes permits more frequent reconfigurations
!
Remaining data for “Xscale” model only
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Performance Degradation
18%
16%
14%
12%
10%
8%
6%
4%
2%
Baseline MCD
David H. Albonesi
Dynamic MCD
MICRO-35 Partially Asynchronous Microprocessors Tutorial
ar
t
sw
av im
er
ag
e
m
c
pa f
rs
er
gc
c
ts
p
bz
ip
2
m
st
po
w
e
tre r
ea
dd
ad
pc
m
ep
ic
g7
21
m
es
a
em
3d
he
al
th
0%
Energy Savings
40%
35%
30%
25%
20%
15%
10%
5%
0%
Baseline MCD
David H. Albonesi
Dynamic MCD
t
sw
i
av m
er
ag
e
ar
m
c
pa f
rs
er
gc
c
ts
p
bz
ip
2
st
po
w
e
t re r
ea
dd
m
ad
pc
m
ep
ic
g7
21
m
es
a
em
3d
he
al
th
-5%
Global Voltage Scaling
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Energy-Delay Product
40%
35%
30%
25%
20%
15%
10%
5%
0%
-5%
Baseline MCD
David H. Albonesi
Dynamic MCD
sw
av im
er
ag
e
t
ar
m
c
pa f
rs
er
gc
c
ts
p
bz
ip
2
po
w
e
t re r
ea
dd
st
m
ep
ic
g7
21
m
es
a
em
3d
he
al
th
ad
pc
m
-10%
Global Voltage Scaling
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Epic-decode – Runtime Example
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Ghostscript – Runtime Example
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Bisort – Runtime Example
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Power – Runtime Example
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Vortex – Runtime Example
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Art – Runtime Example
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Offline Result Summary
!
Dynamic MCD
"
"
"
!
Global voltage scaling
"
"
!
Less than 10% performance degradation
About 27% energy savings
20% energy-delay product
About 12% energy savings
3% energy-delay product
Appreciable variability among application phases
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Hardware control: the attack/decay algorithm
!
Exploits correlation between changes in input queue
utilization and domain frequency
!
Each domain operates independently
!
Can be implemented in ~10K transistors for a four-domain
processor
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Attack/decay design space
!
Deviation Threshold
" Difference
!
Reaction Change
" Amount
!
of frequency change on an attack
Decay
" Amount
!
in utilization needed to trigger an attack
of frequency decrease on a decay
Performance degradation threshold
" Amount
of performance degradation during the last interval below
which a frequency decrease is allowed in the next interval
!
Endstop count
" Consecutive
David H. Albonesi
an attack
intervals at max or mix frequency after which we force
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Hardware control: the attack/decay algorithm
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Attack/decay algorithm example #1
!
Changes in floating point queue utilization for epic decode
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Attack/decay algorithm example #1
!
Changes in floating point frequency for epic decode
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Attack/decay algorithm example #2
!
Differences in load/store queue utilization for epic decode
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Attack/decay algorithm example #2
!
Changes in load/store frequency for epic decode
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Performance degradation
30%
25%
20%
15%
10%
5%
0%
-5%
Baseline MCD
!
Dynamic-1%
Dynamic-2%
Dynamic-5%
a v vp r
er
ag
e
tsp
ro
no
i
ar
bz t
ip
eq 2
ua
ke
gc
c
gz
ip
m
c
m f
es
pa a
rs
er
sw
im
vo
rte
x
vo
bh
or
em t
3d
he
al
th
pe ms
rim t
et
e
po r
w
tre er
ea
dd
bis
ad
pc
m
ep
ic
jp
eg
g7
21
gh gs
os m
tsc
rip
m t
es
m a
pe
g2
pe
gw
it
-10%
Attack/Decay
Same overall performance degradation as offline with 1%
performance degradation target (Dynamic-1%)
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Energy savings
50%
40%
30%
20%
10%
Baseline MCD
!
Dynamic-1%
Achieves 86% of the energy savings as offline with 1%
performance degradation target (Dynamic-1%)
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
vpr
average
swim
Attack/Decay
vortex
parser
mcf
mesa
gcc
Dynamic-5%
gzip
equake
art
bzip2
tsp
power
Dynamic-2%
treeadd
mst
perimeter
em3d
health
bisort
bh
pegwit
mesa
mpeg2
gsm
ghostscript
jpeg
g721
epic
adpcm
-10%
voronoi
0%
Energy-delay improvement
50%
40%
30%
20%
10%
0%
Baseline MCD
!
Dynamic-1%
Dynamic-2%
Dynamic-5%
m
cf
es
pa a
rs
er
sw
im
vo
rte
x
av vpr
er
ag
e
m
t
vo sp
ro
no
i
ar
bz t
i
eq p2
ua
ke
gc
c
gz
ip
ad
pc
m
ep
ic
jpe
g
g7
21
gh gs
os m
tsc
rip
m t
es
m a
pe
g
pe 2
gw
it
bh
bis
or
em t
3d
he
al
th
p e ms
rim t
et
e
po r
w
tre er
ea
dd
-10%
Attack/Decay
Achieves 86% of the energy-delay improvement as offline
with 1% performance degradation target (Dynamic-1%)
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Attack/decay summary
!
Correlates input queue utilization changes with frequency
changes
!
Independent control for each domain
!
Implementable in a reasonable number of transistors
" ~0.1%
!
of a 10M transistor chip
Achieves 86% of the energy savings of an offline algorithm
with identical performance degradation
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Profile-driven MCD control
!
Profile run to identify long-running loops and functions for
which the cost of reconfiguration can be effectively amortized
!
Shaker and dilation thresholding algorithms to identify
domain frequencies for each
!
Identify functions, loops, and call tree paths at runtime
through table lookups
!
Distinguish “important” functions and loops and set
frequencies accordingly
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Distinguishing loop/function instances
!
Build a call tree, each node is a
function or loop instance
!
Distinguish different paths to a given
function or loop
!
Table lookup using path,
loop/function PC, and possible call
PC to identify function
!
If marked “important”, change
frequencies/voltages
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Performance comparison with offline
4.50%
4.00%
3.50%
3.00%
2.50%
2.00%
1.50%
1.00%
0.50%
Offline
David H. Albonesi
Loop+PC+path
Loop+path
Funct+PC+path
Funct+Path
Loop
Av
er
ag
e
ad
pc
m
_d
ec
od
ad
e
pc
m
_e
nc
od
e
ep
ic
_d
ec
od
e
ep
ic
_e
nc
od
e
g7
21
_d
ec
od
e
g7
21
_e
nc
od
e
gs
m
_d
ec
od
e
gs
m
_e
nc
od
jp
e
eg
_c
om
pr
jp
e
ss
eg
_d
ec
om
pr
es
m
s
pe
g2
_d
ec
od
m
e
pe
g2
_e
nc
od
e
0.00%
Funct
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Energy comparison with offline
25.00%
20.00%
15.00%
10.00%
5.00%
Offline
David H. Albonesi
Loop+PC+path
Loop+path
Funct+PC+path
Funct+Path
Loop
Funct
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Av
er
ag
e
ad
pc
m
_d
ec
od
ad
e
pc
m
_e
nc
od
e
ep
ic
_d
ec
od
e
ep
ic
_e
nc
od
e
g7
21
_d
ec
od
e
g7
21
_e
nc
od
e
gs
m
_d
ec
od
e
gs
m
_e
nc
od
jp
e
eg
_c
om
pr
jp
e
eg
ss
_d
ec
om
pr
es
m
s
pe
g2
_d
ec
od
m
e
pe
g2
_e
nc
od
e
0.00%
MCD performance optimizations
!
Each domain can run at its natural frequency
!
Global clock skew eliminated
" Saves
!
clock power and metal also
Dynamically trade off size for speed within each domain
using an adaptive MCD
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Dynamic resizing
!
Many dynamic resizing techniques proposed for power
!
Speed of an adaptive structure depends on configuration
" Adaptive
issue queue 70% faster at ¼ size
!
Synchronous system cannot exploit the faster speed of a
downsized structure due to other critical paths
!
Structure can be overly upsized for a particular application
but entire system must be slowed down (global clock)
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Adaptive MCD
!
Idea: upsize structures within MCD domains so as not to
impact other domain frequencies
!
Design each domain to be heavily pipelined for high
frequency (perhaps even overpipelined)
!
Make selected structures adaptive to exploit ILP or to match
larger working sets
!
Upsize structures and adjust frequency when IPC
improvement would override frequency decrease
!
Drawbacks
" Overpipelining
for slower frequencies (IPC penalty)
" Configuration overheads degrade clock speed relative to fixed design
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Adaptive MCD organization
Front-end Domain
L1
I-Cache
L1
L1
I-Cache
L1I-Cache
I-Cache
Br
Pred
Br
Br
Pred
BrPred
Pred
FP Domain
Issue Queue
Issue
Queue
Issue
Issue
Queue
IssueQueue
Queue
ALUs & RF
ALUs & RF
David H. Albonesi
Main
Memory
Fetch Unit
Dispatch, Rename, ROB
Integer Domain
External Domain
Memory Domain
L2Cache
Cache
L2
Cache
L2
L2
Cache
Ld/St Unit
L1D-Cache
D-Cache
L1
D-Cache
L1
L1
D-Cache
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Resizable structures
!
Front end domain
" Icache:
" Branch
#
#
!
gshare PHT: 4KB-64KB
Local PHT: 1KB-8KB, local BHT: 512 or 1024 entries
Integer and floating point domains
" Issue
!
4KB-64KB 2-way
predictor sized according to Icache
queue: 16-64 entries
Load/store domain
" Dcache:
" L2
32KB 1-way – 256KB 8-way
cache: 256KB 1-way – 2MB 8-way sized according to Dcache
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Methodology
!
Baselines
" Fixed
MCD: MCD with fixed frequencies with best overall TPI
" Fully synchronous: design with best overall TPI
!
Fully synchronous structures sized to balanced delays
!
Adaptive MCD additional branch penalty: 2 integer cycles
and 1 front end cycle
!
Adaptive MCD frequency penalty as much as 49%
!
Per-application adaptation (profiling)
!
Benchmarks
" 14
Mediabench
" 3 Olden
" 6 SPEC2000int
" 3 SPEC2000fp
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Fixed MCD versus synchronous
!
Best fixed MCD organization
" 32KB
Icache
gshare PHT, 4KB local PHT, 1KB local BHT
" 128KB Dcache, 1MB L2
" 16 entry queues
" 32KB
!
Best synchronous organization
" 64KB
Icache
gshare PHT, 8KB local PHT, 1KB local BHT
" 64KB Dcache, 512KB L2
" 32 entry queues
" 64KB
!
5% overall performance improvement
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Adaptive MCD versus synchronous
!
Adaptive MCD config for each benchmark
Dcache
Icache
32KB
4KB
adpcm encode, bzip2,
adpcm encode
8KB
mpeg2 encode, swim,
perimeter
16KB
jpeg compress, g721 encode,
g721 encode, equake
32KB
64KB
!
64KB
256KB
gzip
epic decode, bh
jpeg compress, vpr
gsm encode, ghostscript,
gsm encode
128KB
epic encode
parser,
mesa mipmap (IQ=32)
mesa, vortex
em3d (IQ=32)
mesa osdemo, gcc
18% overall performance improvement, min –3%, max 50%
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
Areas for further research
!
Best division into domains
!
Circuits for voltage/frequency islands
!
Front-end control (currently fixed)
!
Dynamic voltage gating for leakage
" Voltage
scaling works best when work is “smoothed out” over a long
period of time
" Voltage gating works best when work is “clumped together” to
introduce idle time
" Best combination of the two that optimizes energy-delay
!
Combining performance and energy features
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial
For More Info…
http://www.ece.rochester.edu/~albonesi/acal
David H. Albonesi
MICRO-35 Partially Asynchronous Microprocessors Tutorial