MicroFix:
Exploiting Path-grained Timing Adaptability for
Improving Power-Performance Efficiency
Guihai Yan, Yinhe Han, Hui Liu,
Xiaoyao Liang, Xiaowei Li
Key Laboratory of Computer System and Architecture,
ICT (Institute of Computing Technology), CAS, Beijing, P.R. China
NVIDIA Corporation, USA
Outline
• What’s Path-grained Timing Adaptability (PTA)
• Potential of PTA for Efficiency Improvement
• How to Exploit PTA
• Case Study Results
• Conclusions
Impact of DVFS to Path Delay
[Figure: the (K-1)th and Kth pipeline stages separated by flip-flops (FF), showing paths P1 and P2, a critical path, a non-critical path, and the cycle period T.]
• Traditionally, suppose scaling the voltage down makes P1 and P2 timing-critical. Then what?
  • The frequency is scaled down for all stages of the pipeline
Question:
• Can these emerging critical paths be salvaged to allow further voltage scaling?
• Maybe yes, by fine-grained time stealing!
Timing Imbalance
[Figure: a flip-flop (FF) between a PCP and an NCP; T is the cycle period.]
• Generous Flip-flop (GFF): Slack_up > TH, Slack_dn > TH
• Forward Adaptable Flip-flop (FAFF): Slack_up > TH, Slack_dn ≤ TH
• Backward Adaptable Flip-flop (BAFF): Slack_up ≤ TH, Slack_dn > TH
• Unadaptable Flip-flop (UAFF): Slack_up ≤ TH, Slack_dn ≤ TH
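The four flip-flop classes above can be written as a small classifier. This is a minimal sketch: the function name `classify_ff` and the convention that both slacks and TH are expressed as fractions of the cycle period are assumptions made for illustration.

```python
def classify_ff(slack_up: float, slack_dn: float, th: float) -> str:
    """Classify a flip-flop from its upstream and downstream slack.

    slack_up, slack_dn and th are assumed to share one unit
    (here: fractions of the cycle period T).
    """
    up_ok = slack_up > th   # upstream slack exceeds the threshold
    dn_ok = slack_dn > th   # downstream slack exceeds the threshold
    if up_ok and dn_ok:
        return "GFF"        # generous
    if up_ok:
        return "FAFF"       # forward adaptable
    if dn_ok:
        return "BAFF"       # backward adaptable
    return "UAFF"           # unadaptable


# Hypothetical slack values, TH = 0.2 cycle
labels = [classify_ff(0.3, 0.3, 0.2), classify_ff(0.3, 0.1, 0.2),
          classify_ff(0.1, 0.3, 0.2), classify_ff(0.2, 0.2, 0.2)]
```

Note that a slack exactly equal to TH counts as "≤ TH", so the last example is classified as a UAFF.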
Intrinsic Timing Imbalance
• Case study
  • FPU, adopted by OpenSPARC T1
  • Supports all IEEE 754 floating-point data types
  • Synthesized by Synopsys Design Compiler with UMC 0.18um technology
  • Cycle period: (1+10%) × T_critical
[Figure: number of flip-flops (log scale, 1 to 10000) of each type (GFF, FAFF, BAFF, UAFF) for TH = 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, and 0.4 cycle.]
• The GFFs, FAFFs, and BAFFs make up a considerable, even dominant, proportion!
• Attractive potential
DVFS Exacerbating Imbalance
• Generally, the time margin of longer paths diminishes much faster than that of shorter ones
Example
[Figure: a flip-flop (FF) between a path of n gates and a path of m gates, with slacks Slack_up and Slack_dn.]
Define: △S = |Slack_dn - Slack_up|
• Assume that the path delay is the sum of the delays of the gates on the path (with n > m)
  • TG: the gate delay
  • Delta: the change in gate delay caused by scaling the voltage down
• Before voltage scaling down
  • △S1 = (n-m) × TG
• After voltage scaling down
  • △S2 = (n-m) × (TG + Delta)
• △S1 < △S2
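A short numeric check of the imbalance argument above, under the same additive gate-delay assumption. The gate counts and delay values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def slack_imbalance(n: int, m: int, gate_delay: float) -> float:
    """Return |Slack_dn - Slack_up| for paths of n and m gates.

    With path delay = (number of gates) x gate_delay, the slacks are
    T - n*gate_delay and T - m*gate_delay, so T cancels out.
    """
    return abs(n - m) * gate_delay


# Hypothetical values: 20-gate and 10-gate paths, arbitrary time units
N_GATES, M_GATES = 20, 10
TG, DELTA = 1.0, 0.25

ds1 = slack_imbalance(N_GATES, M_GATES, TG)          # before scaling: 10.0
ds2 = slack_imbalance(N_GATES, M_GATES, TG + DELTA)  # after scaling: 12.5
```

Since Delta > 0 whenever the voltage is lowered, △S2 exceeds △S1 for any n ≠ m, which is the sense in which DVFS exacerbates the imbalance.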
If the Imbalance Is Utilized…
• Check the lower bound of the cycle period T
  • Traditionally:
    T1 = n × (TG + Delta)
  • From MicroFix’s perspective:
    T2 = (m+n)/2 × (TG + Delta) ≤ T1 - TH
Note: the UAFFs are precluded
[Figure: cycle period T versus gate delay δ = δ(V), with slope n without MicroFix and (m+n)/2 with MicroFix; frequency F = 1/T versus 1/V, labeled 1/n without MicroFix and 2/(m+n) with MicroFix.]
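The two lower bounds on the cycle period can be compared numerically. This is a sketch under the slide's additive delay model; the function names and the example values are assumptions, and the UAFFs and the "≤ T1 - TH" condition are not modeled.

```python
def cycle_bound_traditional(n: int, tg: float, delta: float) -> float:
    """T1: the longest path (n gates) alone sets the cycle period."""
    return n * (tg + delta)


def cycle_bound_microfix(n: int, m: int, tg: float, delta: float) -> float:
    """T2: time is balanced across the n-gate and m-gate paths."""
    return (m + n) / 2 * (tg + delta)


# Hypothetical values: n = 20 gates, m = 10 gates, arbitrary time units
t1 = cycle_bound_traditional(20, 1.0, 0.25)   # 25.0
t2 = cycle_bound_microfix(20, 10, 1.0, 0.25)  # 18.75
freq_gain = t1 / t2                            # equals 2n / (m + n)
```

The frequency ratio depends only on the gate counts, 2n/(m+n), because the per-gate delay (TG + Delta) appears in both bounds and cancels.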
How to deal with UAFFs?
• Two-supply voltage scheme [Usami, JSSC’98] [Ghosh, TCAD’07]
  • Critical Isolation: the critical paths resulting in UAFFs
  • The supply voltage of the Critical Isolation is more conservative than that of the portion outside the Critical Isolation
[Figure: the Critical Isolation is powered by the conservative voltage; the exploitable scope of MicroFix is powered by the aggressive voltage.]
How to “Fix”?
• Two-supply voltage scheme
• Timing sensors [Yan, DATE’09] [Agarwal, VTS’07]
• Multiple-phase clocks (generated by a DLL)
[Figure: the target pipeline under MicroFix: timing sensors attached to the flip-flops (GFF, FAFF, BAFF, UAFF) of the (K-1)th and Kth stages, delay error prediction signals feeding the voltage/frequency control, a normal and a conservative voltage supply, and a timing diagram of the clocks CLK, FCLK, and BCLK with phase offsets of T×TH.]
Operational Principles
[Figure (a): Traditional DVFS and frequency tuning. Reducing power: reduce the voltage and reduce the frequency between (V, F) operating points. Increasing performance: increase the frequency and increase the voltage.]
[Figure (b): MicroFix enhanced DVFS. The same transitions between (V, F) operating points, with a monitoring state on each side. Diagram labels: V → V-v, F → F-f, F → F+f, V+v, “No error predicted”, “Error predicted”, “Restore a tight margin”.]
• Ensure that the restored margins ‘v’ and ‘f’ can guard a safe voltage
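One plausible reading of the power-reducing side of panel (b), MicroFix enhanced DVFS, is a loop that keeps stepping the voltage down by v while the sensors predict no error, and steps back up by v once an error is predicted. The sketch below encodes that reading only; the function name, the millivolt units, and the `error_predicted` callback standing in for the timing sensors are all assumptions, not a description of the authors' controller.

```python
from typing import Callable


def scale_voltage_down(v_start_mv: int, v_step_mv: int, v_min_mv: int,
                       error_predicted: Callable[[int], bool]) -> int:
    """Lower the supply voltage step by step under sensor monitoring.

    error_predicted(v_mv) is a hypothetical stand-in for the timing
    sensors: it returns True when a delay error is predicted at v_mv.
    Voltages are integers in millivolts to avoid rounding drift.
    """
    v_mv = v_start_mv
    while v_mv - v_step_mv >= v_min_mv:
        v_mv -= v_step_mv              # V -> V - v
        if error_predicted(v_mv):      # sensors flag an impending error
            v_mv += v_step_mv          # restore a tight margin: V + v
            break
    return v_mv


# Hypothetical pipeline whose sensors fire below 800 mV
settled_mv = scale_voltage_down(1000, 50, 600, lambda mv: mv < 800)
```

The frequency-raising side (F → F+f, then back off by f when an error is predicted) would follow the same pattern with the roles of V and F exchanged.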
Experimental Setup
• Gate-level
  • Study the adaptability and overhead with a synthesized FPU
    – Timing info. -> PrimeTime
• Transistor-level
  • Investigate the power-performance tradeoffs with Hspice simulations
    – 32nm PTM models dedicated for HP and LP applications, respectively
Exploring Design Tradeoffs
• ‘TH’ plays a critical role in determining the ultimate efficiency
• What ‘TH’ is optimal?
[Figure: number of flip-flops (log scale, 1 to 10000) of each type (GFF, FAFF, BAFF, UAFF) for TH = 0.1 to 0.4 cycle.]
• Smaller ‘TH’: smaller Critical Isolation (CI), but less aggressive voltage reduction!
• Larger ‘TH’: larger CI, but more aggressive voltage reduction!
[Figure: the exploitable scope of MicroFix versus the Critical Isolation, for a smaller and a larger ‘TH’.]
Exploring Design Tradeoffs /2
• Percentage of Cells in Critical Isolation
TH                    0.1     0.15    0.2     0.25    0.3      0.35     0.4
Percentage of cells   0.00%   0.00%   0.16%   2.04%   10.82%   22.70%   33.52%
Exploring Design Tradeoffs /3
• Sensor Area Overhead
  • The area of a sensor is about 8x that of a pipeline flip-flop (based on the number of transistors) [Yan, DATE’09]
  • The paths in the Critical Isolation and those with ‘over-large’ slack (i.e. slack > T × TH + t_margin) do not need to be monitored by sensors
[Figure: sensor area overhead (0% to 14%) versus TH (0.1 to 0.4); bar labels: 12.34%, 10.95%, 9.20%, 9.97%, 3.75%, 2.10%, 0.00%.]
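The monitoring rule above can be stated as a predicate, with the 8x figure used to express sensor area in flip-flop equivalents. This is a sketch: the function names, the picosecond units, and the example numbers are assumptions for illustration.

```python
def needs_sensor(slack: float, period: float, th: float,
                 t_margin: float, in_critical_isolation: bool) -> bool:
    """Decide whether a path endpoint needs a timing sensor.

    Paths inside the Critical Isolation, and paths whose slack exceeds
    T x TH + t_margin, are left unmonitored.
    """
    if in_critical_isolation:
        return False
    return slack <= period * th + t_margin


def sensor_area_in_ff_units(n_sensors: int, ratio: int = 8) -> int:
    """Sensor area in flip-flop equivalents (one sensor ~ 8 flip-flops)."""
    return ratio * n_sensors


# Hypothetical endpoints: (slack in ps, inside Critical Isolation?)
endpoints = [(100.0, False), (400.0, False), (100.0, True)]
monitored = [needs_sensor(s, 1000.0, 0.2, 50.0, ci) for s, ci in endpoints]
area_units = sensor_area_in_ff_units(sum(monitored))
```

Turning `area_units` into a percentage overhead would additionally require the total cell area of the design, which the slide does not give.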
Exploring Design Tradeoffs /4
• Sensor Power Overhead
  • In the most pessimistic case (TH=0.3, all sensors simultaneously flag timing errors): 14%
  • However, such a worst-case power overhead can hardly happen, for three reasons:
    1) Sensors do not need to be always on
    2) It is almost impossible for all sensors to flag impending timing errors simultaneously
    3) TH=0.3 is actually not an optimal configuration
• Therefore, the pessimistic power overhead won’t offset much of the efficiency of MicroFix!
Hspice Simulations
• Objective: investigate the detailed delay-power relation of the target pipeline
• It would be ideal to directly simulate the transistor-level model of the target pipeline with Hspice; however, that is very labor-intensive and time-consuming
• So we took an indirect way to conduct the Hspice simulations:
  P_total(V,F) = P_comb(V,F) + P_ff(V,F)
  1/F = T = t_c + t_setup + t_c-to-q
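The decomposition above separates the pipeline into a combinational part and a flip-flop part, for both power and timing. A minimal sketch of the two formulas follows; the function names and the numeric values are hypothetical and are not results from the Hspice simulations.

```python
def cycle_period(t_c: float, t_setup: float, t_c_to_q: float) -> float:
    """T = t_c + t_setup + t_c-to-q (all in the same time unit)."""
    return t_c + t_setup + t_c_to_q


def frequency(t_c: float, t_setup: float, t_c_to_q: float) -> float:
    """F = 1 / T."""
    return 1.0 / cycle_period(t_c, t_setup, t_c_to_q)


def total_power(p_comb: float, p_ff: float) -> float:
    """P_total = P_comb + P_ff at one (V, F) operating point."""
    return p_comb + p_ff


# Hypothetical operating point: delays in ns, power in mW
period_ns = cycle_period(0.75, 0.125, 0.125)  # 1.0 ns
freq_ghz = frequency(0.75, 0.125, 0.125)      # 1.0 GHz
power_mw = total_power(3.0, 1.0)              # 4.0 mW
```

With this split, the combinational and sequential components can each be characterized separately over voltage and then recombined.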
Combinational Component
• ISCAS85 (c432, c499, c880, c1355, c1908, c2670)
• 32nm PTM models (HP and LP versions)
[Figure: (a) normalized delay and (b) normalized power versus voltage (1 V down to 0.6 V), for the Low Power and High Perf. models.]
• The normalized V-D and V-P relations hold well for all of the simulated benchmarks!
Sequential Component
[Figure: (a) normalized delay (t_setup + t_c-to-q) and (b) normalized power for α = 1, 0.5, and 0.25, versus voltage (1 V down to 0.6 V), for the Low Power and High Perf. models; the curves are the V-D and V-P relations.]
Efficiency Comparison
[Figure: sensor area overhead versus TH (0.1 to 0.4).]
• TH = 0.2 is an optimal choice!
• Efficiency improvement: 35% EDP, 28% PDP
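For reference, the two metrics can be computed as follows, assuming the usual definitions PDP = power × delay and EDP = power × delay². The operating points below are hypothetical and serve only to show the arithmetic; they are not the measured results.

```python
def pdp(power: float, delay: float) -> float:
    """Power-delay product (energy per operation)."""
    return power * delay


def edp(power: float, delay: float) -> float:
    """Energy-delay product: (power * delay) * delay."""
    return power * delay * delay


def improvement(baseline: float, new: float) -> float:
    """Relative reduction of a metric against its baseline."""
    return (baseline - new) / baseline


# Hypothetical normalized operating points: (power, delay)
base_p, base_d = 1.0, 1.0
new_p, new_d = 0.5, 1.25

pdp_gain = improvement(pdp(base_p, base_d), pdp(new_p, new_d))  # 0.375
edp_gain = improvement(edp(base_p, base_d), edp(new_p, new_d))  # 0.21875
```

EDP penalizes added delay more heavily than PDP, which is consistent with reporting EDP for high-performance settings and PDP for low-power ones.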
Conclusion
• MicroFix can improve DVFS efficiency by exploiting the path-grained adaptability
• The timing imbalance threshold, TH, implies a critical design tradeoff
• The EDP efficiency for HP applications improves by up to 35% and the PDP efficiency for LP applications by up to 28%, at the expense of only 7% area overhead
Thanks!
Q&A