Increasing Reliability of Performance-critical Pipeline Structures
Niranjan Soundararajan
Advisors: Vijaykrishnan Narayanan
Anand Sivasubramaniam
Computer Systems Lab (CSL)
Microsystems Design Lab (MDL)
Computer Science and Engineering
The Pennsylvania State University
1
Reliability – Increasing Importance
Decreasing transistor size
More transistors
Power/Temperature Hotspots
Increasing Market Segments
HARDWARE
RELIABILITY
2
Performance critical pipeline structures
[Figure: pipeline block diagram. Front end: BHT, BTB, Icache, Inst Fetch, Decode, Alloc, RAT. Back end: Issue Queue, ALU, Load/Store Queue, Dcache, Reorder Buffer, ARF, Inst Retires. Annotations: out-of-order entry, back-to-back wakeup, multi-width pipeline, activity increase, clock frequency.]
3
Transistor Failure
[Figure: failure rate versus time, with three regions: Manufacturing Defects, Random Errors, Wearout]
Manufacturing Defects
– Solutions to address impact of Process Variations on Issue Queue
Random Errors
– Soft Error impact of DVFS on vulnerability of GALS architectures
– Bounding vulnerability of processor structures to provide reliability guarantees
Wearout
– Solutions to reduce non-uniform aging due to NBTI, HCE on microprocessor structures
4
Outline
Motivation
Contributions
Vulnerability bounding mechanisms
Other solutions
– Impact of DVFS on architectural vulnerability of
GALS architectures
– Address process variations in issue queue
– Mitigate NBTI, HCE degradation in structures
Conclusion and Future work
5
Introduction to Soft Errors
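[Figure: particle strike on a transistor with n+ regions in a p-type substrate, generating electron-hole pairs and causing a stored bit to flip (Error)]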
Strike creates electron-hole pairs that can be absorbed
by source/diffusion areas of the transistor to change
state of device
Source: M. Tahoori
6
Impact of Soft Errors
Severity of Soft Error Rates
– In 2003, Fujitsu released SPARC64 with 80% of 200,000 latches covered by transient fault protection
Single Event Upset (SEU) model
Metrics
– MTBF: Mean Time Between Failures
– FIT: Failures in Time; 1 FIT = 1 failure in a billion hours
– FITeff = FITraw * AVF
[Figure: relative soft error rate increase (0 to 150) versus chip feature size (180, 130, 90, 65, 45, 32, 22, 16). Source: Shekhar Borkar, Intel 2004]
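The two metrics on this slide can be related with a few lines of arithmetic. The sketch below is illustrative only: the function names and the numbers are hypothetical, and the FIT-to-MTBF conversion assumes a constant failure rate.

```python
HOURS_PER_BILLION = 1e9  # 1 FIT = 1 failure per 10^9 hours


def effective_fit(fit_raw, avf):
    """Derate a raw FIT rate by the structure's AVF (FITeff = FITraw * AVF)."""
    if not 0.0 <= avf <= 1.0:
        raise ValueError("AVF is a fraction between 0 and 1")
    return fit_raw * avf


def mtbf_hours(fit):
    """MTBF in hours for a given FIT rate, assuming a constant failure rate."""
    return HOURS_PER_BILLION / fit


# Hypothetical structure: raw rate of 1000 FIT, 25% of bits vulnerable.
fit_eff = effective_fit(fit_raw=1000.0, avf=0.25)
print(fit_eff, mtbf_hours(fit_eff))
```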
7
Architectural Vulnerability Factor
(AVF)
Architecturally Correct Execution (ACE)
[Figure: instruction stream (LD A, BR, ST B, ADD, ST B) split into ACE instructions, which reach user-visible output, and unACE instructions such as dead stores and wrong-path instructions]
AVF
- Fraction of bits in a structure vulnerable to soft errors
- ACE bits / (ACE bits + unACE bits)
- Fn (Size, Time)
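Since AVF depends on both structure size and time, one common way to compute it is to average the ACE-bit fraction over a window of cycles. The sketch below assumes a per-cycle trace of ACE-bit counts is already available; the trace and the function name are hypothetical.

```python
def average_avf(ace_bits_per_cycle, total_bits):
    """Time-averaged AVF: ACE bit-cycles divided by total bit-cycles."""
    cycles = len(ace_bits_per_cycle)
    if cycles == 0 or total_bits <= 0:
        raise ValueError("need at least one cycle and a positive structure size")
    return sum(ace_bits_per_cycle) / (total_bits * cycles)


# Hypothetical 256-bit structure observed for four cycles.
trace = [64, 128, 0, 64]
print(average_avf(trace, total_bits=256))
```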
8
AVF: Why is it important to Micro-architects?
[Figure: design flow from System Specification through Architectural Design, Logic Synthesis, Circuit Design and Physical Design to Fabrication and Packaging, annotated with AVF per structure and FITraw]
System Reliability = ∑ (FITraw * AVF)
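The summation can be sketched as a loop over structures. The structure names and (FITraw, AVF) pairs below are made up for illustration and are not taken from the slides.

```python
def system_fit(structures):
    """Sum FITraw * AVF over all structures (name -> (fit_raw, avf))."""
    return sum(fit_raw * avf for fit_raw, avf in structures.values())


# Hypothetical per-structure values.
structures = {
    "reorder_buffer": (400.0, 0.5),
    "issue_queue": (200.0, 0.25),
    "register_file": (320.0, 0.125),
}
print(system_fit(structures))
```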
9
State-of-Art
Microprocessor design: Multi-dimensional problem involving Performance, Power and Reliability
Transient Fault Tolerance
– Simultaneous Redundant Threading (SRT)
– Lockstepping
Optimization techniques
– Parashar et al., ISCA’04
– Gomaa et al., ISCA’05
– Parashar et al., ASPLOS’06
– Reddy et al., ASPLOS’06
Performance Overhead
Single point in Performance-Reliability space
10
Micro-architectural Reliability Knob
[Figure: reliability versus performance curve, running from "More Reliable, Less Performance" to "Less Reliable, More Performance", with FITrequired and the ideal solution marked]
FITeff = FITraw * AVF
– FITraw and AVF treated as constants
– FITraw inflexible
– Tune AVF to meet specifications
“Challenge for computer architects is not to provide absolute guarantees
in reliability, but rather how to provide the adequate amount of reliability
at the lowest cost for the target market segment”
Architecture Design for Soft Errors – Shubu Mukherjee, Intel
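If FITraw is fixed and only AVF can be tuned, the AVF bound that meets a FIT budget follows from rearranging FITeff = FITraw * AVF. A minimal sketch, with hypothetical names and numbers:

```python
def avf_bound(fit_required, fit_raw):
    """Largest AVF that keeps FITraw * AVF within the required FIT budget."""
    if fit_raw <= 0:
        raise ValueError("raw FIT rate must be positive")
    # AVF is a fraction, so a generous budget simply means no bound is needed.
    return min(1.0, fit_required / fit_raw)


print(avf_bound(fit_required=100.0, fit_raw=400.0))
print(avf_bound(fit_required=500.0, fit_raw=400.0))
```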
11
Contributions
First work that provides microarchitectural knobs to satisfy processor
reliability budgets for transient faults
Proactive and Reactive mechanisms to
monitor and bound vulnerabilities of
processor structures at cycle-level
granularity
12
AVF Monitoring
Reorder Buffer/Physical Register File
Reorder Buffer (ROB)
1. Large pipeline structure holding a number of instructions
2. Each instruction spends a significant percentage of its lifetime in the ROB
[Figure: pipeline with Fetch, Decode and RAT (in-order), Issue Queue and ALU (out-of-order), and Reorder Buffer (PRF), Commit and ARF (in-order)]
13
AVF Monitoring Mechanism
Reorder Buffer (ROB)
[Figure: reorder buffer of N entries, each entry B bits wide with an R-bit result; labels: Filled at Dispatch, Filled at WB]
Events: Dispatch Event, Writeback Event, Commit Event, Mis-speculation
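One way to picture event-driven monitoring is a running count of filled bits that is updated on each event. The sketch below is an assumption-laden illustration, not the mechanism from the talk: it assumes B - R bits are filled at dispatch and R bits at writeback, and it counts every filled bit as vulnerable without retroactively discounting mis-speculated instructions, which makes it conservative. The class and method names are hypothetical.

```python
class RobAvfMonitor:
    """Tracks filled bits in an N-entry ROB (entry = B bits, result = R bits)."""

    def __init__(self, n_entries, entry_bits, result_bits):
        self.capacity_bits = n_entries * entry_bits
        self.entry_bits = entry_bits
        self.result_bits = result_bits
        self.live_bits = 0
        self.accumulated_bits = 0
        self.cycles = 0

    def on_dispatch(self):
        # Assumption: everything except the result field is filled at dispatch.
        self.live_bits += self.entry_bits - self.result_bits

    def on_writeback(self):
        # Assumption: the result field is filled at writeback.
        self.live_bits += self.result_bits

    def on_remove(self, written_back):
        """Entry leaves the ROB on commit or mis-speculation squash."""
        freed = self.entry_bits if written_back else self.entry_bits - self.result_bits
        self.live_bits -= freed

    def tick(self):
        """Call once per cycle after the cycle's events are applied."""
        self.accumulated_bits += self.live_bits
        self.cycles += 1

    def current(self):
        return self.live_bits / self.capacity_bits

    def average(self):
        if self.cycles == 0:
            return 0.0
        return self.accumulated_bits / (self.capacity_bits * self.cycles)
```

A per-cycle bound could then be checked by comparing `current()` with the target AVF before allowing further dispatches.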
14
Vulnerability Control via Throttling (VCT)
[Figure: N-entry reorder buffer between DISPATCH and WRITEBACK]
– Stall dispatch and writeback
– Writeback cannot be stalled
– Entire entry ACE at dispatch
– Size = Fn (AVF Bound)
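A simple reading of "Size = Fn (AVF Bound)" is that the bound caps how many ROB entries may be occupied. The sketch below assumes the whole entry counts as ACE from dispatch onward, so the per-cycle AVF equals occupied entries divided by N, and only dispatch is stalled. The class name and this particular mapping are assumptions for illustration.

```python
import math


class ThrottledRob:
    """Stalls dispatch once occupancy reaches the limit implied by the AVF bound."""

    def __init__(self, n_entries, avf_bound):
        # Assumption: a dispatched entry is fully ACE, so AVF = occupied / n_entries.
        self.limit = math.floor(avf_bound * n_entries)
        self.occupied = 0

    def try_dispatch(self):
        """Return True if the instruction may enter the ROB, False to stall."""
        if self.occupied >= self.limit:
            return False
        self.occupied += 1
        return True

    def commit(self):
        """Free one entry at commit."""
        if self.occupied > 0:
            self.occupied -= 1


rob = ThrottledRob(n_entries=8, avf_bound=0.5)
print([rob.try_dispatch() for _ in range(5)])
```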
15
VCT Performance
[Figure: average performance w.r.t. single thread (0 to 1) versus AVF bound (0% to 100%, High Integrity to Low Integrity) for VCT]
16
Advantages of a Reactive Bounding Mechanism
Reorder Buffer
AVF Bound Exceeded
Verify Results
Early Accounting of Writebacks
Mis-speculated Instructions
17
Simultaneous Redundant Threading (SRT): Importance of Selective Redundancy
[Figure: SRT pipeline with Fetch, Decode, ISQ, ALU and Reorder Buffer (PRF), with a RAT and ARF per thread; the redundant thread runs after the primary thread, followed by result verification]
– Redundant execution protects entire pipeline
– Result verification reduces AVF
18
Vulnerability Control via Selective Redundancy (VCSR) Infrastructure
[Figure: pipeline with Fetch, Decode, a RAT and ARF per thread, ISQ, ALU, Reorder Buffer (ROB) and Result Buffer; labels: Greedy Heuristic, AVF Bound Exceeded]
19
VCSR Performance
[Figure: average performance w.r.t. single thread (0.4 to 1) versus AVF bound (0% to 100%, High Integrity to Low Integrity) for VCSR, SRT and VCT]
20
Optimizations
Primary Thread Out Of Order Commit
– Non-compacting Reorder Buffer
– Reduces AVF
– Performance boost since fewer instructions are re-executed
– Writeback to commit: ROB AVF affected
– Secondary thread maintains architected state
[Figure: pipeline with Fetch, Decode, a RAT and ARF per thread, ISQ, ALU, Reorder Buffer (PRF) and Result Buffer]
21
VCH with OOO Commit Performance
[Figure: average performance w.r.t. single thread (0.4 to 1) versus AVF bound (0% to 100%, High Integrity to Low Integrity) for VCH(OOO), SRT, VCSR and VCT]
22
Impact of vulnerability bounding
Per-cycle vulnerability bounds,
guaranteeing FIT rates are met
Future Work
– Looking at developing a system-level AVF
monitoring and bounding infrastructure
23
Outline
Motivation
Contributions
Vulnerability bounding mechanisms
Summary of other works
– Impact of DVFS on architectural vulnerability of
GALS architectures
– Address process variations in issue queue
– Mitigate NBTI, HCE degradation in structures
Conclusion and Future work
24
Need for vulnerability analysis in GALS Architectures
Multiple domains, each driven by individual clocks
– Need for global clock network avoided
GALS enables fine-grained VF scaling tuned to individual domains
– DVFS provides high performance per watt
Reliability Impact ignored
– DVFS algorithms for GALS architectures are studied w.r.t IPC per watt
– Voltage scaling affects FITraw, Frequency scaling affects AVF
• Impact on AVF due to applying different DVFS algorithms
• Help designers choose DVFS algorithms meeting reliability requirements
25
AVF impact across algorithms
– Significant AVF variations when applying different algorithms
– Most DVFS algorithms lead to worse AVF than Non-DVFS
[Figure: normalized AVF of the Issue Queue (0.7 to 1.5, lower is better) for Threshold, AD, ModAD, PI and Greedy; 38% variation]
26
Process Variation (PV) - Introduction
Process Variation: Variation in characteristics between two identically designed circuits
• Performance and Power impact significant
• Lack of predictability in timing characteristics leads to loss of yield
Definite need to address PV at circuit and microarchitectural level
[Figure: classification of process variation into Dynamic and Static, Random and Systematic; listed sources: Aging, Thermal Effects, Dose, RDF, Sub-wavelength Lithography, Overlay]
[Figure: mean number of dopant atoms versus technology node (nm); lithography wavelength (365nm, 248nm, 193nm, 13nm EUV) versus technology generation (180nm to 32nm) over 1980 to 2020, with the gap between them marked. J. Tschanz et al., DAC 2005]
28
Contributions
Study the impact of PV on the Issue Queue of a microprocessor
– PV-unaware design has about 21% performance degradation w.r.t Non-PV design
PV is a non-deterministic phenomenon. Design-time static partitioning is not possible. Our solution enables the fast and slow entries to co-exist
Instruction steering and sub-component switching schemes to reduce the impact of PV
– Performance loss is about 1.3% w.r.t Non-PV design
29
Issue Queue Entry
[Figure: issue queue entry with Valid (V) bit, Opcode, source fields each with a Ready (R) bit, Tag and Operand, and a Dest Tag; Tag1 to Tag N forwarding comparison; ports: Dispatch Write, Forwarding Write, Issue Read; Select Logic]
[Figure: timeline from t to t+3: ALLOC LOGIC (Alloc stalls Dispatch when ISQ full), DISPATCH WRITE (Valid Bit set), FORWARDING (Operand Ready bit set; instructions wait for ready operands), SELECT (inst. ready), INSTRUCTION ISSUE (Valid Bit reset)]
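The entry fields and the dispatch, forwarding and select steps named on this slide can be modelled in a few lines. This is a behavioural sketch only: all class and method names are hypothetical, selection here is by entry position rather than age, and no timing or process-variation effects are modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IsqEntry:
    valid: bool = False
    opcode: str = ""
    src_tags: List[int] = field(default_factory=list)
    src_ready: List[bool] = field(default_factory=list)
    dest_tag: Optional[int] = None


class IssueQueue:
    def __init__(self, size):
        self.entries = [IsqEntry() for _ in range(size)]

    def dispatch(self, opcode, src_tags, dest_tag, ready_tags=()):
        """Dispatch write: fill a free entry and set its valid bit.

        Returns False when the queue is full, i.e. alloc stalls dispatch.
        """
        for entry in self.entries:
            if not entry.valid:
                entry.valid = True
                entry.opcode = opcode
                entry.src_tags = list(src_tags)
                entry.src_ready = [t in ready_tags for t in src_tags]
                entry.dest_tag = dest_tag
                return True
        return False

    def forward(self, tag):
        """Forwarding: compare a broadcast tag against every waiting source tag."""
        for entry in self.entries:
            if entry.valid:
                entry.src_ready = [
                    ready or t == tag
                    for t, ready in zip(entry.src_tags, entry.src_ready)
                ]

    def select(self, width=1):
        """Issue read: pick up to `width` ready entries and reset their valid bits."""
        issued = []
        for entry in self.entries:
            if len(issued) == width:
                break
            if entry.valid and all(entry.src_ready):
                entry.valid = False
                issued.append((entry.opcode, entry.dest_tag))
        return issued
```

For example, an instruction dispatched with one source still outstanding stays in the queue until `forward` broadcasts the matching tag, after which `select` can issue it.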
30
Results
– Stalls reduced w.r.t specific activity
– Operand and port-switching further reduce stalls to a minimum
[Figure: IPC (1.2 to 1.45) for Non-PV, Shutdown, MCD and PV-Aware, annotated with 12%, 7.3% and 1.3%]
31
Increasing impact of transistor wearout
[Figure: failure rate versus time, showing Infant Mortality, Event Related (random) failures over the useful life (years), and Device Wear-out, with useful life decreasing as technology scales. Source: Intel]
– Transistor lifetime decreasing with newer technologies
– Conservative guardbands impact performance
– System longevity affects revenue
– In more than 50% of organizations, machine age > 10 years
Poll by Gartner Research, Source: J. Blome, Micro 2007
33
Contributions
NBTI, HCE impact increasing in upcoming
technologies
Conventional collapsing issue queues have
unwanted instruction movement across entries
– Collapsing required for age-based selection
Round-Robin scheme to provide restricted
collapsing
Restricted collapsing balances switching activity,
not losing much of age-based selection
34
Implementation
[Figure: evaluation flow. SPEC2K benchmark (100M instructions) feeds the SimpleScalar architectural simulator [ISQ], which captures Rd / Wr / Sw / Data probabilities per cell; these feed an HSpice (32nm, 380K) transistor-level degradation model for 10-year degradation, producing read delay degradation]
Typically, solutions look at worst-case probabilities that might rarely occur
35
Results
[Figure: performance as IPC (1.6 to 1.7) for Conventional and Round Robin; 1% reduction]
[Figure: read delay degradation (%, 0 to 18) for Conventional and Round Robin; 32% reduction]
36
Conclusion
Growing Reliability concern
“Pop culture of reliability has arrived”
- Dr. Phil Emma, IBM [Architecture Design for Soft Errors]
Work looks at increasing the fault-tolerance
in back-end
– Soft errors
– Process variation
– Wearout
37
Current Work
Multi-core designs have come to prominence
While caches have ECC, the multiple pipelines
involve structures holding data – ECC is hard
– Total vulnerability to soft errors increases
Study the impact on AVF of different structures
in a multi-core environment
38
Future Work
Multi-core
– Cores increase, market segments increase
– ILP vs TLP vs Clock frequency increase
– Application/Hardware sense best configuration
Reconfigurable Hardware
– Defect Tolerance
– Verification time increasing
– “Firmware update” to control functionality
39
40
Backup slides
41
DVFS Algorithms
Threshold
– VF scaling uses fixed thresholds. Preset thresholds affect algorithm efficiency
Attack-Decay (AD)
– Based on utilization in adjacent intervals. Attack whenever there is a big utilization change, otherwise decay. Greedy nature affects efficiency
Modified Attack-Decay (ModAD)
– Attack phase modified to correspond to utilization change. Large VF swing can affect performance per watt
PI
– µ_k = µ_{k-1} + K_I (q'_k – q_ref) + K_P (q'_k – q'_{k-1})
– f_k = µ_k / IPC
Greedy
– Sample and Hold phase. VF scaling based on ED^2 of past 2 intervals
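The PI update above can be written as a single step function. The slide does not define q; the sketch assumes it is a measured utilization signal sampled once per interval, and the gains and sample values below are hypothetical.

```python
def pi_step(mu_prev, q_now, q_prev, q_ref, k_i, k_p, ipc):
    """One PI interval: returns (mu_k, f_k) following the two equations above."""
    mu_k = mu_prev + k_i * (q_now - q_ref) + k_p * (q_now - q_prev)
    f_k = mu_k / ipc
    return mu_k, f_k


# Hypothetical interval: the signal rose from 4 to 6 against a reference of 5.
mu_k, f_k = pi_step(mu_prev=1.0, q_now=6.0, q_prev=4.0, q_ref=5.0,
                    k_i=0.1, k_p=0.5, ipc=1.5)
print(mu_k, f_k)
```

The integral term reacts to the distance from the reference, while the proportional term reacts to the change since the previous interval.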
42
Vulnerability Efficiency
Lower is
better
40% variation
Non-DVFS has the best
vulnerability efficiency
– On average, AD and PI
provide the best
vulnerability efficiency
43
Round Robin scheme
[Figure: round-robin issue queue with Head, PseudoHead (PH), Tail and new instruction insertion; clocked control bits (Ctrl Bit 0 to Ctrl Bit N) form a Collapse Control Vector that separates entries later than PH]
44
Reliability Issues of Importance
Solutions that are robust but overhead-aware
as well
45
Contributions
• Bounding vulnerability of processor structures to provide reliability guarantees
• Study impact of DVFS on vulnerability of GALS architectures
• Solutions to reduce non-uniform aging due to NBTI, HCE on microprocessor structures
• Solutions to address impact of process variations on issue queue
[Figure: classification of hardware failures into Permanent and Temporary, with Transient and Intermittent types; listed causes: Wearout, Radiation, Soft Errors, Non-Radiation, Power supply, Process variation. Source: ISCA 2005 tutorial]
46
Results
[Figure: average performance w.r.t. single thread (0.4 to 1) versus AVF bound (0% to 100%, High Integrity to Low Integrity) for SR with T(OOO), SRT, SR and Throttling (T)]
47
PV-aware steering - OptiSteer
[Figure: non-collapsing Issue Queue whose Alloc logic assigns an ISQ entry through an entry id decoder and demux, using a per-entry Slow Entry Bit and a Stall Optimization Table; inputs from the RAT are Op, STag1, STag2 and DTag; outputs include Dest Tag and STALL]
48
Intra-Entry Variation schemes
Operand- and Port-Switching
[Figure: issue queue entry (V, Opcode, R/Tag/Operand per source, Dest Tag) with Dispatch Write and Issue Read ports. Port Switch changes the port used at dispatch; Operand Switch reorders the fields from "Op STag1 Operand1 STag2 DTag" to "Op STag2 STag1 Operand1 DTag"]
49
Timeline of ISQ activities
[Figure: timeline from t to t+3 with PV effects. ALLOC LOGIC: Alloc stalls Dispatch when ISQ full. DISPATCH WRITE: Valid Bit set; slow dispatch write handled by Port Switch and Operand Switch. FORWARDING: Operand Ready bit set, SOT fill, forwarding stall when SOT value is required; instructions wait for ready operands. SELECT: inst. ready; fewer instructions selected. INSTRUCTION ISSUE: Valid Bit reset; slow issue read handled by Port Switch]
50
Conventional Collapsing ISQ
[Figure: collapsing issue queue with entries 0 to N between Head and Tail, collapsing logic driven by clocked control bits (Ctrl Bit 1 to Ctrl Bit N), and issue from the head]
Age-ordering for Instruction Selection
51
Round Robin scheme
[Figure: issue queue with Head, PseudoHead and Tail; collapse regions controlled by a clocked control bit, with new instructions inserted at the tail]
52
NBTI/HCE
NBTI – Traps due to negative voltage at gate
(input “0”)
– Dominant in PMOS transistor
– Increased when holding same data for long periods
HCE – Traps due to high electric field near the
drain
– Dominant in NMOS transistor
– Increased when switching activity is high
Vth shift accumulates over time, affects timing
53
Contributions
Global solutions
– Body Biasing: frequency boost increases leakage. Non-ideal for Issue Queue
– Time-borrowing: absorbing clock jitter and skew becomes difficult
Structure-specific solutions
– Solutions for register file, and caches
Issue Queue is a performance-determining structure; its operation combines CAM and SRAM cells
• PV is a non-deterministic phenomenon. Our solution enables the fast and slow entries to co-exist
• Instruction steering and sub-component switching schemes are proposed to reduce the impact of PV
54
Results
[Figure: IPC (1 to 1.5) for NonPV, PV-unAware, SpeedSteer and OptiSteer; bar labels shown: 1.43, 1.43, 1.42, 1.36, 1.31, 1.14]
55
Throughput comparison
[Figure: throughput comparison; 10.5% relative decrease]
56
Switching Activity
57
Wearout phenomena
– Negative Bias Temperature Instability
– Hot Carrier Effects
– Electro-Migration
– Oxide Breakdown
• NBTI, HCE impact increasing in upcoming technologies (A. Tiwari, Micro 2008; S. Sapatnekar, ISQED 2006)
• Factors: temperature, switching activity, data (gate bias), Vdd, current density
[Figure: transistor cross-section with gate (G), source (S), drain (D) and body (B) terminals, N+ regions, oxide and P-well, and gate current components including Igd, Igc and Igb. Source: J. Blome, Micro 2007]
58
Optimizations – Vulnerability Control Hybrid
– Dispatch bandwidth not effectively utilized
– Reduces bottleneck in in-order units like Result Buffer
[Figure: pipeline with Fetch, Decode, a RAT and ARF per thread, ISQ, ALU and Reorder Buffer (PRF)]
59
Microprocessor Design: Multi-Dimensional Problem
Data sensitivity – Application Dependent
Microprocessor design: Performance not single dimension
– Power
– Thermal effects
– Reliability
Dimension-order driven by market
– Aircraft, Health-care: Reliability
– Embedded: Power, Thermal
– Desktops, Game Consoles: Performance

Integrity level of      Application data         Market      Examples
application domain      integrity requirement    volume
Low Integrity           Low                      Huge        Consumer Electronics
Moderate Integrity      Moderate                 Large       Present-day Automotive
High Integrity          Very High                Moderate    Enterprise Server
Safety Critical         Very High                Small       Flight Control

Source: Mitigation of Transient Faults at the System Level – the TTA approach. Hermann Kopetz, SELSE 2006
60
GALS Architecture
Domains driven by individual clocks
– Domain is internally synchronous
Careful tuning of global clock distribution network is avoided
– Better frequency scaling
Different domains interact through FIFO Buffers
DVFS: high performance per watt
GALS enables fine-grained VF scaling tuned to individual domains
[Figure: pipeline partitioned into clock domains 1 to 6, covering Fetch; Decode and Rename; Reg File; Int, FP and Mem ISQs, each with Reg Read, Exec and Write Back; D-cache; Retire]
61
Contributions
Reliability Impact ignored
– DVFS algorithms for GALS architectures are studied w.r.t IPC per watt
– Voltage scaling affects FITraw, Frequency scaling affects AVF
• Impact on architectural vulnerability due to applying different DVFS algorithms
• Characterize the Vulnerability Efficiency (AVF*Watts/IPC) of DVFS algorithms
• Help designers choose DVFS algorithms meeting reliability requirements
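The Vulnerability Efficiency metric (AVF*Watts/IPC, lower is better) is easy to tabulate once the three quantities are measured. In the sketch below the configuration names and measurements are placeholders, not results from the talk.

```python
def vulnerability_efficiency(avf, watts, ipc):
    """AVF * Watts / IPC; lower values are better."""
    if ipc <= 0:
        raise ValueError("IPC must be positive")
    return avf * watts / ipc


def best_configuration(measurements):
    """Return the name with the lowest vulnerability efficiency.

    `measurements` maps a configuration name to (avf, watts, ipc).
    """
    return min(measurements, key=lambda name: vulnerability_efficiency(*measurements[name]))


# Placeholder measurements for two hypothetical configurations.
measurements = {
    "config_a": (0.25, 40.0, 2.0),
    "config_b": (0.5, 30.0, 1.5),
}
print(best_configuration(measurements))
```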
62