Multiple Clock and Voltage Domains
for Chip Multi Processors
Efraim Rotem
Intel Corporation, Israel
Ran Ginosar
Technion, Israel
Avi Mendelson
Microsoft R&D, Israel
Uri Weiser
Technion, Israel
December - 2009
Dec-2009
Chip with Multiple Clock and Voltage Domains
Compute Performance matters
We would like to keep delivering performance, but power is the #1 limiter
Both process technology and ILP slow down → multi-core architectures
[Figure: log-scale vertical axis from 1 to 10,000 against year, 1978 to 2006, with power levels of 1W, 10W and 100W marked. Annotations: "Fueled by a combination of process and arch"; "An order of magnitude more power efficient but deep in the power wall". Source: Dave Patterson]
Work Overview - scope
• How to best architect and manage clock and voltage domains of a CMP to maximize performance under power constraints
• 16 core Power constrained CMP
• 1 thru 16 voltage regulators (VR)
– Either on chip or off chip VR
• 1 thru 16 clock domains
– FIFO buffers increase latency
• Paper contributions:
– Power delivery constrains DVFS
• Multi-voltage domains not so easy
– Methodology to evaluate CMP workloads
– Clustered voltage and clock domains
[Figure: CPU block diagram with labels PMU; DC/DC VR #1 to #n; Core PE #1 to #n; FIFO Buffer; Interconnect; L2 Cache; I/O and Memory]
Operation point and constraints
• Process technology voltages
– Voltage range Vmin – Vmax
– Frequency range fmin – 2fmin
– Nominal working point Vmin, fmin
• Lower bound on quality of service
– Frequency DFS down to ½ fmin
• Total power is a constraint
– Must not exceed nominal power
• Power delivery has been added as a constraint
• Most constraining parameter wins
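The "most constraining parameter wins" rule can be sketched as taking the minimum over per-constraint frequency caps. This is a minimal illustration, not the authors' model: the constraint names and cap values below are hypothetical, and frequencies are expressed relative to fmin.

```python
F_MIN = 1.0            # nominal frequency, relative units
F_MAX = 2.0 * F_MIN    # top of the frequency range given on the slide
F_FLOOR = 0.5 * F_MIN  # quality-of-service lower bound (DFS limit)


def operating_frequency(caps):
    """Pick the frequency allowed by the tightest constraint.

    `caps` maps a constraint name to the highest frequency that
    constraint permits (hypothetical inputs, relative to fmin).
    Returns the chosen frequency and the name of the limiting constraint.
    """
    limiter, cap = min(caps.items(), key=lambda item: item[1])
    # Clamp into the legal range [1/2 fmin, 2 fmin].
    freq = max(F_FLOOR, min(F_MAX, cap))
    return freq, limiter


# Hypothetical caps: power delivery is the tightest, so it decides.
caps = {"process_vmax": 2.0, "total_power": 1.6, "power_delivery": 1.3}
freq, limiter = operating_frequency(caps)
print(f"run at {freq:.2f} x fmin, limited by {limiter}")
```

The clamp to ½ fmin reflects the quality-of-service bound listed above; how a real controller handles a cap below that floor is not stated in the slides.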
Why is VR a constraint? Simplified example
• Given a 16 core 100A shared power delivery
– Tying all cores together allows sharing current among cores
– Allow one core to consume all the current
• Assume we can split the same VR into 16
– Allow each core a fixed 100A/16
– Sharing is not possible
– Keeping capability requires 1,600A!
[Figure: one shared supply delivering current I to all cores, compared with 16 separate supplies each delivering I/16 to one core]
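The arithmetic behind the 1,600A figure can be checked in a few lines. This sketch assumes an even split of the regulator capacity and one core per regulator in the split case; the function names are illustrative only.

```python
NUM_CORES = 16
SHARED_VR_LIMIT_A = 100.0  # capacity of the single shared regulator (from the slide)


def peak_core_current(num_regulators, total_limit_a=SHARED_VR_LIMIT_A):
    """Largest current one core can draw when the same total capacity
    is split evenly across `num_regulators` independent regulators."""
    return total_limit_a / num_regulators


def capacity_to_keep_peak(num_regulators, peak_a=SHARED_VR_LIMIT_A):
    """Total installed capacity needed if every regulator must still
    deliver the full single-core peak on its own."""
    return num_regulators * peak_a


print(peak_core_current(1))              # 100.0 A: one core may take everything
print(peak_core_current(NUM_CORES))      # 6.25 A: fixed 100A/16 per core
print(capacity_to_keep_peak(NUM_CORES))  # 1600.0 A to preserve the 100 A peak
```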
Power delivery is constrained
• Need power delivery headroom for performance
• Replacing 1 VR by 16 individual VRs:
– Does not allow current sharing between cores
– Results in degraded power delivery
• New technologies:
– Need less area / volume, BUT
– Still deliver limited current
• More details in the paper
Modeling methodology
Workload construction
Hybrid model
• Offline characterization of a real CPU:
– Instrumented Intel® Core™-2 Duo for power and performance measurements
– Characterized the behavior of SPEC-2K traces
– Extracted DVFS parameters and V/F scaling
• Cycle-accurate simulation for FIFO impact
– 3 clocks in each direction
• Coded an analytic model to calculate performance
– Function of power, frequency and workload
Workload construction
• Typical Multi Threaded benchmarks insufficient
– Server or HPC centric
• Highly regular and uniform
– But client and cloud computing is non uniform
• We performed Monte-Carlo simulation
– Used SPEC-2K as an application pool
– Randomly assigned a subset of 16 threads to the cores
– Both fully and partially threaded studies
– Performed all studies on the same workload
– Repeated workload selection and analysis 200 times
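The Monte-Carlo workload selection described above might look roughly like the following. It is a sketch under stated assumptions: the pool holds only the SPEC int names that appear later in the deck (the talk's pool is larger, since it also mentions galgel), applications are drawn with replacement, and the seed and function names are made up for illustration.

```python
import random

# Assumed pool: the SPEC int applications named elsewhere in these slides.
APP_POOL = ["gzip", "vpr", "gcc", "mcf", "crafty", "parser",
            "eon", "perlbmk", "gap", "vortex", "bzip2", "twolf"]
NUM_CORES = 16
NUM_WORKLOADS = 200


def make_workload(rng, num_threads=NUM_CORES):
    """Return a list of 16 entries, one per core.

    Active cores hold an application name drawn at random (with
    replacement, an assumption); idle cores hold None.
    """
    apps = [rng.choice(APP_POOL) for _ in range(num_threads)]
    cores = apps + [None] * (NUM_CORES - num_threads)
    rng.shuffle(cores)  # place threads on random cores
    return cores


# A fixed seed keeps the workload set identical for every topology studied.
rng = random.Random(2009)
workloads = [make_workload(rng) for _ in range(NUM_WORKLOADS)]
partial = make_workload(rng, num_threads=4)  # a partially threaded case
```

Generating the set once and reusing it is what lets every topology be compared on the same workloads.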
Results
Baseline: Single Voltage and Clock DVFS
• 10-25% performance gain from use of power headroom
• Serves as baseline for the studies to follow
• 200 random workloads
• DVFS to lowest constraint
• Sorted by performance
• Shown relative performance
[Figure: "Baseline performance gain". Vertical axis: Performance [relative to base frequency], 100% to 130%. Horizontal axis: Workload, 1 to 200. Annotations: 140% = 16XCrafty; 100% = 16XGalgel.]
Different topologies - Fully threaded workloads
• Example with power supply capability of 150%
• Some workloads gain performance, some lose compared to baseline
– In contrast with previous studies – Assign budget asymmetrically
• 200 random workloads
• Oracle study
• Three topologies vs. baseline
• Each sorted independently
• Performance relative to baseline
[Figure: "Relative Performance". Vertical axis: Relative performance [%], -6% to 6%. Horizontal axis: Workloads (sorted), 20 to 180. Series: nVnC / 1V1C; 1VnC / 1V1C; nVnC / 1VnC. Annotations: "50% apps better perf"; "50% apps lose perf".]
1V – Single voltage domain
nV – Multiple Voltage domains
1C – Single Clock domain
nC – Multiple Clock domains
Partially threaded workload
• Fewer threads → higher benefit from shared power
[Figure: "Performance vs. Threads and policy", 250% headroom, Oracle study. Vertical axis: Performance, 110% to 160%. Horizontal axis: Number of threads (2T, 4T, 8T, 12T, 14T, 16T). Series: 1V1C; nVnC; 1VnC. Annotations: "Multi VR better"; "Single VR better".]
Gaining the best of both worlds: Clusters
• N clusters with 16/N cores each
• Sharing VR between cores in a cluster
• Setting optimal voltage frequency for each cluster
[Figure: example with four clusters, each supplied with I/4]
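To make the clustering idea concrete, the sketch below computes how much current each active thread can draw when the VR capacity is divided among N clusters and shared within a cluster. The even split between clusters and between the active threads of a cluster is an assumption for illustration, as are the names.

```python
CORES = 16


def per_thread_current(active_cores, num_clusters, total_current=1.0):
    """Current available to each active core, as a fraction of the total.

    Each cluster owns total_current / num_clusters and shares it evenly
    among its own active cores (assumed sharing rule).
    """
    if CORES % num_clusters:
        raise ValueError("cluster count must divide the core count")
    size = CORES // num_clusters           # cores per cluster (16/N)
    budget = total_current / num_clusters  # per-cluster VR capacity (I/N)
    share = {}
    for cluster in range(num_clusters):
        members = [c for c in active_cores if c // size == cluster]
        for core in members:
            share[core] = budget / len(members)
    return share


spread = [0, 4, 8, 12]  # four threads, one per cluster of four cores
packed = [0, 1, 2, 3]   # four threads, all in the first cluster

print(per_thread_current(spread, 4))   # each thread gets I/4
print(per_thread_current(spread, 16))  # per-core VRs: each thread gets I/16
print(per_thread_current(packed, 4))   # one cluster's I/4 split four ways
```

With one thread per cluster, each thread gets the same current as under a single shared VR while each cluster keeps its own voltage and frequency, which fits the case where the number of threads equals the number of clusters.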
Clusters
• Clustered topology almost equal to the best of both topologies
• Outperforms both when number of threads = number of clusters
[Figure: "Performance vs. Threads and policy", 250% headroom. Vertical axis: Performance, 110% to 160%. Horizontal axis: Number of threads (2T, 4T, 8T, 12T, 14T, 16T). Series: 1V1C; nVnC; nVnC-8C-SM. Annotation: "Cluster always the best".]
xT – X Threads
How to pick the best cluster size?
• Oracle study
• Compared to non-clustered (by workload)
• Calculated quadratic error from best topology
• Best scenarios highlighted
• “Diagonal behavior”
– More constrained power delivery → larger clusters

Topology    110%     130%     150%     200%     250%
1V1C        7.1%     11.4%    13.2%    14.8%    16.6%
1VnC        5.1%     9.0%     10.7%    12.4%    14.1%
nVnC-2C     28.6%    13.0%    14.1%    15.4%    17.5%
nVnC-4C     45.8%    14.7%    13.3%    12.2%    13.9%
nVnC-8C     55.6%    21.9%    16.5%    9.8%     7.6%

Columns – power delivery capability
Rows – number of clusters
Cells showing distance from Oracle (Smaller is better)
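The "diagonal behavior" can be read off the distance-from-Oracle figures on this slide by picking the smallest entry per power delivery capability. The snippet below transcribes those figures and assumes, following the slide's "Rows – number of clusters" note, that nVnC-xC denotes x clusters; the helper name is illustrative.

```python
# Distance from Oracle in percent, transcribed from this slide.
CAPABILITIES = ["110%", "130%", "150%", "200%", "250%"]
DISTANCE = {
    "1V1C":    [7.1, 11.4, 13.2, 14.8, 16.6],
    "1VnC":    [5.1, 9.0, 10.7, 12.4, 14.1],
    "nVnC-2C": [28.6, 13.0, 14.1, 15.4, 17.5],
    "nVnC-4C": [45.8, 14.7, 13.3, 12.2, 13.9],
    "nVnC-8C": [55.6, 21.9, 16.5, 9.8, 7.6],
}


def best_topology(capability, candidates=None):
    """Topology with the smallest distance from Oracle at one capability."""
    column = CAPABILITIES.index(capability)
    names = candidates or list(DISTANCE)
    return min(names, key=lambda name: DISTANCE[name][column])


clustered = [name for name in DISTANCE if name.startswith("nVnC-")]
for cap in CAPABILITIES:
    print(cap, best_topology(cap), best_topology(cap, clustered))
```

Among the clustered options alone, the best choice moves from 2 clusters at 110% and 130%, to 4 clusters at 150%, to 8 clusters at 200% and 250%: fewer, larger clusters as power delivery gets tighter.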
Summary
• Power delivery is a major CPU perf. constraint
– Overlooked by previous works
– Multiple voltage domains do not allow power sharing
– Lightly threaded workloads are most constrained
• Clustered topology mitigates sharing limitations
– Allows sharing power within subsets of cores
– Optimal cluster size: function of power delivery capability
• Explored non-uniform workloads
– Different application types
– Partially vs. fully threaded workloads
Thank You
Run time policies
• Policy to:
– Evaluate run time parameters and select frequency
• Three control functions
– Input: power or scalability
– Compute: frequency for each core
• Scale each domain to lowest constraint (e.g. power delivery, max freq)
• Calculated quadratic error from Oracle results
[Figure: three panels, each plotting Freq. against Input – Power / Scalability: Greedy (Winner Takes All); Linear (linear dependency); Polynomial (3 parameters)]
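Two of the three control functions can be sketched as plain mappings from a per-core input (here scalability) to a per-core frequency. This is an illustration only: the slides do not define the functions precisely, so reading "WTA 50%" as "the top 50% of cores by input get the high frequency" is an assumption, as are the frequency levels and function names. The subsequent step of scaling each domain down to its lowest constraint is not modeled.

```python
def winner_takes_all(inputs, winners_fraction=0.5, f_low=1.0, f_high=1.3):
    """Greedy sketch: cores with the highest input get f_high, the rest f_low.

    `winners_fraction` is an assumed reading of labels such as "WTA 50%".
    """
    n_winners = max(1, int(len(inputs) * winners_fraction))
    ranked = sorted(range(len(inputs)), key=lambda i: inputs[i], reverse=True)
    winners = set(ranked[:n_winners])
    return [f_high if i in winners else f_low for i in range(len(inputs))]


def linear(inputs, f_low=1.0, f_high=1.3):
    """Linear sketch: frequency grows in proportion to the input (0 to 1)."""
    return [f_low + (f_high - f_low) * x for x in inputs]


# Scalability values for gzip, mcf, crafty and vpr from the backup slides.
scal = [0.95, 0.30, 0.99, 0.68]
print(winner_takes_all(scal))  # [1.3, 1.0, 1.3, 1.0]
print(linear(scal))
```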
Run time policy results
• Winning policy is a greedy (WTA) based on scalability
– Very close to Oracle
• Random and power based policies are not good policies

Policy               1VnC Max    1VnC Average    nVnC Max    nVnC Average
WTA 50%              5.84%       1.3%            2.90%       0.8%
WTA 33%              4.41%       0.6%            3.37%       0.8%
WTA 10%              1.23%       0.0%            4.63%       1.7%
WTA by Power 50%     22.76%      6.9%            4.60%       2.3%
Linear by SCA        9.60%       6.1%            2.72%       1.5%
Linear by power      49.76%      36.6%           5.77%       3.8%
Polynomial by SCA    5.23%       3.3%            3.58%       1.5%
Random               33.28%      19.9%           8.66%       4.3%

Distance from Oracle (Smaller is better)
WTA – Winner Take All
SCA - Scalability
Workload characterization

SPEC int    A: Scaled Power    B: Perf. Scaling with freq.    C: FIFO impact
gzip        48%                0.95                           0.13%
vpr         44%                0.68                           2.92%
gcc         35%                0.67                           0.92%
mcf         49%                0.30                           2.92%
crafty      33%                0.99                           0.59%
parser      60%                0.78                           1.29%
eon         42%                0.99                           0.00%
perlbmk     50%                1.00                           0.31%
gap         45%                0.56                           1.14%
vortex      60%                0.73                           1.45%
bzip2       49%                0.70                           0.71%
twolf       97%                0.99                           4.68%
Int_rate    51%                0.77                           1.42%

A – Scaled power
• Measured total CPU power
– Scaled power = (Workload Power)/(Max Power)
– Results 33%-100%
• Leakage + idle is ~30%
– Most applications use less than 100% power
• Even at Vmax, fmax they consume less than Imax
• Reason: Not all parts of the CPU are utilized
B – Performance scaling with frequency
• Measured score at two frequencies
• Scalability = ΔPerf/ΔFrequency
– Result 0%-100%
• Low → Memory bound
• High → CPU bound
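The two metrics defined on this slide can be written out directly. The sketch below takes ΔPerf and ΔFrequency as relative changes between the two measured frequencies, and adds a first-order speedup projection; that projection, the sample scores and the function names are assumptions for illustration, not the authors' analytic model.

```python
def scaled_power(workload_power, max_power):
    """Scaled power = workload power / max power."""
    return workload_power / max_power


def scalability(perf_low, perf_high, freq_low, freq_high):
    """Scalability = relative performance change / relative frequency change."""
    d_perf = perf_high / perf_low - 1.0
    d_freq = freq_high / freq_low - 1.0
    return d_perf / d_freq


def projected_speedup(scal, freq_ratio):
    """Assumed first-order model: speedup = 1 + scalability * (freq_ratio - 1)."""
    return 1.0 + scal * (freq_ratio - 1.0)


# Hypothetical scores at 2.0 and 2.4 GHz for a memory-bound application:
# a 20% frequency increase yields only a 6% score increase.
print(scalability(100.0, 106.0, 2.0, 2.4))

# Using the table's values: mcf (0.30) versus crafty (0.99) at +20% frequency.
print(projected_speedup(0.30, 1.2))
print(projected_speedup(0.99, 1.2))
```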
Workload characterization
C – FIFO impact (same table as on the previous slide)
• Used cycle accurate simulation to evaluate FIFO impact / application
• All studies are average over the entire run, not accounting for variance over time
• Study applies also to phases in workload
Some DVFS model details
All models are built with relative values and not absolute voltages, freq. or performance
From min Vcc – linear scaling of frequency only
[Figure: "Leakage vs. Voltage". Vertical axis: Relative Leakage, 0.20 to 1.10. Horizontal axis: Vcc [relative], 0.60 to 1.05. Series: Leakage; X^3 Approximation.]
[Figure: "Frequency as a function of V_gate". Vertical axis: Freq [GHz], 1.5 to 4. Horizontal axis: Voltage [V], 0.60 to 1.20. Series: Linear freq; Actual Freq.]
[Figure: power to frequency. Vertical axis: Frequency [%], 0% to 120%. Horizontal axis: Power [%], 0% to 120%. Series: Power to Freq, with power-law fit y = 1.0102 x^0.4414, R^2 = 0.9986.]
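The two curve fits named on this slide can be evaluated numerically. This sketch assumes the cubic leakage approximation is normalized to 1.0 at relative Vcc = 1.0, and that in the power-law fit x is relative power and y is relative frequency (as the chart's axis labels suggest); the function names are illustrative.

```python
def relative_leakage(vcc_rel):
    """Cubic approximation of leakage versus relative supply voltage
    (assumed normalized so that leakage is 1.0 at vcc_rel = 1.0)."""
    return vcc_rel ** 3


def frequency_from_power(power_rel, coeff=1.0102, exponent=0.4414):
    """Power-law fit from the chart: frequency = coeff * power ** exponent,
    with both quantities relative to the nominal point."""
    return coeff * power_rel ** exponent


# Relative frequency the fit suggests for a few power budgets.
for power in (1.1, 1.5, 2.5):
    print(f"power {power:.0%} -> frequency {frequency_from_power(power):.1%}")

print(relative_leakage(0.6))  # leakage at 60% of nominal Vcc
```

The exponent below 0.5 shows why extra power buys relatively little frequency: under this fit, a 50% larger power budget yields roughly a 20% higher frequency.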
Workload characteristics – few observations
• Application power is distributed around ~60% of max power
– Min 33% - Leakage + idle power
– Very few apps reach 100%
• Scalability is evenly distributed
• No correlation found between power and scalability
– OOO characteristics
– Simpler core is expected to show positive correlation
• Random pick of 16 cores:
– Tighter overall power distribution
– Very low probability for all applications being high or low power
[Figure: "Application Power distribution". Vertical axis: Application count. Horizontal axis: power, 0% to 120%. Series: Apps power distribution; Norm Dist (probability).]
[Figure: "Performance Scaling Score vs. Power". Vertical axis: Scaling [Perf/freq], 0.00 to 1.20. Horizontal axis: Power [% of max], 0% to 120%.]
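The "tighter overall power distribution" for a random pick of 16 cores follows from averaging: the mean of 16 random draws varies much less than a single draw. A small simulation using the SPEC int scaled-power values from the workload characterization slide illustrates this; drawing with replacement, the seed and the 200 trials are assumptions made for the example.

```python
import random
import statistics

# Scaled power per SPEC int application (gzip ... twolf), from the backup table.
SCALED_POWER = [0.48, 0.44, 0.35, 0.49, 0.33, 0.60,
                0.42, 0.50, 0.45, 0.60, 0.49, 0.97]

rng = random.Random(16)

# Average power of a 16-core chip running 16 randomly picked applications.
chip_means = [statistics.mean(rng.choices(SCALED_POWER, k=16))
              for _ in range(200)]

single_spread = statistics.pstdev(SCALED_POWER)  # spread of one application
chip_spread = statistics.pstdev(chip_means)      # spread of the 16-core average

print(f"single application spread: {single_spread:.3f}")
print(f"16-core average spread:    {chip_spread:.3f}")
```

For independent draws the spread of a 16-sample average is about one quarter of the single-sample spread, which is why workloads made up entirely of high-power or entirely of low-power applications are rare.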
Why is VR constraint - physics
[Figure: power delivery path with labels Battery; Bulk Cap.; Controller; Drivers; Inductors; CPU; GFX. Annotation: "Need close proximity".]
Overview
• How to best architect and manage clock and voltage domains of a CMP to achieve max performance under power constraints
• Contributions:
– Power delivery constrains DVFS
• Multi-voltage domains not so easy
– Methodology to evaluate CMP workloads
– Clustered voltage and clock domains
Work Overview - scope
• 16 core Power constrained CMP
• 1 thru 16 voltage regulators (VR) and clock domains
– Either on chip or off chip VR
• Independent clock domains require a FIFO buffer → increased latency
Best topology? Optimal policy? Under constraints
[Figure: CPU block diagram with labels PMU; DC/DC VR #1 to #n; Core PE #1 to #n; FIFO Buffer; Interconnect; L2 Cache; I/O and Memory]