Multiple Clock and Voltage Domains for Chip Multi Processors
Efraim Rotem, Intel Corporation, Israel
Ran Ginosar, Technion, Israel
Avi Mendelson, Microsoft R&D, Israel
Uri Weiser, Technion, Israel
December 2009
Dec-2009 Chip with Multiple Clock and Voltage Domains

Compute Performance matters
• We would like to keep on providing performance
  – Power is the #1 limiter
[Figure: performance growth on a log scale (1 to 10,000) over 1978 to 2006, with power levels of 1 W, 10 W and 100 W marked. Annotations: "Fueled by a combination of process and arch"; "Both process technology and ILP slow down"; "Multi core architectures: an order of magnitude more power efficient but deep in the power wall". Source: Dave Patterson]

Work Overview - scope
• How to best architect and manage clock and voltage domains of a CMP to maximize performance under power constraints
• 16-core power-constrained CMP
• 1 through 16 voltage regulators (VR)
  – Either on-chip or off-chip VR
• 1 through 16 clock domains
  – FIFO buffers increase latency
• Paper contributions:
  – Power delivery constrains DVFS
    • Multi-voltage domains are not so easy
  – Methodology to evaluate CMP workloads
  – Clustered voltage and clock domains
[Figure: CPU block diagram showing a PMU, one DC/DC VR per core (PE) #1 to #n, FIFO buffer, interconnect, L2 cache, I/O and memory]

Operation point and constraints
• Process technology voltages
  – Voltage range Vmin – Vmax
  – Frequency range fmin – 2fmin
  – Nominal working point Vmin, fmin
• Lower bound on quality of service
  – Frequency: DFS down to ½ fmin
• Total power is a constraint
  – Do not exceed nominal power
• Power delivery has been added as a constraint
• Most constraining parameter wins

Why is VR a constraint?
Simplified example
• Given a 16-core, 100 A shared power delivery
  – Tying all cores together allows sharing current among cores
  – Allows one core to consume all the current
• Assume we can split the same VR into 16
  – Allows each core a fixed 100 A / 16
  – Sharing is not possible
  – Keeping the same per-core capability requires 1,600 A!
[Figure: one shared VR supplying current I to all cores vs. 16 VRs supplying I/16 to each core]

Power delivery is constrained
• Need power delivery headroom for performance
• Replacing 1 VR by 16 individual VRs:
  – Does not allow current sharing between cores
  – Results in degraded power delivery
• New technologies:
  – Need less area / volume, BUT
  – Still deliver limited current
• More details in the paper

Modeling methodology
Workload construction

Hybrid model
• Offline characterization of a real CPU:
  – Instrumented Intel® Core™ 2 Duo for power and performance measurements
  – Characterized the behavior of SPEC-2K traces
  – Extracted DVFS parameters and V/F scaling
• Cycle-accurate simulation for FIFO impacts
  – 3 clocks in each direction
• Coded analytic model to calculate performance
  – Function of power, frequency and workload

Workload construction
• Typical multi-threaded benchmarks are insufficient
  – Server or HPC centric
    • Highly regular and uniform
  – But client and cloud computing is non-uniform
• We performed a Monte Carlo simulation
  – Used SPEC-2K as an application pool
  – Randomly assigned a subset of 16 threads to the cores
  – Both fully and partially threaded studies
  – Performed all studies on the same workload
  – Repeated workload selection and analysis 200 times

Results

Baseline: Single Voltage and Clock DVFS
• 10-25% performance gain from use of power headroom
• Serves as baseline for the studies to follow
• 200 random workloads
• DVFS to lowest constraint
• Sorted by performance
• Shown as relative performance
[Figure: "Baseline performance gain": performance relative to base frequency (100% to 130%) for workloads 1 to 200, sorted. Annotations: "140% = 16×Crafty"; "100% = 16×Galgel"]

Different topologies - Fully threaded workloads
• Example with power supply capability of 150%
• Some workloads gain performance, some lose compared to baseline
  – In contrast with previous studies
  – Assign budget asymmetrically
• 200 random workloads
• Oracle study
• Three topologies vs. baseline
• Each sorted independently
• Performance relative to baseline
[Figure: "Relative Performance": relative performance (-6% to 6%) vs. workloads (sorted). Series: nVnC / 1V1C, 1VnC / 1V1C, nVnC / 1VnC. Annotations: "50% apps better perf"; "50% apps lose perf"]
Legend: 1V – single voltage domain; nV – multiple voltage domains; 1C – single clock domain; nC – multiple clock domains

Partially threaded workload
• Fewer threads: higher benefit from shared power
• Oracle study
[Figure: "Performance vs. Threads and policy", 250% headroom: performance (110% to 160%) vs. number of threads (2T, 4T, 8T, 12T, 14T, 16T). Series: 1V1C, nVnC, 1VnC. Annotations: "Multi VR better"; "Single VR better"]
Legend: 1V – single voltage domain; nV – multiple voltage domains; 1C – single clock domain; nC – multiple clock domains

Gaining the best of both worlds: Clusters
• N clusters with 16/N cores each
• Sharing a VR between the cores in a cluster
• Setting optimal voltage and frequency for each cluster
[Figure: four clusters, each supplied with I/4]

Clusters
• Clustered topology is almost equal to the best of both topologies
• Outperforms both when number of threads = number of clusters
[Figure: "Performance vs. Threads and policy", 250% headroom: performance (110% to 160%) vs. number of threads (2T, 4T, 8T, 12T, 14T, 16T). Series: 1V1C, nVnC, nVnC-8C-SM. Annotation: "Cluster always the best"]
Legend: 1V – single voltage domain; nV – multiple voltage domains; 1C – single clock domain; nC – multiple clock domains; xT – X threads

How to pick the best cluster size?
• Oracle study
• Compared to non-clustered (by workload)
• Calculated quadratic error from best topology
• Best scenarios highlighted
• "Diagonal behavior"
  – More constrained power delivery: larger clusters

Topology    110%     130%     150%     200%     250%
1V1C         7.1%    11.4%    13.2%    14.8%    16.6%
1VnC         5.1%     9.0%    10.7%    12.4%    14.1%
nVnC-2C     28.6%    13.0%    14.1%    15.4%    17.5%
nVnC-4C     45.8%    14.7%    13.3%    12.2%    13.9%
nVnC-8C     55.6%    21.9%    16.5%     9.8%     7.6%

Columns – power delivery capability
Rows – number of clusters
Cells show distance from Oracle (smaller is better)

Summary
• Power delivery is a major CPU performance constraint
  – Overlooked by previous works
  – Multiple voltage domains do not allow power sharing
  – Lightly threaded workloads are most constrained
• Clustered topology mitigates sharing limitations
  – Allows sharing power within subsets of cores
  – Optimal cluster size: function of power delivery capability
• Explored non-uniform workloads
  – Different application types
  – Partially vs. fully threaded workloads

Thank You

Run time policies
• Policy to:
  – Evaluate run-time parameters and select frequency
• Three control functions
  – Input: power or scalability
  – Compute: frequency for each core
• Scale each domain to its lowest constraint (e.g. power delivery, max freq)
• Calculated quadratic error from Oracle results
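The slides do not spell out how the greedy (winner takes all) control function ranks cores or what its percentage setting means, so the following is only a minimal sketch of one plausible reading: the most scalable cores get the maximum frequency, the rest get the minimum, and every domain is then clamped to its most constraining limit. The names `winner_takes_all`, `clamp_to_constraints` and `winner_fraction` are hypothetical, and the scalability values are illustrative.

```python
from typing import List, Sequence


def winner_takes_all(scalability: Sequence[float],
                     f_min: float,
                     f_max: float,
                     winner_fraction: float) -> List[float]:
    """Give f_max to the most scalable cores and f_min to the rest.

    winner_fraction is a hypothetical knob (0.5 means "top half");
    the slides do not define how the policy chooses its winners.
    """
    n = len(scalability)
    n_winners = max(1, int(n * winner_fraction))
    # Rank core indices from most to least scalable.
    ranked = sorted(range(n), key=lambda i: scalability[i], reverse=True)
    winners = set(ranked[:n_winners])
    return [f_max if i in winners else f_min for i in range(n)]


def clamp_to_constraints(freq: float, limits: Sequence[float]) -> float:
    """Scale a domain down to its lowest (most constraining) limit."""
    return min([freq, *limits])


# Four cores with illustrative scalability values; frequencies are
# expressed relative to f_min, matching the range f_min to 2*f_min.
freqs = winner_takes_all([0.95, 0.30, 0.99, 0.68],
                         f_min=1.0, f_max=2.0, winner_fraction=0.5)
print(freqs)  # [2.0, 1.0, 2.0, 1.0]

# A domain that asks for 2.0 but is limited by two constraints.
print(clamp_to_constraints(2.0, limits=[1.6, 1.8]))  # 1.6
```

A linear or polynomial control function would replace the two-level output with a continuous mapping from the input to a frequency; the clamping step would stay the same.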
[Figure: the three control functions, each plotting frequency against the input (power or scalability): greedy (winner takes all), linear dependency, 3-parameter polynomial]

Run time policy results
• Winning policy is a greedy (WTA) policy based on scalability
  – Very close to Oracle
• Random and power-based policies are not good policies

Policy                1VnC Max   1VnC Avg   nVnC Max   nVnC Avg
WTA 50%                 5.84%      1.3%       2.90%      0.8%
WTA 33%                 4.41%      0.6%       3.37%      0.8%
WTA 10%                 1.23%      0.0%       4.63%      1.7%
WTA by power 50%       22.76%      6.9%       4.60%      2.3%
Linear by SCA           9.60%      6.1%       2.72%      1.5%
Linear by power        49.76%     36.6%       5.77%      3.8%
Polynomial by SCA       5.23%      3.3%       3.58%      1.5%
Random                 33.28%     19.9%       8.66%      4.3%

Distance from Oracle (smaller is better)
WTA – Winner Take All; SCA – Scalability

Workload characterization

SPEC int    A: Scaled power   B: Perf. scaling with freq.   C: FIFO impact
gzip             48%                 0.95                       0.13%
vpr              44%                 0.68                       2.92%
gcc              35%                 0.67                       0.92%
mcf              49%                 0.30                       2.92%
crafty           33%                 0.99                       0.59%
parser           60%                 0.78                       1.29%
eon              42%                 0.99                       0.00%
perlbmk          50%                 1.00                       0.31%
gap              45%                 0.56                       1.14%
vortex           60%                 0.73                       1.45%
bzip2            49%                 0.70                       0.71%
twolf            97%                 0.99                       4.68%
Int_rate         51%                 0.77                       1.42%

A: Scaled power
• Measured total CPU power
  – Scaled power = (Workload Power)/(Max Power)
  – Results: 33%-100%
• Leakage + idle is ~30%
  – Most applications use less than 100% power
    • Even at Vmax, fmax they consume less than Imax
    • Reason: not all parts of the CPU are utilized

B: Performance scaling with frequency
• Measured score at two frequencies
• Scalability = ΔPerf/ΔFrequency
  – Result: 0%-100%
• Low: memory bound
• High: CPU bound

C: FIFO impact
• Used cycle-accurate simulation to evaluate FIFO impact per application
• All studies are averaged over the entire run, not accounting for variance over time
• Study also applies to phases in a workload

Some DVFS model details
• All models are built with relative values, not absolute voltages, frequencies or performance
• From min Vcc: linear scaling of frequency only
[Figure: "Leakage vs. Voltage": relative leakage (0.20 to 1.10) vs. relative Vcc (0.60 to 1.05). Series: leakage, X^3 approximation]
[Figure: "Frequency as a function of V_gate": frequency [GHz] (1.5 to 4) vs. voltage [V] (0.60 to 1.20). Series: linear freq, actual freq]
[Figure: power to frequency: frequency [%] vs. power [%] (0% to 120%), with power-law fit y = 1.0102 x^0.4414, R² = 0.9986]

Workload characteristics – few observations
• Application power is distributed around ~60% of max power
  – Min 33%: leakage + idle power
  – Very few apps reach 100%
• Scalability is evenly distributed
• No correlation found between power and scalability
  – OOO characteristics
  – A simpler core is expected to show positive correlation
• Random pick of 16 cores:
  – Tighter overall power distribution
  – Very low probability of all applications being high power or all being low power
[Figure: "Application power distribution": application count and fitted normal distribution vs. power (0% to 120%)]
[Figure: "Performance scaling score vs. power": scaling [perf/freq] (0.00 to 1.20) vs. power [% of max] (0% to 120%)]

Why is VR a constraint - physics
[Figure: power delivery components: battery, bulk capacitors, controller, drivers, inductors, CPU, GFX. Annotation: "Need close proximity"]

Overview
• How to best architect and manage clock and voltage domains of a CMP to achieve max performance under power constraints
• Contributions:
  – Power delivery constrains DVFS
    • Multi-voltage domains are not so easy
  – Methodology to evaluate CMP workloads
  – Clustered voltage and clock domains

Work Overview - scope
• 16-core power-constrained CMP
• 1 through 16 voltage regulators (VR) and clock domains
  – Either on-chip or off-chip VR
• Independent clock domains require a FIFO buffer: increased latency
• Best topology? Optimal policy? Under constraints
[Figure: CPU block diagram showing a PMU, one DC/DC VR per core (PE) #1 to #n, FIFO buffer, interconnect, L2 cache, I/O and memory]
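The Monte Carlo workload construction, and the observation that a random pick of 16 cores yields a tighter power distribution, can be illustrated with a short sketch. It is not the authors' tool: it draws only from the twelve SPECint scaled-power values in the workload characterization table (the slides use a broader SPEC-2K pool), it samples with replacement, and the function names and the fixed seed are assumptions made for reproducibility.

```python
import random
import statistics

# Scaled power (fraction of max power) of the SPECint applications,
# copied from the workload characterization table.
SCALED_POWER = {
    "gzip": 0.48, "vpr": 0.44, "gcc": 0.35, "mcf": 0.49,
    "crafty": 0.33, "parser": 0.60, "eon": 0.42, "perlbmk": 0.50,
    "gap": 0.45, "vortex": 0.60, "bzip2": 0.49, "twolf": 0.97,
}


def sample_workload(rng: random.Random, cores: int = 16) -> list:
    """Pick one application per core, with replacement (an assumption)."""
    apps = list(SCALED_POWER)
    return [rng.choice(apps) for _ in range(cores)]


def mean_scaled_power(workload: list) -> float:
    """Average scaled power over the applications in one workload."""
    return statistics.mean(SCALED_POWER[app] for app in workload)


def monte_carlo(trials: int = 200, cores: int = 16, seed: int = 0) -> list:
    """Repeat the random workload selection and return each mean power."""
    rng = random.Random(seed)
    return [mean_scaled_power(sample_workload(rng, cores))
            for _ in range(trials)]


means = monte_carlo()
per_app_spread = statistics.pstdev(list(SCALED_POWER.values()))
per_workload_spread = statistics.pstdev(means)
print(f"spread across single applications: {per_app_spread:.3f}")
print(f"spread across 16-core workloads:   {per_workload_spread:.3f}")
```

Averaging 16 independent draws shrinks the spread by roughly a factor of four, which is why a workload made only of high-power or only of low-power applications is very unlikely.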