Epigraph, after Dijkstra's "Go To Statement Considered Harmful" (the slide strikes out Dijkstra's words and substitutes its own; shown here as [original → replacement]):

"For a number of years we have been familiar with the observation that the quality of [programmers → architecture researchers] is a decreasing function of the density of [go to statements → reliance on quantitative architecture simulators] in the [programs → architecture papers] they produce. More recently we discovered why the use of [the go to statement → architecture simulators] has such disastrous effects, and we became convinced that the [go to statement → architecture simulator] should be abolished from all ["higher level" programming languages → architecture research]."

Slide 1: Title
gem5, GPGPUSim, McPAT, GPUWattch, "Your favorite simulator here" Considered Harmful
Tony Nowatzki, Jaikrishnan Menon, Chen-Han Ho, Karthikeyan Sankaralingam
[email protected]
University of Wisconsin - Madison
6/15/2014 (hate mail here please)

Slides 2-3: How do I best evaluate my idea?
• Options: Cycle-Accurate Simulation, Trace-Based Analysis, Program Analysis, Reasoned Arguments, Custom First-Order Models, Mathematical Proofs
• How do I best get it published? "Cycle-Accurate Simulation"

Slide 4: The Good, the Bad, and the Ugly

Slide 5: Outline

Slide 6: Pitfall 1: Simulator Errors, Inaccessible to Users
• [Figure: McPAT detailed power breakdown]

Slide 7: Pitfall 2: Simulator Tool Misuse

Slide 8: This Is How GPUWattch Actually Works
• [Diagram: GPGPUSim produces activity counts; a GPU McPAT model plus scaling factors (10-50×) produces the power distribution]
• The scaling factors are fitted to the GPU being measured.
• Changing GPGPUSim parameters would not yield a valid power model.

Slide 9: Pitfall 2, continued: Simulator Tool Misuse (GPUWattch)

Slides 10-11: Pitfall 3: Mixed Abstractions with Unknown Consequences
• [Diagram: GPGPUSim, with the Register File microarchitecture, Warp Scheduler, and Branch Stack highlighted]

Slide 12: Pitfall 4: Emerging Simulators (The Entire Core Microarchitecture)

Slide 13: Pitfall 5: The Trends Myth

Slide 14: Pitfall 5 Example
• Simulator bug: x86 gem5 always predicts branches inside macro-ops as not taken.
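The arithmetic behind this example is plain Amdahl's law. The sketch below is our own illustration (the function names are ours; the 100× region speedup and the 30%/300% benefit figures come from the slide): a simulator bug that inflates the baseline's unaccelerated region shrinks the accelerated region's share of runtime, and with it the reported overall benefit.

```python
# Amdahl's-law sketch of Pitfall 5 (illustration only; function names are ours).
# An accelerator speeds up one region by 100x. If a simulator bug inflates the
# unaccelerated region of the baseline, the accelerated region's *share* of
# runtime shrinks, and so does the reported overall benefit.

def overall_speedup(accel_fraction, accel_speedup=100.0):
    """Amdahl's law: overall speedup when accel_fraction of baseline time
    runs accel_speedup times faster."""
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_speedup)

def accel_fraction_for(speedup, accel_speedup=100.0):
    """Inverse: what fraction of baseline time must be accelerated to
    observe a given overall speedup?"""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / accel_speedup)

# 30% benefit (1.3x) under the buggy baseline vs. 300% (4.0x) after the fix:
print(round(accel_fraction_for(1.3), 3))  # accelerated share, buggy baseline
print(round(accel_fraction_for(4.0), 3))  # accelerated share, fixed baseline
```

Under these assumptions, fixing the branch-prediction bug raises the accelerated share of baseline time from roughly 23% to roughly 76%, which is the kind of shift the program trace on this slide depicts.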
• [Diagram: program trace, an Unaccelerated Region followed by an Accelerated Region (100× faster); after the fix, the Unaccelerated Region is much shorter]
• Accelerator benefit as originally simulated: 30%. Benefit after the error fix: 300%.

Slide 15: Pitfall 6: Poor Qualitative Findings from Data Inundation
• [Figure montage: Power vs. Performance scatter (Clock, Cache, MMU), Energy and ED^4 bar charts, CPI stacks, and a "Random Heat Map", across TPC, Media, Cloud, Spec, and NoC workloads]

Slide 16: Pitfall 7: Amplified Reviewer Noise

Slide 17: Reviewer Noise, Illustrated
• "I have a trace-based, empirical, regression-fitted, cycle-accurate model."
• "I'm using cycle-accurate simulation to evaluate my single-cycle-per-instruction architecture."
• "Will someone please publish my mathematical model?"

Slide 18: Outline

Slide 19: "Footprint"
• Stack layers: Algorithm, Application, Compiler, OS, IO, Mem. Controller, Caches, Core µarch, Circuits, Gates, Transistors, Physics.
• Ideas have a small, medium, or large footprint according to the span of stack layers they touch.
• "How do I evaluate my idea?": Cycle-Accurate Simulation, Trace-Based Analysis, Mathematical Proofs, Custom First-Order Models, Program Analysis, Reasoned Arguments.

Slide 20: The Footprint in Context
• Example papers placed against the stack layers: Conservation Cores, Stay Away From the Valley, Temporal Streaming, Memory Sampling, DMR.
• PC-required approach: "Cycle-Accurate Simulation", in every case.
• Appropriate approach: cycle-accurate simulation, mathematical models, or custom first-order models, depending on the footprint.
• What the authors actually did: cycle-accurate simulation, more than the PC required, or mathematical proof; one was originally not a conference publication.

Slide 21: A Few Humble Requests… (to Tool Developers, Authors, and Reviewers)

Slide 22: Opinions? http://sim-harmful.blogspot.com/

Slide 23: Thank you! http://sim-harmful.blogspot.com/

Slide 24: Backup Slides

Slide 25: Simulator Errors

Slide 26: gem5, Writeback Buffers
• Details:
  – [Pipeline diagram: Inst. Queue → Issue → FUs → Execute → "Writeback Buffer"]
  – Each instruction holds a writeback (WB) buffer slot during execute.
  – # Entries = WB_depth × WB_width; the default is WB_depth = 1.
• Issue: a default setting causes inappropriate slowdowns.
• Implications:
  – 5× performance loss in micro-benchmarks.
  – Up to 25% performance loss on Parsec workloads [Vamsi Krishna].

Slide 27: gem5, Pipeline Replay
• Issue: gem5's pipeline replay mechanism is unrealistic (conservative in some respects, optimistic in others).
• Details:
  – Deep OOO pipelines speculatively schedule instructions for back-to-back execution; variable latency disrupts that schedule.
  – Conservative: the entire OOO pipeline (not just dependent instructions) is repeatedly flushed on a cache miss.
  – Optimistic: a simple block on cache miss.
• Implications:
  – 5× performance loss in micro-benchmarks.

Slide 28: gem5, µop Inefficiencies
• Issue: flaws and inefficiencies in gem5's µops.
• Details (examples):
  – Instructions that read their destination register but don't have to (mov).
  – The same, but for the flags register (xor, and, or, …).
  – Instructions marked as requiring the FP unit that don't need it (ldfp, mov2fp).
  – SIMD FP instructions that aren't counted as FP (mulps, …).
• Implications:
  – Incorrect performance and energy estimates (especially for FP code).

Slide 29: McPAT, Fitted Values
• Issue: McPAT relies on some design-specific constants; users should take caution when generalizing.
• Details (examples):
  – A dynamic component of energy is added for FUs if the design is OOO.
  – Per-access FU energy is divided by 2 if the processor type is embedded.
  – The percentage of the chip using long-channel devices is set by values from Niagara vs. Xeon Tulsa.
• Implications:
  – Users can easily apply the tool outside its validated range.

Slide 30: McPAT, Pipeline & Clock Power
• Details:
  – McPAT models pipeline/clock power using average switching factors.
  – This power is reported by distributing it among pipeline stages.
  – A factor of the cycle count is lost for OOO processors in this calculation.
• Implications:
  – Incorrect power estimates for experiments longer than a few cycles.
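The shape of the accompanying figure can be reproduced with a toy model. This is our own guess at the arithmetic, not McPAT's actual code: if the OOO reporting path divides a fixed per-cycle pipeline power by the experiment's cycle count one extra time, the reported power decays as 1/cycles while the inorder path stays flat.

```python
# Toy model of the McPAT pipeline-power pitfall (our assumption, not McPAT code):
# suppose the OOO reporting path divides a fixed per-cycle pipeline power by the
# total cycle count once too often, so longer experiments report less power.

IDLE_PIPELINE_POWER_W = 1.4  # assumed per-cycle pipeline power, idle 65nm core

def reported_pipeline_power(total_cycles, ooo):
    if ooo:
        return IDLE_PIPELINE_POWER_W / total_cycles  # buggy: extra division
    return IDLE_PIPELINE_POWER_W                     # inorder: reported as-is

for cycles in (1, 10, 100):
    print(cycles,
          reported_pipeline_power(cycles, ooo=False),
          reported_pipeline_power(cycles, ooo=True))
```

At 1 cycle the two designs agree; by 100 cycles the OOO figure has collapsed by two orders of magnitude, matching the trend the figure shows.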
• Issue: McPAT drops pipeline and clock power for OOO designs.
• [Figure: McPAT pipeline power for a 65nm idle processor; pipeline power (watts, 0 to 1.6) vs. total experiment cycles (1 to 100); the inorder line stays flat while the OOO line falls toward zero]

Slide 31: GPGPUSim V2.x
• Issue: certain features of GPGPUSim V2.x are modeled only functionally and aren't appropriate for evaluating microarchitectural enhancements to them.
• Details:
  – Register file microarchitecture: the operand collector is modeled assuming fixed-latency SRAM accesses with some additional queuing latency (it lacks contention).
  – Thread/warp/wavefront scheduling and dispatch: different warp-scheduling schemes are implemented, but the schedules are generated functionally.
  – Branch divergence structures and branch unit: the tracking structures are functionally emulated as part of the abstract hardware model, and the branch-unit microarchitecture is not modeled at the cycle level.
• Implications:
  – It is possible to do research with this tool while remaining detached from the microarchitecture.
• [These issues are largely fixed in GPGPUSim V3.x.]

Slide 32: GPUWattch
1. In the methodology as presented, the McPAT modeling is mathematically irrelevant.
2. The implementation bounds the McPAT scaling factors (10-50×), so in practice it is not mathematically irrelevant, and the bounds enable a plausible power distribution.
3. Potential for inappropriate use:
   – [Diagram: Training: train against a measured GPU to obtain scaled activity factors. Usage: GPGPUSim feeds a modified McPAT, which reports component-wise power.]
   – Modifying the GPU configuration without re-training would yield invalid models.
   – The scaling factors are too high to trust that power will behave like CPU components.

Slide 33: GPUWattch
• Issue: following the methodology as presented will lead to "irrelevant modeling" errors.
• Details:
  – Power equation: P_bench = Σ_comp α_(bench,comp) × PMAX_comp × x_comp,
    where P_bench is the power estimate, α_(bench,comp) are activity factors, PMAX_comp is the component's max power, and x_comp is a scaling factor.
  – The authors determine the scaling factors by minimizing squared error against measured values
[i.e., linear regression].
  – Scaling by a freely fitted constant makes PMAX_comp meaningless; the model is equivalent to
    P_bench = Σ_comp α_(bench,comp) × x'_comp, where x'_comp = PMAX_comp × x_comp.
  – Implication: users of the GPUWattch methodology will waste effort on detailed modeling!

Slide 34: GPUWattch, A Clarification
• The GPUWattch authors actually bound the x_comp scaling factors:
    P_bench = Σ_comp α_(bench,comp) × PMAX_comp × x_comp, with x_min ≤ x_comp ≤ x_max.
• The factors are ≤ 10× for the core and ≤ 50× for the uncore.
• Therefore this is not a pure linear regression, and PMAX_comp is not mathematically irrelevant.
• In practice, these bounds keep the scaling factors closer to reality (not negative or extremely high).
• However, this leads to a potential "user error"…

Slide 35: GPUWattch
• Issue: the GPUWattch methodology restricts itself to GPUs with physically measurable artifacts.
• Details:
  – The scaling factors are unique to a particular platform.
  – GPUWattch relies on McPAT's scaling to capture effects between different GPU configurations.
  – Claim: relying on McPAT for this is highly questionable without evidence. Can McPAT get the absolute value so wrong (8× and 22× average scaling factors for the validated platforms), yet somehow get the relative scaling correct?
• Implication: GPUWattch is restricted to small-footprint research evaluation.
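The "irrelevant modeling" argument and the effect of the bounds can both be checked numerically. The sketch below uses synthetic data and made-up component maxima (it is not GPUWattch code): an unbounded least-squares fit absorbs PMAX_comp into the fitted factors, so any choice of PMAX yields identical power predictions, while clamping the factors to bounds (a crude stand-in for a true bounded least-squares fit) makes PMAX matter again.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic activity factors: 6 benchmarks x 3 components (made-up data).
alpha = rng.uniform(0.1, 1.0, size=(6, 3))
# "Measured" power generated from hidden per-component powers plus noise.
p_measured = alpha @ np.array([40.0, 25.0, 10.0]) + rng.normal(0.0, 0.5, 6)

def fit_predict(pmax):
    """Unbounded least squares for x in P = sum(alpha * PMAX * x)."""
    A = alpha * pmax                          # scale each column by PMAX_comp
    x, *_ = np.linalg.lstsq(A, p_measured, rcond=None)
    return A @ x

def fit_predict_bounded(pmax, lo=0.0, hi=10.0):
    """Same fit, but factors clamped (crude stand-in for a bounded fit)."""
    A = alpha * pmax
    x, *_ = np.linalg.lstsq(A, p_measured, rcond=None)
    return A @ np.clip(x, lo, hi)

ones = np.array([1.0, 1.0, 1.0])
other = np.array([5.0, 2.0, 80.0])            # a wildly different PMAX guess

print(np.allclose(fit_predict(ones), fit_predict(other)))
print(np.allclose(fit_predict_bounded(ones), fit_predict_bounded(other)))
```

The first comparison comes out equal because rescaling a column of the design matrix is exactly undone by the fitted coefficient; the second differs because the clamp prevents the fit from fully compensating, which is why the bounded GPUWattch formulation genuinely depends on McPAT's PMAX values.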