Performance Technology for Productive, High-End Parallel Computing
Allen D. Malony ([email protected])
Department of Computer and Information Science
Performance Research Laboratory, University of Oregon
LACSI 2004

Outline of Talk
• Research motivation
• Scalability, productivity, and performance technology
• Application-specific and autonomic performance tools
• TAU parallel performance system developments
• Application performance case studies
• New project directions
• Discussion

Research Motivation
• Tools for performance problem solving
• Empirical-based performance optimization process
• Performance technology concerns
• [Diagram: performance optimization cycle — Performance Tuning → (hypotheses) Performance Diagnosis → (properties) Performance Experimentation → (characterization) Performance Observation — supported by performance technology for experiment management and performance databases, and for instrumentation, measurement, analysis, and visualization]

Problem Description
• How does our view of this process change when we consider very large-scale parallel systems?
• What are the significant issues that will affect the technology used to support the process?
• Parallel performance observation is clearly needed
• In general, there is the concern for intrusion
• Seen as a tradeoff with performance diagnosis accuracy
• Scaling complicates observation and analysis
• The nature of application development may change
• Paradigm shift in performance process and technology?
• What will enhance productive application development?
Scaling and Performance Observation
• Consider "traditional" measurement methods
• More parallelism → more performance data overall
• Profiling: summary statistics calculated during execution
• Tracing: time-stamped sequence of execution events
• Performance specific to each thread of execution
• Possible increase in the number of interactions between threads
• Harder to manage the data (memory, transfer, storage)
• How does per-thread profile size grow?
• Instrumentation more difficult with greater parallelism?
• More parallelism / performance data → harder analysis
• More time-consuming to analyze and difficult to visualize

Concern for Performance Measurement Intrusion
• Performance measurement can affect the execution
• Problems exist even with a small degree of parallelism
• Intrusion is an accepted consequence of standard practice
• Consider intrusion (perturbation) from trace buffer overflow
• Scale exacerbates the problem … or does it?
• Perturbation of "actual" performance behavior
• Minor intrusion can lead to major execution effects
• Traditional measurement techniques tend to be localized
• Suggests scale may not compound local intrusion globally
• Measuring parallel interactions likely will be affected
• Use accepted measurement techniques intelligently

Role of Intelligence and Specificity
• How to make the process more effective (productive)?
• Scale forces performance observation to be intelligent
• What are the important performance events and data?
• Standard approaches deliver a lot of data with little value
• Tied to application structure and computational mode
• Tools have poor support for application-specific aspects
• Process and tools can be more application-aware
• Will allow scalability issues to be addressed in context
• More control and precision of performance observation
• More guided performance experimentation / exploration
• Better integration with application development

Role of Automation and Knowledge Discovery
• Even with intelligent and application-specific tools, the decisions of what to analyze may become intractable
• Scale forces the process to become more automated
• Performance extrapolation must be part of the process
• Build autonomic capabilities into the tools
• Support broader experimentation methods and refinement
• Access and correlate data from several sources
• Automate performance data analysis / mining / learning
• Include predictive features and experiment refinement
• Knowledge-driven adaptation and optimization guidance
• Address scale issues through increased expertise

TAU Parallel Performance System Goals
• Multi-level performance instrumentation
• Multi-language automatic source instrumentation
• Flexible and configurable performance measurement
• Widely-ported parallel performance profiling system
• Computer system architectures and operating systems
• Different programming languages and compilers
• Support for multiple parallel programming paradigms
• Multi-threading, message passing, mixed-mode, hybrid
• Support for performance mapping
• Support for object-oriented and generic programming
• Integration in complex software, systems, applications

TAU Performance System Architecture
• [Architecture diagrams]

TAU Instrumentation Advances
• Source instrumentation with the Program Database Toolkit (PDT)
• Automated Fortran 90/95 support (Flint parser, very robust)
• Statement-level support in C/C++ (Fortran soon)
• TAU_COMPILER to automate the instrumentation process
• Automatic proxy generation for component applications
• Automatic CCA component instrumentation
• Python instrumentation and automatic instrumentation
• Continued integration with dynamic instrumentation
• Update of OpenMP instrumentation (POMP2)
• Selective instrumentation and overhead reduction
• Improvements in performance mapping instrumentation

TAU Measurement Advances
• Profiling
• Memory profiling: global heap memory tracking (several options)
• Callpath profiling: user-controllable calling depth
• Improved support for multiple counter profiling
• Online profile access and sampling
• Tracing
• Generation of VTF3 trace files (portable)
• Inclusion of hardware performance counts in trace files
• Hierarchical trace merging
• Online performance overhead compensation
• Component software proxy generation and monitoring

TAU Performance Analysis Advances
• Enhanced parallel profile analysis (ParaProf)
• Performance Data Management Framework (PerfDMF)
• Callpath analysis integration in ParaProf
• Integration with Vampir Next Generation (VNG)
• Online trace analysis (first release of prototype)
• Performance visualization (ParaVis) prototype
• Component performance modeling and QoS

Component-Based Scientific Applications
• How to support a performance analysis and tuning process consistent with the application development methodology?
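One answer explored in this talk is interposing a measurement proxy at each component port. A minimal sketch of that idea in Python follows; all names here (`MeasurementProxy`, `SolverPort`) are hypothetical illustrations, not the actual TAU/CCA proxies, which are generated automatically from static analysis:

```python
# Sketch of port-proxy measurement (hypothetical names; real CCA proxies
# are generated automatically, e.g. from PDT static analysis).
# The proxy forwards calls to the real port, recording counts and times.
import time
from collections import defaultdict

class MeasurementProxy:
    """Wraps a component port; times every method invocation."""

    def __init__(self, port):
        self._port = port
        self.stats = defaultdict(lambda: [0, 0.0])  # method -> [calls, seconds]

    def __getattr__(self, name):
        target = getattr(self._port, name)
        def timed(*args, **kwargs):
            start = time.perf_counter()
            try:
                return target(*args, **kwargs)   # forward to the real port
            finally:
                entry = self.stats[name]
                entry[0] += 1                    # invocation count
                entry[1] += time.perf_counter() - start
        return timed

# Hypothetical port used by some caller component:
class SolverPort:
    def solve(self, n):
        return sum(i * i for i in range(n))

proxy = MeasurementProxy(SolverPort())
proxy.solve(10000)
proxy.solve(10000)
calls, seconds = proxy.stats["solve"]
print(calls)  # 2 invocations recorded by the proxy
```

The point of the design is that measurement stays non-intrusive to the components themselves: caller and callee are unchanged, and only the interposed proxy observes invocations and timings.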
• Common Component Architecture (CCA) applications
• Performance tools should integrate with the software
• Design a performance observation component
• Measurement port and measurement interfaces
• Build support for application component instrumentation
• Interpose a proxy component for each port
• Inside the proxy, track caller/callee invocations and timings
• Automate the process of proxy component creation
• Use PDT for static analysis of components
• Include support for selective instrumentation

Flame Reaction-Diffusion (Sandia, J. Ray)
• CCAFFEINE

Component Modeling and Optimization
• Given a set of components, where each component has multiple implementations, what is the optimal subset of implementations that solves a given problem?
• How to model a single component?
• How to model a composition of components?
• How to select the optimal subset of implementations?
• A component only has performance meaning in context
• Applications are dynamically composed at runtime
• Application developers use components from others
• Instrumentation may only be at component interfaces
• Performance measurements need to be non-intrusive
• Users are interested in coarse-grained performance

MasterMind Component (Trebon, IPDPS 2004)

Proxy Generator for Other Applications
• PDT-based proxy components for:
• QoS tracking [Boyana, ANL]
• Debugging Port Monitor for CCA (tracks arguments)
• SCIRun2 Perfume components [Venkat, U. Utah]
• Exploring Babel for auto-generation of proxies:
• Direct SIDL-to-proxy code generation
• Generating client component interface in C++
• Using PDT for generating proxies

Earth System Modeling Framework (ESMF)
• Coupled modeling with a modular software framework
• Instrumentation for framework and applications
• PDT automatic instrumentation (Fortran 95, C / C++)
• Component instrumentation (using CCA components)
• MPI wrapper library for MPI calls
• CCA measurement port manual instrumentation
• Proxy generation using PDT and runtime interposition
• Significant callpath profiling use by the ESMF team

Using TAU Component in ESMF/CCA

TAU's ParaProf Profile Browser (ESMF Data)
• Callpath profile

TAU Traces with Counters (ESMF)

Visualizing TAU Traces with Counters/Samples

Uintah Computational Framework (UCF)
• University of Utah, Center for Simulation of Accidental Fires and Explosions (C-SAFE), DOE ASCI center
• UCF analysis: scheduling, MPI library, components
• Performance mapping
• Use for online and offline visualization (ParaVis tools)
• 500 processes

Scatterplot Displays
• Each point coordinate determined by three values: MPI_Reduce, MPI_Recv, MPI_Waitsome
• Min/max value range
• Effective for cluster analysis
• Relation between MPI_Recv and MPI_Waitsome

Online Uintah Performance Profiling
• Demonstration of online profiling capability
• Colliding elastic disks: test material point method (MPM) code
• Executed on 512 processors of ASCI Blue Pacific at LLNL
• Example: bar graph visualization, MPI execution time, performance mapping, multiple time steps

Miranda Performance Analysis (Miller, LLNL)
• Miranda is a research hydrodynamics code
• Mostly synchronous; Fortran 95, MPI
• MPI_ALLTOALL on Np x,y communicators
• Some MPI reductions and broadcasts for statistics
• Good communications scaling
• ACL and MCR: sibling Linux clusters, ~1000 Intel P4 nodes, dual 2.4 GHz
• Up to 1728 CPUs
• Fixed workload per CPU
• Ported to BlueGene/L

TAU Profiling of Miranda on BG/L
• Miranda team is using TAU to profile code performance
• Routinely runs on BG/L with 1000 CPUs for 10-20 minutes
• Scaling studies (problem size, number of processors)
• 128, 512, and 1024 nodes

Fine-Grained Profiling via Tracing
• Miranda uses TAU to generate traces
• Combines MPI calls with hardware counter information
• Detailed code behavior to focus optimization efforts

Memory Usage Analysis
• BG/L will have limited memory per node (512 MB)
• Miranda uses TAU to profile memory usage
• Streamlines code
• Squeezes larger problems onto the machine
• Max heap memory (KB) used for a 128³ problem on 16 processors of ASC Frost at LLNL

Kull Performance Optimization (Miller, LLNL)
• Kull is a Lagrange hydrodynamics code
• Physics packages written in C++ and Fortran
• Parallel Python interpreter run-time environment!
• Scalar test problem analysis
• Serial execution to identify performance factors
• Original code profile indicated expensive functions
• CSSubzonalEffects member functions
• Examination revealed optimization opportunities
• Loop merging
• Amortizing geometric lookup over more calculations
• Apply to CSSubzonalEffects member functions

Kull Optimization
• CSSubzonalEffects member functions' total time reduced from 5.80 seconds to 0.82 seconds
• Overall run time reduced from 28.1 to 22.85 seconds
• Original vs. optimized exclusive profiles

Important Questions for Application Developers
• How does performance vary with different compilers?
• Is poor performance correlated with certain OS features?
• Has a recent change caused unanticipated performance?
• How does performance vary with MPI variants?
• Why is one application version faster than another?
• What is the reason for the observed scaling behavior?
• Did two runs exhibit similar performance?
• How are performance data related to application events?
• Which machines will run my code the fastest, and why?
• Which benchmarks predict my code's performance best?

Multi-Level Performance Data Mining
• New (just-forming) research project
• PSU: Karen L. Karavanic; Cornell: Sally A. McKee; UO: Allen D. Malony and Sameer Shende; LLNL: John M. May and Bronis R. de Supinski
• Develop performance data mining technology
• Scientific applications, benchmarks, other measurements
• Systematic analysis for understanding and prediction
• Better foundation for evaluation of leadership-class computer systems

Goals
• Answer questions at multiple levels of interest
• Data from low-level measurements and simulations
• Use to predict application performance
• Data mining applied to optimize the data-gathering process
• High-level performance data spanning dimensions: machines, applications, code revisions
• Examine broad performance trends
• Needed technology:
• Performance data instrumentation and measurement
• Performance data management
• Performance analysis and results presentation
• Automated performance exploration

Specific Goals
• Design, develop, and populate a performance database
• Discover general correlations between application performance and features of the external environment
• Develop methods to predict application performance from lower-level metrics
• Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system
• Performance data mining infrastructure is important for all of these goals
• Establish a more rational basis for evaluating the performance of leadership-class computers

PerfTrack: Performance DB and Analysis Tool
• PSU: Kathryn Mohror, Karen Karavanic
• UO: Kevin Huck, Allen D. Malony
• LLNL: John May, Brian Miller (CASC)

TAU Performance Data Management Framework

TAU Performance Regression (PerfRegress)

Background — Ahn & Vetter, 2002
• "Scalable Analysis Techniques for Microprocessor Performance Counter Metrics," SC2002
• Applied multivariate statistical analysis techniques to large datasets of performance data (PAPI events)
• Cluster analysis and F-ratio
• Agglomerative hierarchical method: dendrogram identified groupings of master and slave threads in sPPM
• K-means clustering and F-ratio: differences between master and slave related to communication and management
• Factor analysis: shows highly correlated metrics fall into peer groups
• Combined techniques (applied recursively) lead to observations of application behavior hard to identify otherwise

Thread Similarity Matrix
• Apply techniques from phase analysis (Sherwood)
• Threads of execution can be visually compared
• Threads with abnormal behavior show up as less similar than other threads
• Each thread is represented as a vector V of dimension n, where n is the number of functions in the application: V = [f1, f2, …, fn]
• Each value is the percentage of time spent in that function, normalized from 0.0 to 1.0 (representing the event mix)
• Distance calculated between the vectors U and V:
  ManhattanDistance(U, V) = ∑_{i=1}^{n} |u_i − v_i|

sPPM on Blue Horizon (64x4, OpenMP+MPI)
• TAU profiles
• 10 events
• PerfDMF
• threads 32-47

sPPM on MCR (total instructions, 16x2)
• TAU/PerfDMF
• 120 events
• master (even)
• worker (odd)
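The thread-similarity measure above — per-function time fractions compared pairwise by Manhattan distance — can be sketched in a few lines. This is an illustration only, not TAU/PerfDMF code; the function names and sample data are invented:

```python
# Sketch of the thread-similarity computation (illustrative only):
# each thread is a vector of per-function time fractions, and threads
# are compared pairwise by Manhattan distance.

def normalize(profile):
    """Convert per-function times into fractions summing to 1.0."""
    total = sum(profile)
    return [t / total for t in profile]

def manhattan_distance(u, v):
    """ManhattanDistance(U, V) = sum over i of |u_i - v_i|."""
    return sum(abs(ui - vi) for ui, vi in zip(u, v))

def similarity_matrix(profiles):
    """Pairwise distances; abnormal threads stand out as large rows."""
    vectors = [normalize(p) for p in profiles]
    return [[manhattan_distance(u, v) for v in vectors] for u in vectors]

if __name__ == "__main__":
    # Three hypothetical threads, times spent in three functions:
    threads = [[10.0, 30.0, 60.0],   # worker
               [11.0, 29.0, 60.0],   # worker (similar mix)
               [60.0, 30.0, 10.0]]   # master-like outlier
    m = similarity_matrix(threads)
    print(m[0][1])  # small: threads 0 and 1 behave alike
    print(m[0][2])  # large: thread 2 has an abnormal event mix
```

Because the vectors are normalized, the distance reflects differences in event mix rather than in total run time, which is what makes visually similar threads cluster together in the matrix.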
sPPM on MCR (PAPI_FP_INS, 16x2)
• TAU profiles
• PerfDMF
• master/worker
• higher/lower

sPPM on Frost (PAPI_FP_INS, 256 threads)
• A view of fewer than half of the threads of execution is possible on the screen at one time
• Three groups are obvious: lower-ranking threads, one unique thread, higher-ranking threads (~3% more FP)
• Finding subtle differences is difficult with this view

sPPM on Frost (PAPI_FP_INS, 256 threads)
• Dendrogram shows 5 natural clusters:
• Unique thread
• High-ranking master threads
• Low-ranking master threads
• High-ranking worker threads
• Low-ranking worker threads
• TAU profiles, PerfDMF, R access to threads

sPPM on MCR (PAPI_FP_INS, 16x2 threads)
• masters
• slaves

sPPM on Frost (PAPI_FP_INS, 256 threads)
• After k-means clustering into 5 clusters
• Similar natural clusters are grouped
• Each group's performance characteristics analyzed
• 256 threads of data have been reduced to 5 clusters!
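The clustering step described above — reducing hundreds of per-thread metric vectors to a handful of representative groups — can be sketched with a minimal k-means implementation. This is an illustration under invented data (master threads heavy in one metric, workers in another), not the actual analysis pipeline:

```python
# Minimal k-means sketch of the thread-clustering step (illustrative
# only): per-thread metric vectors are reduced to k cluster centers.
import random

def kmeans(vectors, k, iters=50, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)          # initial centers from the data
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # Assign each vector to its nearest center (squared Euclidean).
        groups = [[] for _ in range(k)]
        for v in vectors:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(v, centers[c])))
            groups[j].append(v)
        # Recompute each center as the mean of its group.
        for j, g in enumerate(groups):
            if g:
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers, groups

if __name__ == "__main__":
    # 16 hypothetical threads: masters do more MPI, workers more FP.
    masters = [[90.0 + i, 10.0] for i in range(8)]
    workers = [[10.0 + i, 90.0] for i in range(8)]
    centers, groups = kmeans(masters + workers, k=2)
    print(sorted(len(g) for g in groups))  # the two roles separate cleanly
```

On real data the cluster count (5, in the slide) is suggested by the dendrogram from the hierarchical step, and each cluster's centroid can then be analyzed in place of hundreds of individual thread profiles.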
• [Cluster display: regions SPPM, INTERF, DIFUZE, DINTERF, Barrier [OpenMP:runhyd3.F <604,0>]]

Extreme Performance Scalable OSs (ZeptoOS)
• DOE, Office of Science
• Investigate operating system and run-time (OS/R) functionality required for scalable components used in petascale architectures
• "OS/RTS for Extreme Scale Scientific Computation"
• Argonne National Lab and University of Oregon
• Flexible OS/R functionality
• Scalable OS/R system calls
• Performance tools, monitoring, and metrics
• Fault tolerance and resiliency
• Approach:
• Specify OS/R requirements across scalable components
• Explore flexible functionality (Linux)
• Hierarchical designs optimized with collective OS/R interfaces
• Integrated (horizontal, vertical) performance measurement / analysis
• Fault scenarios and injection to observe behavior

ZeptoOS Plans
• Explore Linux functionality for BG/L
• Explore efficiency for ultra-small kernels
• Construct kernel-level collective operations (scheduler, memory, IO)
• Support for dynamic library loading, …
• Build Faulty Towers: a Linux kernel and system for replaying fault scenarios
• Extend TAU
• Profiling OS suites
• Benchmarking collective OS calls
• Observing effects of faults

Discussion
• As high-end systems scale, it will be increasingly important that performance tools be used effectively
• Performance observation methods do not necessarily need to change in a fundamental sense
• They just need to be controlled and used efficiently
• More intelligent performance systems for productive use
• Evolve to application-specific performance technology
• Deal with scale by "full range" performance exploration
• Autonomic and integrated tools
• Knowledge-based and knowledge-driven process
• Deliver next-generation performance technology to the community
Support Acknowledgements
• Department of Energy (DOE), Office of Science contracts
• University of Utah ASCI Level 1 sub-contract
• ASC/NNSA Level 3 contract
• NSF High-End Computing Grant
• Research Centre Juelich, John von Neumann Institute: Dr. Bernd Mohr
• Los Alamos National Laboratory