Programming weather, climate, and earth-system models on heterogeneous multi-core platforms
Conference, Sept 7 & 8

Allinea DDT 3.0 for the Debugging Challenge for Weather, Climate and Earth-System Models
David Maples
Allinea Software Inc
[email protected]
www.allinea.com

HPC World
• High Performance Computing needs ever-increasing compute power
• Performance improvements will come from:
  – Concurrency and multi-core architectures
  – Optimized software
• Writing or migrating software for concurrency is more complex and requires different tools and skills
[Chart: Systems in Top 500 with 8k - 32k cores and with 32k+ cores, 2006 to 2010; x-axis: Year (June & November Lists)]

New Market Drivers
• "Software has become the #1 roadblock … Many applications will need a major redesign" (IDC HPC Update, June 2010)
  – Most ISV codes do not scale
  – High programming costs are delaying GPU usage
• Development tools are a vital part of the solution

Allinea Software
• HPC tools company since 2001
  – Allinea DDT: Scalable parallel debugger
  – Allinea OPT: Optimization tool for MPI and non-MPI
• Large U.S. and large European customer base
  – 12 of the top 20 systems in EMEA run Allinea DDT
  – Most scalable and cost effective debugger for CUDA
  – Users debugging at all scales from 1 to 100,000 cores and beyond, but it's also easy to use on small clusters!
  – World's only Petascale debugger!

Clients and Partners
• Aviation and Defence
• Climate and Weather
• Energy
• Electronic Design Automation
• Academic: over 200 universities

Allinea Clients in Climate
• Weather and Climate are a great fit
  – HLRS, our first user in Germany in 2004
  – NERSC
  – Met Office (UK)
  – Proudman (UK)
  – Irish Centre for High-End Computing (ICHEC)
  – British Geological Survey (BGS), UK
  – IFREMER (France)
  – Meteo France
  – NOAA, USA (Cray Linux)
  – Mercator France
  – US Navy, Fleet Numeric
  – BoM, Australia
  – Royal Meteorological Institute of Belgium (IRM)

Collaborations
• Partnership to develop Petascale debugger with NVIDIA support
• Partnership to develop Petascale/Exascale tools and standards
• Partnership on Full Scale debugging on IBM Blue Gene/P and /Q
• Allinea DDT is "Debugger of Choice" on NERSC 5 and NERSC 6, and first implementation on Cray XE6
• Partnership with CEA (French Atomic Energy Authority) on scalable programming and CUDA
• Partnership on the Keeneland project to help solve software challenges introduced by mixed architectures

Allinea Software Collaborations: Technical Collaboration Results (examples)
• Cray
  – Scalability: Most Scalable Debugger for Cray
  – Fast Track support: Rapid Debugging exclusive from Cray
  – UPC and CAF Support
  – Cray User Group
  – Titan Debugger Development collaboration
  – In-house expertise on Allinea Software
• SGI
  – UPC and CAF Support for SGI Compiler
  – SGI Training for Allinea Users
  – In-house expertise on Allinea Software
• IBM
  – Enhanced BlueGene Support for Scalable Debugging for BG/P and future
• NVIDIA
  – Allinea DDT with CUDA Support
• Developed on Jaguar
• Shipped commercially since April 2010

What is the value to your work?
• Scalability, Ease of Use and Intuitive GUI
• Allinea DDT extended capabilities
• Allinea Joint Development deliverables are included in the standard product
• Allinea collaboration with you to build new capabilities for your market
• Allinea support for current and future architectures
• Large group of DDT users in Weather/Climate

Allinea DDT - Key capabilities for WC&E

Use a Parallel Debugger
• Many benefits to graphical parallel debuggers
  – Large feature sets for common bugs
  – Richness of user interface and real control of processes
• Historically all parallel debuggers hit scale problems
  – Bottleneck at the front-end: Direct GUI → nodes architectures
    • Linear performance in number of processes
  – Human factors limit: mouse fatigue and brain overload
• Are tools ready for the task?
  – Allinea DDT has changed the game

Achievements
• Allinea DDT: First debugger with MPI and CUDA debugging
  – Simplifying hybrid debugging
  – Strong partnership with NVIDIA enables support for latest toolkit
• Allinea DDT new releases support new capabilities
  – June 2010: Toolkit 3.0 - NVIDIA DDT 2.6
  – December 2010: Toolkit 3.1 and 3.2 - NVIDIA DDT 2.6.1
  – April 2011: Scalability and More, DDT 3.0
• Allinea DDT smashes the Petascale barrier
  – 220,000 core debugging delivered to Oak Ridge National Laboratory
  – Full set of core capabilities with global ~100ms timings

Allinea DDT 3.0
• Petascale Architecture: Common collective process operations complete in a fraction of a second, even at over 200,000 cores!
• Smart Highlighting: Automated display of the differences between processes and the changing of variable values
• Visualization: New distributed multi-dimensional array viewer with filtering
• Faster C++ debugging: Automatic display of STL, Boost and Qt variables
• Cross Process Comparison: Improved scalable cross process comparison
• Attaching to Jobs: Improved Attach window lets you easily find and select MPI jobs and attach to subsets
• HMPP Support: DDT 3.0 introduces support for CAPS HMPP
• Tracepoints: Intelligent logging and merging of variable history during program execution

DDT in a nutshell
• Scalar features
  – Advanced C++ and STL
  – Fortran 90, 95 and 2003: modules, allocatable data, pointers, derived types
  – Memory debugging
• Multithreading & OpenMP features
  – Step, breakpoint etc. one or all threads
• MPI features
  – Easy to manage groups
  – Control processes by groups
  – Compare data
  – Visualize message queues

DDT Platforms
• x86, x86_64
  – Operating System: RHEL 4, 5, 6; SLES 10, 11; Fedora 4 and above; Ubuntu 8.04 and above
  – MPI: All known MPIs including: SGI Altix, Bproc, Bull MPI 1 and 2, LAM-MPI, MPICH, Myricom MPICH-GM and MPICH-MX, Open MPI, Quadrics MPI, Platform (Scali) MPI, SCore, Scyld, Intel MPI, Slurm, MVAPICH
  – Compilers: GNU, Absoft, Intel, Pathscale, PGI, Sun
• Cell BE
  – Operating System: Fedora Core 7, Yellow Dog
  – MPI: As above
  – Compilers: Cell BE SDK 3.0
• IBM Power
  – Operating System: AIX 5.3 and above
  – MPI: IBM PE, MPICH
  – Compilers: Native, GNU
• Sun Sparc
  – Operating System: Solaris 9 and above
  – MPI: Sun Clustertools 5 and above
  – Compilers: Native - Studio 11
• Sun Solaris Opteron
  – Operating System: Solaris 10 and above
  – MPI: Sun Clustertools 6, 7 and MPICH
  – Compilers: Native - Studio 11/12, GNU
• Cray XT/XE
  – Operating System: SLES 10, 11 (frontend)
  – MPI: Cray MPT (aprun) and Open MPI
  – Compilers: Cray, PGI, Pathscale, Intel, GNU
• Blue Gene/P
  – Operating System: SLES 10 (frontend)
  – MPI: Native
  – Compilers: Native, GNU
• NEC SX 9
  – Operating System: SUPER-UX 15.1 (backend only)
  – MPI: Native
  – Compilers: Native

CAPS HMPP Support
• Automatic detection of HMPP code fragments and set breakpoints before/after kernel
• Step over a kernel
  – Ignore HMPP wrapper layers
• Suppress stack of HMPP internals to report only user code and high-level name of HMPP fragment
• Obtain error codes (if possible) from HMPP kernels

Handling Regular Bugs
• Immediate stop on crash
  – Segmentation fault, or other memory problems
  – Abort, exit, error handlers
  – CUDA errors
• Scalable handling of error messages
• Leaps to the problem
  – Source code highlighted
  – Affected processes shown
  – Process stacks displayed clearly in parallel

Finding the cause
• Full class/structure browsing
  – Locals and Current line(s)
    • Show variables relevant to current position
    • Drag in the source code to see more
  – C, C++, F90: object members, static members and derived types
• Automatic comparison and change detection
  – Scalable and fast

Smart Highlighting
• Compare variables across processes and instantly detect changes: Fast and scalable!
  – Blue: Value change
  – Green: Different value on other process(es)
• Full class/structure browsing
  – Local variables and current line(s)
    • Show variables relevant to current position
    • Drag in the source code to see more
  – C, C++, F90: object members, static members, derived types

Finding rogue processes
• Easy to find where differences are:
  – Cross process comparison of data
    • Fetches values from every process, compares and then groups by value
    • Summary of NaN, Inf and statistics
  – Easy to spot rogues
• Use to group processes
  – Define process group and control en masse

Cross Process Comparison
• Analyse expressions calculated on each process in the current process group
• Cross process comparison of data
• Fetches values from every process, compares and then groups by value
• Summary of NaNs, Infs and statistics
• Easy to spot rogue processes!
• Use to group processes
  – Define a process group

Visualization
3-D Visualization of distributed data using the Multi-Dimensional Array viewer
• Large Array Support
• Browse arrays
  – 1, 2, 3… dimensions
  – Table view
• Filtering
  – Look for an outlying value
• Export
  – Save to a spreadsheet
• View arrays from multiple processes
  – Search through terabytes for rogue data in parallel

Tracepoints
• Intelligent logging and merging of variable history during execution
• "Scalable printf":
  – No need to recompile your program
  – Merging helps prevent information overload: Network traffic and user interface
  – Add conditions to filter output
• Allows you to view both the data and the lines of code your program is executing without stopping
  – View program flow and state quickly over multiple iterations
• Save output for offline analysis
  – Free up system resources

Improved C++ debugging
• Faster startup when debugging C++ codes
  – Much improved performance for heavily templated code
• Edit Type Feature
  – Helps viewing polymorphic types
• Automatic display of STL, Boost and Qt containers
  – Easily de-reference pointers
  – Easily view the contents of the data structure
[Screenshots: Before / After]

Attaching to Jobs
• Improved Attach window allows you to easily find and select MPI jobs and attach to running processes
• Clicking the Attach to a Running Program button on the Welcome Screen will show DDT's Attach Window:
  – List of automatically detected MPI jobs: No need to select individual processes
  – Or you can manually select from a list of processes if required

Memory Debugging
• Find memory leaks
• Or stop on read/write beyond end of array

Debugging at Scale

Problems at Scale
• Increasing job sizes lead to unanticipated errors
  – Regular bugs
    • Data issues from larger data sets, e.g. garbage in..., overflow
    • Logic issues and control flow
  – Increasing probability of independent random error
    • Memory errors/exhaustion: "random" bugs!
    • System problems: MPI and operating system
  – Pushing coded boundaries
    • Algorithmic (performance)
    • Hard-wired limits ("magic numbers")
  – Unknown unknowns
    • ....

Strategies for bug fixing I
• Improved coding standards: unit tests, assertions
  – Good practice, but coverage is rarely perfect
    • Random/system issues are often missed
  – Combines well with debuggers
    • Find why a failure occurs, not just a pass/fail
• Logging: printf and write
  – If you have good intuition into the problem
    • Edit code, insert print, recompile and re-run
    • Slow and iterative
  – Post-mortem analysis only
    • Hard to establish the real order of output of multiple processes
    • Rapid growth in log output size
    • Unscalable

Strategies for bug fixing II
• Reproduce at a smaller scale
  – Attempt to make problem happen on fewer nodes
    • Often requires reduced data set: the large one may not fit
  – Smaller data set may not trigger the problem
    • Does the bug even exist on smaller problems?
  – Didn't you already try the code at small scale?
    • Is it a system issue, e.g. an MPI problem?
  – Is probability stacking up against you?
    • Unlikely to spot on smaller runs without many, many runs
    • But near guaranteed to see it on a many-thousand core run
  – What can a parallel debugger do to help?
    • Debug at the scale of the problem - Now.
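
The "is probability stacking up against you?" point can be made concrete with a back-of-envelope sketch. It assumes, purely for illustration, that every process independently hits a rare fault with the same probability in a given run; the function name and the value of `P_FAULT` are hypothetical and do not come from the talk.

```python
def prob_any_failure(p_per_process: float, n_processes: int) -> float:
    """Probability that at least one of n independent processes hits the fault."""
    return 1.0 - (1.0 - p_per_process) ** n_processes


# Hypothetical per-process fault probability per run (an assumption, not measured).
P_FAULT = 1e-4

if __name__ == "__main__":
    for n in (16, 1_000, 10_000, 220_000):
        print(f"{n:>7} processes: {prob_any_failure(P_FAULT, n):.4f}")
```

Under these assumed numbers, a fault seen in well under 1% of 16-process runs appears in roughly 63% of 10,000-process runs and is close to certain at 220,000 processes, which is why reproducing at small scale can take many, many runs.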

Scalable Process Control
• Parallel Stack View
  – Finds rogue processes quickly
  – Identify classes of process behaviour
  – Rapid grouping of processes
• Control Processes by Groups
  – Set breakpoints, step, play, stop for groups
  – Scalable groups view: compact group display

Petascale Architecture: DDT 3.0 Performance Figures
• Logarithmic performance due to new tree architecture
• Many operations are now faster at 220,000 than previously at 1000 cores
• ~1/10th of a second to step and gather all stacks at 220,000 cores
• Developed due to collaborations with ORNL on Jaguar Cray XT, ANL and CEA
• A massive performance revolution for every user's benefit!
[Chart: Time (Seconds), 0 to 0.12, against MPI Processes, 0 to 200,000, for the All Step and All Breakpoint operations]

Debugging GPU Applications

CUDA Debugging Options
• Old world "printf"
  – NVIDIA SDK 3.0 allows this but with limitations
• Fake it: run the kernel on the host x86_64 processor
  – Languages often support targeting host CPU instead of GPU
  – Different numeric precision: different answer?
  – Different scheduling: different answer?
  – A reasonable option for some bugs
• Or run on the GPU with Allinea DDT...
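
One reason the "different scheduling, different answer?" caveat is real: floating-point addition is not associative, so the order in which partial results are combined can change the final value. The sketch below is plain Python rather than CUDA, and the values and the helper name `accumulate` are chosen only for illustration; it shows a general floating-point property, not measured host or GPU behaviour.

```python
def accumulate(values):
    """Add values one at a time, in the order given, using plain float adds.

    The built-in sum() is deliberately avoided here, because recent Python
    versions may apply compensated summation to floats and hide the effect.
    """
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]
forward = accumulate(values)             # (0.1 + 0.2) + 0.3
backward = accumulate(reversed(values))  # (0.3 + 0.2) + 0.1

if __name__ == "__main__":
    print(forward, backward, forward == backward)
```

The two results differ only in the last bits, but a comparison or a convergence test that depends on them can take a different branch, which is enough for a bug to appear on one target and not on the other.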

GPUs Made Easy
• View all threads in parallel stack view
• At one glance, see all GPU and CPU threads together
• Links with thread selection
• Pick a tree node to select one of the CUDA threads at that location
• Full MPI support
• See GPU and CPU threads from multiple nodes

Debugging Kernels
• Debugging CPU and GPU concurrently
  – Browse source, examine variables, control processes and threads
• Set breakpoints
  – Automatically stop on kernel launch
  – Stop at a line of CUDA code
  – Kernels stop when breakpoint reached
  – Hover the mouse for more information
• Step a warp: 32 CUDA threads

Examine Thread Data
• At a glance display of variables
  – Expressions, local variables, and current line
  – Also possible to edit values
• Displays the memory types
  – shared, parameter, constant, register, …

DDT CUDA Status
• NVIDIA SDK 3.1, SDK 3.2
• Allinea DDT 3.0
  – Multi-device support
  – Fermi and Tesla support
  – CUDA Memcheck support for memory errors
  – MPI and CUDA support for GPU clusters
  – Breakpoints, thread control, and data evaluation
  – Stop on kernel launch

Summary
• Debuggers are the right tools to fix bugs quickly
  – Other methods have limited success and issues at scale
• Allinea DDT scales in both performance and interface
  – Breaking all records and making problems manageable
  – Be sure to get DDT release 3.0!
• Allinea DDT supports NVIDIA CUDA with the ability to debug code running on both the CPU and GPU
  – NVIDIA SDK 3.1, SDK 3.2 and available for SDK 4.0
• Contact [email protected] or [email protected]

Thank You
David Maples
Allinea Software Inc.
2033 Gateway Pl., San Jose, CA
408 884 0282
[email protected]