A leap forward with UTK’s Cray XC30

Mark R. Fahey
NICS Deputy Director
Industrial and Systems Engineering
University of Tennessee Knoxville

ISC14, June 24, 2014
Outline
• Overview
• Architecture
• Deployment
• Results
  – MPI Benchmarks
  – Rank placement
  – Applications
• Conclusions
Overview
• To meet the ever-increasing computational science research needs at the University of Tennessee, Knoxville (UTK) and its collaborating institutions, the Joint Institute for Computational Sciences (JICS) acquired a new Cray XC30 supercomputer, called Darter
• Peak performance of 248.9 Teraflops
• The Cray XC30 combines Intel Xeon processors, the Cray Aries interconnect in a Dragonfly network topology, and enhancements to the Cray operating system and programming environments
• Also configured with an integrated storage solution (Sonexion)
Darter architecture
248.9 TF/s peak

Component                     Value
Cabinets                      4
Processor                     Intel Xeon E5
Cores/processor               8
Hyperthreads/core             2
Core frequency                2.6 GHz
Sockets/node                  2
Memory/node                   32 GB
Memory bandwidth/node         25.6 GB/s
Total number of nodes         748 *
Total number of processors    1,496 *
Total number of cores         11,968 *
NIC                           Aries
Topology                      Dragonfly

Plus 2 external login nodes and 5 service blades.
* Recent I/O upgrades have reduced the compute node count.
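As a quick cross-check of the quoted peak (assuming 8 double-precision floating-point operations per cycle per core, i.e. AVX on these Xeon E5 processors), the table entries reproduce the 248.9 TF/s figure:

11,968 cores × 2.6 GHz × 8 FLOPs/cycle ≈ 248.9 TF/s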
Schedule
• System was deployed, accepted, and placed into production in only 2 weeks.
• The Spring Experiment for the Center for Analysis and Prediction of Storms (CAPS) largely drove the accelerated timeline, as the experiment was scheduled to start in mid-April.
• HPC system administrators and computational scientists worked around the clock to meet the April 2nd deliverable.
• Scientists from CAPS immediately began functionality testing as they worked to port their scientific models to this new computing architecture.
• Cray provided remote access to the system at their Chippewa Falls location, so that much of the preliminary work could be accomplished prior to delivery.

Stage                              Date
Hardware arrived                   March 19, 2013
System booted                      March 20, 2013
Hardware diagnostics completed     March 21, 2013
Acceptance testing began           March 22, 2013
Acceptance testing completed       March 28, 2013
Early user access                  April 2, 2013
Acceptance Testing - Darter
• Functionality and performance phases were combined into a single set of tests that were run continuously for 12 hours. The stability phase was run for a total of 72 hours, which included the 12 hours of combined functionality-performance testing.
• A series of benchmarks including HPL, MPI-Stream, G-FFTE, the Intel MPI Benchmarks (IMB), and IOR was used to test the system's processors, memory, interconnect, and filesystem for proper performance.
• Scientific applications (PARATEC and WRF) and the HPCC benchmarks were run on the system to test for proper system operation. These applications and benchmarks, along with IMB, were run continuously during the stability phase of the application acceptance test.
  – Over 700 tests were executed, with a pass rate of over 99.8%.
  – A single job failure was observed during stability testing, which was due to a node failure.
  – Other job failures not counted in these statistics (ten in total) were due to issues unrelated to system or application operation.
Acceptance Testing: Performance

Application: HPL
  Metric: Using at least 90% of the compute cores, achieve >= 65% of the theoretical peak of the processors used.
  Observed: 200.057 TF on 744 nodes, ~80.8% of peak.

Application: MPI-Stream
  Metric: Test on at least 90% of the nodes, achieving 50% of peak.
  Observed: Ran on 675 nodes (over 90% of all compute nodes) and measured an aggregate of 52,308.3 GB/s, or 77.49 GB/s per node, which is 75.7% of peak.

Application: G-FFTE
  Metric: Using 8,192 cores, achieve at least 1.5 TF.
  Observed: 1.99 TF on 8,192 cores.

Application: IOR (file per task)
  Metric: 7 GB/s.
  Observed: write rate 10.32 GB/s, read rate 4.89 GB/s; average of the write and read rates 7.61 GB/s.

Application: IOR (no more than 5 shared files)
  Metric: 4.5 GB/s.
  Observed: write rate 6.85 GB/s, read rate 6.10 GB/s.
Acceptance Testing – Results
• Significant improvements in application performance were observed, due to the improvements in processor, memory, and interconnect performance compared to previous Cray generations.
  – The Intel Xeon processors provide a factor-of-two improvement in both peak performance and the High Performance Linpack (HPL) benchmark on a per-core basis over the AMD Opteron processors used in Kraken (Cray XT5)
  – Improvements in memory bandwidth provide another three-fold increase in performance over the Cray XT5. A set of six applications, including WRF, was run on both systems, showing average per-core performance improvements by factors between 1 and 4.
  – WRF showed the most improvement (nearly 4x), while all but one of the remaining applications demonstrated per-core performance improvements near or above 2x.
• Considering only acquisition and operating costs, the total cost of ownership (TCO) for the Cray XC30 is roughly 2.5 times that of the Cray XT5 on a per-core basis. The performance improvements are typically greater than this factor. Of course, the reported TCO for the XC30 will decrease with time as acquisition costs are amortized, so the performance gains over the XT5 will increase relative to the TCO.
MPI Benchmarks
• The XC30 is the first Cray line that uses the Dragonfly network topology.
  – This hierarchical network topology has obvious advantages with respect to local communication
  – However, the trade-offs between bandwidth and latency are less apparent for different communication patterns as they extend further down the hierarchy
  – To gain some insight into XC30 network performance, JICS staff ran several experiments using the Intel MPI Benchmarks (IMB)
• To provide flexibility in running different MPI benchmarks and experiments, a custom code was used that creates MPI communicators
  – This code wraps IMB and passes the created MPI communicators to IMB for it to run the benchmarks (a minimal sketch of this pattern follows below)
• The next slide shows only point-to-point benchmarks; more information is in the paper
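The wrapper itself is not shown in the slides; the following is a minimal, hypothetical sketch of the pattern it describes. The helper run_imb_benchmark stands in for the hand-off into the IMB driver, and the grouping rule (ranks_per_group) is purely illustrative, not the JICS implementation.

/* Hypothetical sketch of a communicator-creating wrapper around a benchmark.
 * run_imb_benchmark() stands in for the call into the IMB driver. */
#include <mpi.h>
#include <stdio.h>

static void run_imb_benchmark(MPI_Comm comm)
{
    /* Placeholder: the real wrapper would pass 'comm' to IMB here. */
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == 0)
        printf("running benchmark on a communicator of %d ranks\n", size);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Example placement experiment: group ranks in blocks of two nodes,
     * assuming 16 MPI ranks per node (one per core on Darter). */
    const int ranks_per_group = 2 * 16;
    int color = world_rank / ranks_per_group;

    MPI_Comm subcomm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &subcomm);

    run_imb_benchmark(subcomm);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}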
Bandwidth and latency of two typical point-to-point MPI communication patterns, Sendrecv and Exchange

For both Sendrecv and Exchange, communication latency remains more or less constant for message sizes up to 1 kilobyte. A possible explanation is that 1,024 bytes is the default cut-off point for the short-message algorithm in Cray’s communication libraries. For both of these communication patterns, bandwidth continues to increase with the size of the messages. Even at the largest message sizes used in these benchmarks, the available bandwidth of the network links in the system has not been saturated.
Rank placement
• One premise of the XC30 system is that parallel job efficiency is independent of its placement on the system
  – In other words, applications “can be placed anywhere on the system without having significant performance degradation”
  – This premise is supported by at least two features of the XC30 interconnect: the high global bandwidth and extensive use of adaptive routing to minimize the number of connections and avoid network congestion
• JICS tested this premise using two applications on Darter (a sketch of their per-step communication patterns follows below)
  – One application is a domain-decomposed hydrodynamics solver in which every time step requires several ghost exchanges and global time-step reductions via MPI_Allreduce to follow the evolution of a Sedov-Taylor blast wave test problem
  – The second application is an FFT-based Poisson’s equation solver called PSPFFT in which every step requires a three-dimensional transpose via MPI_Alltoall on multiple MPI sub-communicators
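Neither code is reproduced in the slides; the schematic below (an illustrative sketch with made-up message sizes, not the actual solver code) shows the per-time-step communication skeleton described above: a halo exchange plus MPI_Allreduce for the hydrodynamics solver, and an MPI_Alltoall transpose for the PSPFFT-style solver.

/* Schematic per-time-step communication patterns of the two test codes
 * (an illustration only, not the actual solver code). */
#include <mpi.h>
#include <stdlib.h>

enum { NGHOST = 1024, NBLOCK = 256 };   /* illustrative message sizes */

/* Hydro solver pattern: ghost exchange with neighbors, then a global
 * reduction to agree on the next stable time step. */
static double hydro_step(MPI_Comm comm, int left, int right,
                         double *send, double *recv, double local_dt)
{
    MPI_Sendrecv(send, NGHOST, MPI_DOUBLE, right, 0,
                 recv, NGHOST, MPI_DOUBLE, left, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(send, NGHOST, MPI_DOUBLE, left, 1,
                 recv, NGHOST, MPI_DOUBLE, right, 1,
                 comm, MPI_STATUS_IGNORE);
    double global_dt;
    MPI_Allreduce(&local_dt, &global_dt, 1, MPI_DOUBLE, MPI_MIN, comm);
    return global_dt;
}

/* PSPFFT-style pattern: a distributed 3-D transpose expressed as an
 * all-to-all exchange of equal-sized blocks on a (sub-)communicator. */
static void fft_transpose(MPI_Comm subcomm, double *sendbuf, double *recvbuf)
{
    MPI_Alltoall(sendbuf, NBLOCK, MPI_DOUBLE,
                 recvbuf, NBLOCK, MPI_DOUBLE, subcomm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Neighbors on a 1-D periodic decomposition. */
    int left = (rank - 1 + size) % size, right = (rank + 1) % size;
    double *ghost_send = calloc(NGHOST, sizeof(double));
    double *ghost_recv = calloc(NGHOST, sizeof(double));
    double *a2a_send = calloc((size_t)NBLOCK * size, sizeof(double));
    double *a2a_recv = calloc((size_t)NBLOCK * size, sizeof(double));

    double dt = hydro_step(MPI_COMM_WORLD, left, right,
                           ghost_send, ghost_recv, 1.0e-3);
    fft_transpose(MPI_COMM_WORLD, a2a_send, a2a_recv);
    (void)dt;

    free(ghost_send); free(ghost_recv); free(a2a_send); free(a2a_recv);
    MPI_Finalize();
    return 0;
}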
Rank placement (2)
• The relative time difference as a function of “average node distance” is shown.
  – Average node distance increases as nodes are further apart in physical location and varies linearly with node ID; it is independent of network distance, since topological information is ignored (a sketch of one way to compute it follows below).
• The figures show that for small job sizes, job placement hardly affected job run time. For larger job sizes, the data suggests that there are benefits when a job occupies nodes that are on average further apart in the network.
  – This result is somewhat counter-intuitive, and we speculate that by using nodes that are less packed together, network traffic has the possibility to take different routes on this topology and, thanks to the dynamic routing algorithm of the interconnect, better avoid congestion
  – However, further work is warranted to definitively explain this behavior. A more rigorous and controlled study of the dynamic routing behavior is warranted once a mechanism to access this algorithm becomes available, but Cray does not yet provide one
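The slides do not give the exact formula for “average node distance”. One plausible reading, assuming the metric is the mean pairwise absolute difference of the node IDs in a job’s placement, is sketched below; treat this definition as an assumption rather than the authors’ actual implementation.

/* One plausible reading of "average node distance": the mean pairwise
 * absolute difference of the node IDs assigned to a job.  The slides do
 * not give the exact formula, so this definition is an assumption. */
#include <stdio.h>
#include <stdlib.h>

static double average_node_distance(const int *nid, int n)
{
    long long sum = 0, pairs = 0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            sum += llabs((long long)nid[i] - nid[j]);
            pairs++;
        }
    }
    return pairs ? (double)sum / (double)pairs : 0.0;
}

int main(void)
{
    /* Two hypothetical 4-node placements: tightly packed vs. spread out. */
    int packed[] = { 40, 41, 42, 43 };
    int spread[] = { 12, 210, 455, 690 };
    printf("packed: %.1f  spread: %.1f\n",
           average_node_distance(packed, 4),
           average_node_distance(spread, 4));
    return 0;
}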
Extreme Climate Events
2-3x runtime speedup
• Effects on sensitive ecosystems will become more pronounced and observable as greenhouse gas levels increase
  – Understanding the linkages among climate change on the global scale, shifting weather patterns and air quality on the regional scale, and extreme events on the basin scale will be key to characterizing the ecological impacts of climate change
  – In particular, most countries in Asia (e.g., China, South Asia, and Southeast Asia) are encountering increasingly serious emission problems from greenhouse gases and other pollutants (SO2, NOx, and PM2.5) due to rapid urbanization and economic growth
• The global climate model Community Earth System Model (CESM), the regional climate model Weather Research and Forecasting (WRF), and the regional chemistry model Community Multiscale Air Quality Modeling System (CMAQ) have been adopted to investigate the dynamical downscaling technique at fine horizontal resolution
  – With the high resolution, it is easier and more reasonable to examine the frequency and intensity of extreme events and air quality in different regions and cities
The figure shows the comparison of average surface temperature between the present and future cases. The average temperature for most of Eastern China has increased by the end of the 2050s, especially in the southern and eastern parts. Further analysis of the intensity and duration of extreme weather events such as heat waves and extreme precipitation will be performed and evaluated statistically for the areas of interest.
Fusion Microturbulence - GYRO
• Simulation of fusion microturbulence
• Solves 5-D coupled time-dependent nonlinear gyrokinetic-Maxwell equations with gyrokinetic ions and electrons
• Computes the turbulent radial transport of particles and energy in tokamak plasmas
• Can operate as a flux-tube (local) code, or as a global code, with electrostatic or electromagnetic fluctuations
• Propagates the system forward using either a 4th-order explicit RK integrator or a 2nd-order, implicit-explicit RK integrator with a fourth-order, explicit Eulerian algorithm

[Figure: GYRO nl02a benchmark, MPI only - timesteps per second vs. number of processors (50-550) for Darter (Intel E5-2670), Beacon (Xeon E5-2670), Ares (AMD Interlagos), and Kraken (AMD Istanbul)]
[Figure: GYRO nl02a benchmark, MPI/OpenMP with 128 MPI processes - timesteps per second vs. threads per process (1-8) for Darter, Beacon-Xeon, Ares, and Kraken]
Applications – storm prediction
• The Center for Analysis and Prediction of Storms (CAPS) at the University of Oklahoma ran Storm-Scale Ensemble Forecasts (SSEF)
  – supporting the National Oceanic and Atmospheric Administration (NOAA) Hazardous Weather Testbed (HWT) 2013 Spring Experiment
  – a yearly high-profile experiment that investigates the use of convection-allowing model forecasts as guidance for the prediction of hazardous convective weather
• The 2013 CAPS SSEF runs lasted seven weeks between April 22 and June 7, 2013, overlapping with the HWT 2013 Spring Experiment from May 6 to June 7, 2013
  – A thirty-member storm-scale ensemble forecast of 48 hours, at 4-km horizontal grid spacing over the entire continental United States, was produced daily
  – Three state-of-the-science numerical weather prediction (NWP) models were used:
    · Advanced Research version of the Weather Research and Forecast model (WRF-ARW)
    · Advanced Regional Prediction System (ARPS)
    · Navy COAMPS model system
  – Each ensemble member forecast has unique initial-condition and lateral-boundary-condition perturbations and model physics options
  – Over 140 WSR-88D Doppler weather radars and conventional observations over the CONUS were assimilated in real time for each member using the ARPS 3DVAR and Complex Cloud Analysis package
  – Ensemble forecast products, including probabilistic severe weather guidance, tornadic weather potential and intensity, and flash flood and damaging wind guidance, were made available to HWT participants 12-36 hours in advance
20-hour forecast initiated at 00 UTC May 20, 2013, valid at 20 UTC May 20, the time of the Moore tornado touchdown

3x runtime speedup
Applications – light water reactor modeling
• The Consortium for Advanced Simulation of Light Water Reactors (CASL) was established as the first U.S. Department of Energy (DOE) Innovation Hub, and was created to accelerate the application of advanced modeling and simulation (M&S) to the analysis of nuclear reactors.
• CASL applies existing M&S capabilities and develops advanced capabilities to create a usable environment for predictive simulation of light water reactors (LWRs).
• The CASL vision is to predict the performance of nuclear reactors through comprehensive, science-based modeling and simulation technology that is deployed and applied broadly throughout the nuclear energy industry to enhance safety, reliability, and economics.
• Darter was used by CASL as a development platform for a physics component known as MPACT, being developed primarily by staff and students at the University of Michigan.
  – MPACT is one of two components being developed to simulate the behavior of neutrons within reactor cores
  – MPACT performed multiple simulations that used over 3,000 cores for roughly 3.5 hours
  – A result is depicted at right
• One of the most significant milestones CASL has delivered, simulation of the Watts Bar Unit 1 zero power physics test, was completed successfully in early June of 2013.
Thermal Neutron Flux Distribution Generated by MPACT on Darter

2x runtime speedup compared to XE6
Applications – astrophysics
• The UTK-ORNL Astrophysics Theory Group uses its multidimensional radiation hydrodynamics code, CHIMERA, to make substantial strides in furthering our understanding of the explosion mechanism in core-collapse supernovae.
  – CHIMERA has been used for supernova modeling on previous generations of Cray machines and was easily ported to run on Darter.
• Core-collapse supernovae are the violent deaths of massive stars and are among the most powerful explosions in the universe
  – Core-collapse supernovae are an important link in our chain of origins from the Big Bang to the present.
  – During their lifetimes, massive stars are the dominant sites where elements in the periodic table between oxygen and iron are synthesized in the process known as stellar nucleosynthesis.
  – Explosion events then disperse these elements to the rest of the universe, making life as we know it possible.
• Nucleosynthesis does not cease during the explosion event of core-collapse supernovae.
  – There is growing evidence that core-collapse supernovae are responsible for producing half the elements heavier than iron in a process known as r-process nucleosynthesis.
Applications – astrophysics
3x runtime speedup compared to XE6
• The UTK-ORNL Astro Group is evolving a two-dimensional core-collapse supernova model from the inner core of a 15-solar-mass, non-rotating progenitor on a polar axisymmetric grid with 720 radial zones and 240 angular zones and a 150-species nuclear reaction network.
  – Early results are promising, and show quantitative differences with simulations of the same model using an α-network.
  – Qualitative comparisons, however, must be postponed until the explosion energy asymptotes and the tracer particle fates are determined, marking the “completion” of the model.
  – Prior experience has shown that this is typically achieved after roughly 1.5 seconds of evolution. After 10 days of computation on 1,920 cores, the model has evolved 650 ms in total, and the average shock radius has reached approximately 880 km.

Entropy plot of the model with the 150-species nuclear network at 300 ms after core bounce, where some of the characteristic behavior of the explosion, such as the SASI [3], is apparent.
Conclusions
• UTK acquired a Cray XC30 supercomputer, called Darter, with a peak performance of 248.9 Teraflops
• Deployed in late March of 2013 with a very aggressive production timeline
• Science productivity came immediately
  – Even supported a nightly operational project (storm prediction)
  – Most applications have seen a 2-3x speedup with just a simple recompile (no optimization)
• Application teams have benefited greatly from
  – the consistency of the Cray multigenerational software stack
  – quick ports and immediate speedups due to processor, interconnect, and compiler improvements
Acknowledgments
• We would like to thank
  – Joshua Fu for the extreme climate events highlight
  – Ming Xue for contribution of the CAPS highlight
  – John Turner for the contribution of the CASL highlight
  – James Austin for the contribution of the Astrophysics highlight
• This material is based upon work performed using computational resources supported by the University of Tennessee and Oak Ridge National Laboratory Joint Institute for Computational Sciences
Thank you
• Questions?
  – [email protected]
Extra
Acceptance Testing
• All production JICS resources are put through a defined set of acceptance tests to determine that their functionality and performance baselines meet expected parameters.
• Based on the size and expected use of the system, varying criteria are defined as components of the acceptance test.
• In general, the acceptance tests are divided into hardware and application tests. The hardware tests include initial diagnostics and testing that confirm the delivered hardware and firmware are correctly configured and functioning.
• Additionally, the system is configured to use the center’s infrastructure, including the necessary file systems, user authentication, and network.
• The application tests are further subdivided into functionality, performance, and stability phases.
• At the end of each phase, the application acceptance tests are run continuously on the system for a specified amount of time.
• This “continuous system integration” step ensures that all parts of the system are tested under conditions similar to production utilization.