A leap forward with UTK’s Cray XC30

Mark R. Fahey
NICS Deputy Director
Industrial and Systems Engineering
University of Tennessee Knoxville

ISC14, June 24, 2014
Outline
• Overview
• Architecture
• Deployment
• Results
  – MPI Benchmarks
  – Rank placement
  – Applications
• Conclusions
Overview
• To meet the ever-increasing computational science research needs at the University of Tennessee, Knoxville (UTK) and its collaborating institutions, the Joint Institute for Computational Sciences (JICS) acquired a new Cray XC30 supercomputer, called Darter
• Peak performance of 248.9 Teraflops
• The Cray XC30 combines Intel Xeon processors, the Cray Aries interconnect in a Dragonfly network topology, and enhancements to the Cray operating system and programming environments
• Also configured with an integrated storage solution (Sonexion)
Darter architecture
248.9 TF/s peak

Component                     Value
Cabinets                      4
Processor                     Intel Xeon E5
Cores/processor               8
Hyperthreads/core             2
Core frequency                2.6 GHz
Sockets/node                  2
Memory/node                   32 GB
Memory bandwidth/node         25.6 GB/s
Total number of nodes         748 *
Total number of processors    1,496 *
Total number of cores         11,968 *
NIC                           Aries
Topology                      Dragonfly

Plus 2 external login nodes and 5 service blades.
* Recent I/O upgrades have reduced the compute node count.
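As a quick cross-check of the quoted peak (assuming 8 double-precision floating-point operations per cycle per core, i.e. AVX on these Xeon E5 processors), the table entries reproduce the 248.9 TF/s figure:

11,968 cores × 2.6 GHz × 8 FLOPs/cycle ≈ 248.9 TF/s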
Schedule
• System was deployed, accepted, and placed into production in only 2 weeks.
• The Spring Experiment for the Center for Analysis and Prediction of Storms (CAPS) largely drove the accelerated timeline, as the experiment was scheduled to start in mid-April.
• HPC system administrators and computational scientists worked around the clock to meet the April 2nd deliverable.
• Scientists from CAPS immediately began functionality testing as they worked to port their scientific models to this new computing architecture.
• Cray provided remote access to the system at their Chippewa Falls location, so that much of the preliminary work could be accomplished prior to delivery.

Stage                              Date
Hardware arrived                   March 19, 2013
System booted                      March 20, 2013
Hardware diagnostics completed     March 21, 2013
Acceptance testing began           March 22, 2013
Acceptance testing completed       March 28, 2013
Early user access                  April 2, 2013
Acceptance Testing - Darter
• Functionality and performance phases were combined into a single set of tests that were run continuously for 12 hours. The stability phase was run for a total of 72 hours, which included the 12 hours of combined functionality-performance testing.
• A series of benchmarks including HPL, MPI-Stream, G-FFTE, the Intel MPI Benchmarks (IMB), and IOR was used to test the system's processors, memory, interconnect, and filesystem for proper performance.
• Scientific applications (PARATEC and WRF) and the HPCC benchmarks were run on the system to test for proper system operation. These applications and benchmarks, along with IMB, were run continuously during the stability phase of the application acceptance test.
  – Over 700 tests were executed, with a pass rate of over 99.8%.
  – A single job failure was observed during stability testing, which was due to a node failure.
  – Other job failures not counted in these statistics (ten in total) were due to issues unrelated to system or application operation.
Acceptance Testing: Performance

Application: HPL
  Metric: Using at least 90% of the compute cores, achieve >= 65% of the theoretical peak of the processors used.
  Observed: 200.057 TF on 744 nodes, ~80.8% of peak.

Application: MPI-Stream
  Metric: Test on at least 90% of the nodes, achieving 50% of peak.
  Observed: Ran on 675 nodes (over 90% of all compute nodes) and measured an aggregate of 52,308.3 GB/s, or 77.49 GB/s per node, which is 75.7% of peak.

Application: G-FFTE
  Metric: Using 8,192 cores, achieve at least 1.5 TF.
  Observed: 1.99 TF on 8,192 cores.

Application: IOR (file per task)
  Metric: 7 GB/s.
  Observed: write rate 10.32 GB/s, read rate 4.89 GB/s; average of the write and read rates 7.61 GB/s.

Application: IOR (no more than 5 shared files)
  Metric: 4.5 GB/s.
  Observed: write rate 6.85 GB/s, read rate 6.10 GB/s.
Acceptance Testing – Results
• Significant improvements in application performance were observed, due to the improvements in processor, memory, and interconnect performance compared to previous Cray generations.
  – The Intel Xeon processors provide a factor-of-two improvement in both peak performance and the High Performance Linpack (HPL) benchmark on a per-core basis over the AMD Opteron processors used in Kraken (Cray XT5)
  – Improvements in memory bandwidth provide another three-fold increase in performance over the Cray XT5. A set of six applications, including WRF, was run on both systems, showing average per-core performance improvements by factors between 1 and 4.
  – WRF showed the most improvement (nearly 4x), while all but one of the remaining applications demonstrated per-core performance improvements near or above 2x.
• Considering only acquisition and operating costs, the total cost of ownership (TCO) for the Cray XC30 is roughly 2.5 times that of the Cray XT5 on a per-core basis. The performance improvements are typically greater than this factor. Of course, the reported TCO for the XC30 will decrease with time as acquisition costs are amortized, so the performance gains over the XT5 will increase relative to the TCO.
MPI Benchmarks
• The XC30 is the first Cray line that uses the Dragonfly network topology.
  – This hierarchical network topology has obvious advantages with respect to local communication
  – However, the trade-offs between bandwidth and latency are less apparent for different communication patterns as they extend further down the hierarchy
  – To gain some insight into XC30 network performance, JICS staff ran several experiments using the Intel MPI Benchmarks (IMB)
• To provide flexibility in running different MPI benchmarks and experiments, a custom code was used that creates MPI communicators
  – This code wraps IMB and passes the created MPI communicators to IMB for it to run the benchmarks (a minimal sketch of this pattern follows below)
• The next slide shows only point-to-point benchmarks; more information is in the paper
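The wrapper itself is not shown in the slides; the following is a minimal, hypothetical sketch of the pattern it describes. The helper run_imb_benchmark stands in for the hand-off into the IMB driver, and the grouping rule (ranks_per_group) is purely illustrative, not the JICS implementation.

/* Hypothetical sketch of a communicator-creating wrapper around a benchmark.
 * run_imb_benchmark() stands in for the call into the IMB driver. */
#include <mpi.h>
#include <stdio.h>

static void run_imb_benchmark(MPI_Comm comm)
{
    /* Placeholder: the real wrapper would pass 'comm' to IMB here. */
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == 0)
        printf("running benchmark on a communicator of %d ranks\n", size);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Example placement experiment: group ranks in blocks of two nodes,
     * assuming 16 MPI ranks per node (one per core on Darter). */
    const int ranks_per_group = 2 * 16;
    int color = world_rank / ranks_per_group;

    MPI_Comm subcomm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &subcomm);

    run_imb_benchmark(subcomm);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}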
Bandwidth and latency of two typical point-to-point MPI communication patterns, Sendrecv and Exchange

For both Sendrecv and Exchange, communication latency remains more or less constant for message sizes up to 1 kilobyte. A possible explanation is that 1,024 bytes is the default cut-off point for the short-message algorithm in Cray’s communication libraries. For both of these communication patterns, bandwidth continues to increase with the size of the messages. Even at the largest message sizes used in these benchmarks, the available bandwidth of the network links in the system has not been saturated.
Rank placement
• One premise of the XC30 system is that parallel job efficiency is independent of its placement on the system
  – In other words, applications “can be placed anywhere on the system without having significant performance degradation”
  – This premise is supported by at least two features of the XC30 interconnect: the high global bandwidth and extensive use of adaptive routing to minimize the number of connections and avoid network congestion
• JICS tested this premise using two applications on Darter (a sketch of their per-step communication patterns follows below)
  – One application is a domain-decomposed hydrodynamics solver in which every time step requires several ghost exchanges and global time-step reductions via MPI_Allreduce to follow the evolution of a Sedov-Taylor blast wave test problem
  – The second application is an FFT-based Poisson’s equation solver called PSPFFT in which every step requires a three-dimensional transpose via MPI_Alltoall on multiple MPI sub-communicators
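Neither code is reproduced in the slides; the schematic below (an illustrative sketch with made-up message sizes, not the actual solver code) shows the per-time-step communication skeleton described above: a halo exchange plus MPI_Allreduce for the hydrodynamics solver, and an MPI_Alltoall transpose for the PSPFFT-style solver.

/* Schematic per-time-step communication patterns of the two test codes
 * (an illustration only, not the actual solver code). */
#include <mpi.h>
#include <stdlib.h>

enum { NGHOST = 1024, NBLOCK = 256 };   /* illustrative message sizes */

/* Hydro solver pattern: ghost exchange with neighbors, then a global
 * reduction to agree on the next stable time step. */
static double hydro_step(MPI_Comm comm, int left, int right,
                         double *send, double *recv, double local_dt)
{
    MPI_Sendrecv(send, NGHOST, MPI_DOUBLE, right, 0,
                 recv, NGHOST, MPI_DOUBLE, left, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(send, NGHOST, MPI_DOUBLE, left, 1,
                 recv, NGHOST, MPI_DOUBLE, right, 1,
                 comm, MPI_STATUS_IGNORE);
    double global_dt;
    MPI_Allreduce(&local_dt, &global_dt, 1, MPI_DOUBLE, MPI_MIN, comm);
    return global_dt;
}

/* PSPFFT-style pattern: a distributed 3-D transpose expressed as an
 * all-to-all exchange of equal-sized blocks on a (sub-)communicator. */
static void fft_transpose(MPI_Comm subcomm, double *sendbuf, double *recvbuf)
{
    MPI_Alltoall(sendbuf, NBLOCK, MPI_DOUBLE,
                 recvbuf, NBLOCK, MPI_DOUBLE, subcomm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Neighbors on a 1-D periodic decomposition. */
    int left = (rank - 1 + size) % size, right = (rank + 1) % size;
    double *ghost_send = calloc(NGHOST, sizeof(double));
    double *ghost_recv = calloc(NGHOST, sizeof(double));
    double *a2a_send = calloc((size_t)NBLOCK * size, sizeof(double));
    double *a2a_recv = calloc((size_t)NBLOCK * size, sizeof(double));

    double dt = hydro_step(MPI_COMM_WORLD, left, right,
                           ghost_send, ghost_recv, 1.0e-3);
    fft_transpose(MPI_COMM_WORLD, a2a_send, a2a_recv);
    (void)dt;

    free(ghost_send); free(ghost_recv); free(a2a_send); free(a2a_recv);
    MPI_Finalize();
    return 0;
}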
Rank placement (2)
• The relative time difference as a function of “average node distance” is shown.
  – Average node distance increases as nodes are further apart in physical location and varies linearly with node ID; it is independent of network distance, since topological information is ignored (a sketch of one way to compute it follows below).
• The figures show that for small job sizes, job placement hardly affected job run time. For larger job sizes, the data suggests that there are benefits when a job occupies nodes that are on average further apart in the network.
  – This result is somewhat counter-intuitive, and we speculate that by using nodes that are less packed together, network traffic has the possibility to take different routes on this topology and, thanks to the dynamic routing algorithm of the interconnect, better avoid congestion
  – However, further work is warranted to definitively explain this behavior. A more rigorous and controlled study of the dynamic routing behavior is warranted once a mechanism to access this algorithm becomes available, but Cray does not yet provide one
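The slides do not give the exact formula for “average node distance”. One plausible reading, assuming the metric is the mean pairwise absolute difference of the node IDs in a job’s placement, is sketched below; treat this definition as an assumption rather than the authors’ actual implementation.

/* One plausible reading of "average node distance": the mean pairwise
 * absolute difference of the node IDs assigned to a job.  The slides do
 * not give the exact formula, so this definition is an assumption. */
#include <stdio.h>
#include <stdlib.h>

static double average_node_distance(const int *nid, int n)
{
    long long sum = 0, pairs = 0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            sum += llabs((long long)nid[i] - nid[j]);
            pairs++;
        }
    }
    return pairs ? (double)sum / (double)pairs : 0.0;
}

int main(void)
{
    /* Two hypothetical 4-node placements: tightly packed vs. spread out. */
    int packed[] = { 40, 41, 42, 43 };
    int spread[] = { 12, 210, 455, 690 };
    printf("packed: %.1f  spread: %.1f\n",
           average_node_distance(packed, 4),
           average_node_distance(spread, 4));
    return 0;
}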
Extreme Climate Events
2-3x runtime speedup
• Effects on sensitive ecosystems will become more pronounced and observable as greenhouse gas levels increase
  – Understanding the linkages among climate change on the global scale, shifting weather patterns and air quality on the regional scale, and extreme events on the basin scale will be key to characterizing the ecological impacts of climate change
  – In particular, most countries in Asia (e.g., China, South Asia, and Southeast Asia) are encountering increasingly serious emission problems from greenhouse gases and other pollutants (SO2, NOx, and PM2.5) due to rapid urbanization and economic growth
• The global climate model Community Earth System Model (CESM), the regional climate model Weather Research and Forecasting (WRF), and the regional chemistry model Community Multiscale Air Quality Modeling System (CMAQ) have been adopted to investigate the dynamical downscaling technique at fine horizontal resolution
  – With the high resolution, it is easier and more reasonable to examine the frequency and intensity of extreme events and air quality in different regions and cities
The figure shows the comparison of average surface temperature between the present and future cases. The average temperature for most of Eastern China has increased by the end of the 2050s, especially in the southern and eastern parts. Further analysis of the intensity and duration of extreme weather events such as heat waves and extreme precipitation will be performed and evaluated statistically for the areas of interest.
Fusion Microturbulence - GYRO
• Simulation of fusion microturbulence
• Solves 5-D coupled time-dependent nonlinear gyrokinetic-Maxwell equations with gyrokinetic ions and electrons
• Computes the turbulent radial transport of particles and energy in tokamak plasmas
• Can operate as a flux-tube (local) code, or as a global code, with electrostatic or electromagnetic fluctuations
• Propagates the system forward using either a 4th-order explicit RK integrator or a 2nd-order, implicit-explicit RK integrator with a fourth-order, explicit Eulerian algorithm

[Figure: GYRO nl02a benchmark, MPI only - timesteps per second vs. number of processors (50-550) for Darter (Intel E5-2670), Beacon (Xeon E5-2670), Ares (AMD Interlagos), and Kraken (AMD Istanbul)]
[Figure: GYRO nl02a benchmark, MPI/OpenMP with 128 MPI processes - timesteps per second vs. threads per process (1-8) for Darter, Beacon-Xeon, Ares, and Kraken]
Applications – storm prediction
• The Center for Analysis and Prediction of Storms (CAPS) at the University of Oklahoma ran Storm-Scale Ensemble Forecasts (SSEF)
  – supporting the National Oceanic and Atmospheric Administration (NOAA) Hazardous Weather Testbed (HWT) 2013 Spring Experiment
  – a yearly high-profile experiment that investigates the use of convection-allowing model forecasts as guidance for the prediction of hazardous convective weather
• The 2013 CAPS SSEF runs lasted seven weeks between April 22 and June 7, 2013, overlapping with the HWT 2013 Spring Experiment from May 6 to June 7, 2013
  – A thirty-member storm-scale ensemble forecast of 48 hours, at 4-km horizontal grid spacing over the entire continental United States, was produced daily
  – Three state-of-the-science numerical weather prediction (NWP) models were used:
    · Advanced Research version of the Weather Research and Forecast model (WRF-ARW)
    · Advanced Regional Prediction System (ARPS)
    · Navy COAMPS model system
  – Each ensemble member forecast has unique initial-condition and lateral-boundary-condition perturbations and model physics options
  – Over 140 WSR-88D Doppler weather radars and conventional observations over the CONUS were assimilated in real time for each member using the ARPS 3DVAR and Complex Cloud Analysis package
  – Ensemble forecast products, including probabilistic severe weather guidance, tornadic weather potential and intensity, and flash flood and damaging wind guidance, were made available to HWT participants 12-36 hours in advance
20-hour forecast initiated at 00 UTC May 20, 2013, valid at 20 UTC May 20, the time of the Moore tornado touchdown

3x runtime speedup
Applications – light water reactor modeling
• The Consortium for Advanced Simulation of Light Water Reactors (CASL) was established as the first U.S. Department of Energy (DOE) Innovation Hub, and was created to accelerate the application of advanced modeling and simulation (M&S) to the analysis of nuclear reactors.
• CASL applies existing M&S capabilities and develops advanced capabilities to create a usable environment for predictive simulation of light water reactors (LWRs).
• The CASL vision is to predict the performance of nuclear reactors through comprehensive, science-based modeling and simulation technology that is deployed and applied broadly throughout the nuclear energy industry to enhance safety, reliability, and economics.
• Darter was used by CASL as a development platform for a physics component known as MPACT, being developed primarily by staff and students at the University of Michigan.
  – MPACT is one of two components being developed to simulate the behavior of neutrons within reactor cores
  – MPACT performed multiple simulations that used over 3,000 cores for roughly 3.5 hours
  – A result is depicted at right
• One of the most significant milestones CASL has delivered, simulation of the Watts Bar Unit 1 zero power physics test, was completed successfully in early June of 2013.
Thermal Neutron Flux Distribution Generated by MPACT on Darter

2x runtime speedup compared to XE6
Applications – astrophysics
• The UTK-ORNL Astrophysics Theory Group uses its multidimensional radiation hydrodynamics code, CHIMERA, to make substantial strides in furthering our understanding of the explosion mechanism in core-collapse supernovae.
  – CHIMERA has been used for supernova modeling on previous generations of Cray machines and was easily ported to run on Darter.
• Core-collapse supernovae are the violent deaths of massive stars and are among the most powerful explosions in the universe
  – Core-collapse supernovae are an important link in our chain of origins from the Big Bang to the present.
  – During their lifetimes, massive stars are the dominant sites where elements in the periodic table between oxygen and iron are synthesized in the process known as stellar nucleosynthesis.
  – Explosion events then disperse these elements to the rest of the universe, making life as we know it possible.
• Nucleosynthesis does not cease during the explosion event of core-collapse supernovae.
  – There is growing evidence that core-collapse supernovae are responsible for producing half the elements heavier than iron in a process known as r-process nucleosynthesis.
Applications – astrophysics
3x runtime speedup compared to XE6
• The UTK-ORNL Astro Group is evolving a two-dimensional core-collapse supernova model from the inner core of a 15-solar-mass, non-rotating progenitor on a polar axisymmetric grid with 720 radial zones and 240 angular zones and a 150-species nuclear reaction network.
  – Early results are promising, and show quantitative differences with simulations of the same model using an α-network.
  – Qualitative comparisons, however, must be postponed until the explosion energy asymptotes and the tracer particle fates are determined, marking the “completion” of the model.
  – Prior experience has shown that this is typically achieved after roughly 1.5 seconds of evolution. After 10 days of computation on 1,920 cores, the model has evolved 650 ms in total, and the average shock radius has reached approximately 880 km.

Entropy plot of the model with the 150-species nuclear network at 300 ms after core bounce, where some of the characteristic behavior of the explosion, such as the SASI [3], is apparent.
Conclusions
• UTK acquired a Cray XC30 supercomputer, called Darter, with a peak performance of 248.9 Teraflops
• Deployed in late March of 2013 with a very aggressive production timeline
• Science productivity came immediately
  – Even supported a nightly operational project (storm prediction)
  – Most applications have seen a 2-3x speedup with just a simple recompile (no optimization)
• Application teams have benefited greatly from
  – the consistency of the Cray multigenerational software stack
  – quick ports and immediate speedups due to processor, interconnect, and compiler improvements
Acknowledgments
• We would like to thank
  – Joshua Fu for the extreme climate events highlight
  – Ming Xue for contribution of the CAPS highlight
  – John Turner for the contribution of the CASL highlight
  – James Austin for the contribution of the Astrophysics highlight
• This material is based upon work performed using computational resources supported by the University of Tennessee and Oak Ridge National Laboratory Joint Institute for Computational Sciences
Thank you
• Questions?
  – [email protected]
Extra
Acceptance Testing
• All production JICS resources are put through a defined set of acceptance tests to determine that their functionality and performance baselines meet expected parameters.
• Based on the size and expected use of the system, varying criteria are defined as components of the acceptance test.
• In general, the acceptance tests are divided into hardware and application tests. The hardware tests include initial diagnostics and testing that confirm the delivered hardware and firmware are correctly configured and functioning.
• Additionally, the system is configured to use the center’s infrastructure, including the necessary file systems, user authentication, and network.
• The application tests are further subdivided into functionality, performance, and stability phases.
• At the end of each phase, the application acceptance tests are run continuously on the system for a specified amount of time.
• This “continuous system integration” step ensures that all parts of the system are tested under conditions similar to production utilization.