Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems
Kei Davis and Fabrizio Petrini {kei,fabrizio}@lanl.gov
Performance and Architectures Lab (PAL), CCS-3
Computer and Computational Sciences Division
Los Alamos National Laboratory
Euro-Par 2004, Pisa, Italy

Overview
In this part of the tutorial we discuss the characteristics of some of the most powerful supercomputers. We classify these machines along three dimensions:
- Node Integration: how the processors and the network interface are integrated in a computing node
- Network Integration: what primitive mechanisms the network provides to coordinate the processing nodes
- System Software Integration: how the operating system instances are globally coordinated

We argue that the level of integration in each of these three dimensions, more than other parameters (such as distributed vs. shared memory, or vector vs. scalar processors), is the discriminating factor between large-scale supercomputers. With this in mind, we briefly characterize some existing and upcoming parallel computers.

ASCI Q: Los Alamos National Laboratory
- Total: 20.48 TF/s peak, #3 in the Top 500
- Systems: 2048 AlphaServer ES45s; 8,192 EV-68 1.25-GHz CPUs with 16-MB cache
- Memory: 22 Terabytes
- Interconnect: dual-rail Quadrics; 4096 QSW PCI adapters; four 1024-way QSW federated switches
- Operational in 2002

Node: HP (Compaq) AlphaServer ES45 21264 System Architecture
[Diagram: four EV68 1.25 GHz CPUs, each with 16 MB of cache, attached over 64-bit, 500 MHz buses (4.0 GB/s) to a quad C-chip controller; up to 32 GB of memory on four MMBs over 256-bit, 125 MHz paths (4.0 GB/s); two PCI chip buses serving ten PCI slots at 64-bit/66 MHz (528 MB/s) and 64-bit/33 MHz (266 MB/s), plus legacy I/O (USB, serial/parallel, keyboard/mouse, floppy).]

QsNET: Quaternary Fat Tree
- Hardware support for collective communication
- MPI latency 4 µs, bandwidth 300 MB/s
- Barrier latency less than 10 µs
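Latency and bandwidth figures like these are typically obtained with a ping-pong microbenchmark: two MPI ranks bounce a message back and forth; the one-way time for an empty message gives the latency, and large messages give the asymptotic bandwidth. A minimal sketch, not the actual Quadrics or LANL test code:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, reps = 1000;
        int sizes[2] = { 0, 1 << 20 };   /* 0 bytes for latency, 1 MB for bandwidth */
        char *buf = calloc(1, 1 << 20);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int s = 0; s < 2; s++) {
            int n = sizes[s];
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < reps; i++) {
                if (rank == 0) {          /* rank 0 sends, then waits for the echo */
                    MPI_Send(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                } else if (rank == 1) {   /* rank 1 echoes every message back */
                    MPI_Recv(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                    MPI_Send(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
                }
            }
            double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time */
            if (rank == 0) {
                if (n == 0) printf("latency: %.2f us\n", t * 1e6);
                else        printf("bandwidth: %.1f MB/s\n", n / t / 1e6);
            }
        }
        MPI_Finalize();
        return 0;
    }

Run with exactly two ranks placed on different nodes, so that the path under test is the network rather than shared memory.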
Interconnection Network
[Diagram: the ASCI Q federated quaternary fat tree, with six switch levels from the node level (1) through the mid level to the super top level (6). Sixteen 64U64D switches at the bottom level each connect 64 nodes (the 1st serves nodes 0-63, the 16th serves nodes 960-1023), for 1024 nodes per rail; the two rails together serve 2048 nodes.]

System Software
- The operating system is Tru64
- Nodes are organized in clusters of 32 for resource allocation and administration purposes (TruCluster)
- Resource management is executed through Ethernet (RMS)

ASCI Q: Overview
- Node Integration: Low (multiple boards per node, network interface on the I/O bus)
- Network Integration: High (HW support for atomic collective primitives)
- System Software Integration: Medium/Low (TruCluster)

ASCI Thunder: Lawrence Livermore National Laboratory
- 1,024 nodes, 4,096 processors, 23 TF/s peak, #2 in the Top 500

ASCI Thunder: Configuration
- 1,024 nodes, quad 1.4 GHz Itanium2, 8 GB DDR266 SDRAM per node (8 Terabytes total)
- 2.5 µs MPI latency and 912 MB/s bandwidth over Quadrics Elan4
- Barrier synchronization 6 µs, allreduce 15 µs
- 75 TB of local disk (73 GB/node, UltraSCSI320)
- Lustre file system with 6.4 GB/s delivered parallel I/O performance
- Linux RH 3.0, SLURM, Chaos

CHAOS: Clustered High Availability Operating System
Derived from Red Hat, but differs in the following areas:
- Modified kernel (Lustre and hardware specific)
- New packages for cluster monitoring, system installation, and power/console management
- SLURM, an open-source resource manager

ASCI Thunder: Overview
- Node Integration: Medium/Low (network interface on the I/O bus)
- Network Integration: Very High (HW support for atomic collective primitives)
- System Software Integration: Medium (Chaos)

System X: Virginia Tech, 10.28 TF/s
- 1100 dual Apple G5 nodes with 2 GHz CPUs
- 8 billion operations/second/processor (8 GFlops) peak double-precision floating-point performance
- Each node has 4 GB of main memory and 160 GB of Serial ATA storage; 176 TB total secondary storage
- InfiniBand: 8 µs latency and 870 MB/s bandwidth; partial support for collective communication
- System-level fault tolerance (Déjà vu)

System X: Overview
- Node Integration: Medium/Low (network interface on the I/O bus)
- Network Integration: Medium (limited support for atomic collective primitives)
- System Software Integration: Medium (system-level fault tolerance)
BlueGene/L System
Packaging hierarchy (peak compute / memory at each level):
- Chip (2 processors): 2.8/5.6 GF/s, 4 MB
- Compute card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
- Node card (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
- Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
- System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR

BlueGene/L Compute ASIC
[Diagram: two 440 cores, each with 32k/32k L1 caches and a "double FPU"; one core acts as I/O processor; L2 caches with snoop, a multiported shared SRAM buffer, and a shared L3 directory (with ECC) for 4 MB of EDRAM; a 144-bit-wide DDR controller with ECC driving 256/512 MB of external DDR; link units for the torus (6 out and 6 in, each at 1.4 Gbit/s per link), the tree (3 out and 3 in, each at 2.8 Gbit/s per link), the global interrupt network (4 global barriers or interrupts), Gbit Ethernet, and JTAG access. IBM CU-11, 0.13 µm process; 11 x 11 mm die; 25 x 32 mm CBGA; 474 pins, 328 signal; 1.5/2.5 Volt.]

[Photo: a node card carries 16 compute cards and 2 I/O cards, with DC-DC converters (40 V in, 1.5 and 2.5 V out).]

BlueGene/L Interconnection Networks
3-Dimensional Torus
- Interconnects all compute nodes (65,536)
- Virtual cut-through hardware routing
- 1.4 Gb/s on all 12 node links (2.1 GBytes/s per node)
- 350/700 GBytes/s bisection bandwidth
- Communications backbone for computations
Global Tree
- One-to-all broadcast functionality
- Reduction operations functionality
- 2.8 Gb/s of bandwidth per link; latency of a tree traversal on the order of 5 µs
- Interconnects all compute and I/O nodes (1024)
Ethernet
- Incorporated into every node ASIC
- Active in the I/O nodes (1:64)
- All external communication (file I/O, control, user interaction, etc.)
Low-Latency Global Barrier
- 8 single wires crossing the whole system, touching all nodes
Control Network (JTAG)
- Used for booting, checkpointing, and error logging
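The torus bisection figure can be checked by hand. Cutting the 64x32x32 machine in half across its longest dimension severs each of the 32 x 32 rings running in that dimension in two places (a torus ring wraps around), and each severed link carries 1.4 Gb/s, i.e. 175 MB/s, per direction. A worked version of that arithmetic, assuming the cut is across the 64-node dimension:

    B_{\mathrm{bisection}} = 2 \times (32 \times 32) \times 175~\mathrm{MB/s}
                           = 2048 \times 175~\mathrm{MB/s}
                           \approx 350~\mathrm{GB/s}

or about 700 GB/s counting both directions of the bidirectional links, matching the 350/700 GBytes/s figure above.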
BlueGene/L System Software Organization
- Compute nodes are dedicated to running the user application, and almost nothing else: a simple compute node kernel (CNK)
- I/O nodes run Linux and provide OS services: file access, process launch/termination, debugging
- Service nodes perform system management services (e.g., system boot, heartbeat, error monitoring), largely transparent to application/system software

Operating Systems
Compute nodes: CNK
- A specialized, simple OS: 5,000 lines of code, 40 KBytes in core
- No thread support, no virtual memory
- Protection: the kernel is protected from the application; some network devices are mapped in user space
- File I/O is offloaded ("function shipped") to the I/O nodes through kernel system calls
- "Boot, start the app, and then stay out of the way"
I/O nodes: Linux
- 2.4.19 kernel (2.6 underway) with ramdisk
- NFS/GPFS client
- CIO daemon to start/stop jobs and execute file I/O
Global OS (CMCS, on the service node)
- Invisible to user programs; makes global and collective decisions
- Interfaces with external policy modules (e.g., the job scheduler)
- Commercial database technology (DB2) stores static and dynamic state
- Handles partition selection, partition boot, running of jobs, and system error logs
- Checkpoint/restart mechanism
- Scalability, robustness, security: execution mechanisms in the core, policy decisions in the service node
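The "function shipping" idea is simple enough to sketch. In the sketch below a pair of local pipes stands in for the tree network, and all names (io_request, shipped_write, io_daemon_serve_one) are invented for illustration: this shows the concept, not IBM's actual CNK/CIOD protocol.

    #include <string.h>
    #include <unistd.h>

    /* Hypothetical wire format for a shipped write() system call. */
    struct io_request {
        int  len;
        char data[128];
    };

    static int cn_to_ion[2];   /* compute node -> I/O node (stand-in for the tree) */
    static int ion_to_cn[2];   /* I/O node -> compute node (result path)           */

    /* Compute-node side: CNK has no local file system, so the kernel
       packages the system call, ships it, and blocks on the reply. */
    static ssize_t shipped_write(const char *buf, int len)
    {
        struct io_request req;
        ssize_t result;
        req.len = len;
        memcpy(req.data, buf, len);
        write(cn_to_ion[1], &req, sizeof req);
        read(ion_to_cn[0], &result, sizeof result);
        return result;
    }

    /* I/O-node side: a daemon performs the real I/O on behalf of its
       compute nodes (1:64 on BlueGene/L) and ships the result back. */
    static void io_daemon_serve_one(void)
    {
        struct io_request req;
        ssize_t result;
        read(cn_to_ion[0], &req, sizeof req);
        result = write(STDOUT_FILENO, req.data, req.len);
        write(ion_to_cn[1], &result, sizeof result);
    }

    int main(void)
    {
        pipe(cn_to_ion);
        pipe(ion_to_cn);
        if (fork() == 0) {         /* the child process plays the I/O node */
            io_daemon_serve_one();
            _exit(0);
        }
        shipped_write("hello from the compute node\n", 28);
        return 0;
    }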
BlueGene/L: Overview
- Node Integration: High (the processing node integrates processors and network interfaces, and the network interfaces are directly connected to the processors)
- Network Integration: High (separate tree network)
- System Software Integration: Medium/High (compute kernels are not globally coordinated)
- #2 and #4 in the Top 500

Cray XD1 System Architecture
- Compute: 12 AMD Opteron 32/64-bit x86 processors; High-Performance Linux
- RapidArray Interconnect: 12 communications processors; 1 Tb/s switch fabric
- Active Management: dedicated processor
- Application Acceleration: 6 co-processors
- Processors are directly connected to the interconnect

Cray XD1 Processing Node
[Diagram: a chassis holds six 2-way SMP blades, four fans, six SATA hard drives, a 500 Gb/s crossbar switch with a 12-port inter-chassis connector, four independent PCI-X slots, and a connector to a second 500 Gb/s crossbar switch and 12-port inter-chassis connector.]

Cray XD1 Compute Blade
[Diagram: two AMD Opteron 2XX processors, each with 4 DIMM sockets of DDR 400 registered ECC memory, a RapidArray communications processor, and the connector to the main board.]

Fast Access to the Interconnect
[Diagram: in a conventional Xeon server the processor reaches the interconnect through a 1 GB/s PCI-X bus and a 0.25 GB/s GigE link, with 5.3 GB/s of DDR 333 memory bandwidth; in the Cray XD1 the Opteron (6.4 GB/s DDR 400 memory) reaches the RapidArray interconnect directly at 8 GB/s.]

Communications Optimizations
The RapidArray communications processor provides:
- HT/RA tunnelling with bonding
- Routing with route redundancy
- Reliable transport
- Short-message latency optimization
- DMA operations
- System-wide clock synchronization
[Diagram: each Opteron connects to its RapidArray communications processor at 3.2 GB/s over HyperTransport, with two 2 GB/s RapidArray links into the fabric.]

Active Manager System
- Usability: single-system command and control
- Resiliency: dedicated management processors, real-time OS and communications fabric; proactive background diagnostics with self-healing
- Synchronized Linux kernels

Cray XD1: Overview
- Node Integration: High (direct access from HyperTransport to RapidArray)
- Network Integration: Medium/High (HW support for collective communication)
- System Software Integration: High (compute kernels are globally coordinated)
- Early stage

ASCI Red Storm
Red Storm Architecture
- Distributed-memory MIMD parallel supercomputer
- Fully connected 3D mesh interconnect; each compute node processor has a bidirectional connection to the primary communication network
- 108 compute node cabinets and 10,368 compute node processors (AMD Sledgehammer @ 2.0 GHz)
- ~10 TB of DDR memory @ 333 MHz
- Red/Black switching: ~1/4, ~1/2, ~1/4
- 8 service and I/O cabinets on each end (256 processors for each color)
- 240 TB of disk storage (120 TB per color)

Red Storm Architecture (continued)
- Functional hardware partitioning: service and I/O nodes, compute nodes, and RAS nodes
- Partitioned operating system: Linux on service and I/O nodes, LWK (Catamount) on compute nodes, stripped-down Linux on RAS nodes
- Separate RAS and system management network (Ethernet)
- Router-table-based routing in the interconnect
[Diagram: service, compute, file I/O, and net I/O partitions; users and /home attach on the service side.]

System Layout (27 x 16 x 24 mesh)
[Diagram: normally classified and normally unclassified sections with switchable nodes between them, separated by disconnect cabinets.]

Red Storm System Software
- Run-time system: logarithmic loader (fast, efficient); node allocator
- Batch system: PBS
- Libraries: MPI, I/O, math
- File systems being considered include PVFS (interim file system), Lustre (Pathforward support), Panasas, ...
- Operating systems: Linux on service and I/O nodes; Sandia's LWK (Catamount) on compute nodes; Linux on RAS nodes

ASCI Red Storm: Overview
- Node Integration: High (direct access from HyperTransport to the network through a custom network interface chip)
- Network Integration: Medium (no support for collective communication)
- System Software Integration: Medium/High (scalable resource manager, no global coordination between nodes)
- Expected to become the most powerful machine in the world (competition permitting)
Overview
Summary of the classification (collected from the per-machine overviews above):

                 Node Integration   Network Integration   Software Integration
  ASCI Q         Low                High                  Medium/Low
  ASCI Thunder   Medium/Low         Very High             Medium
  System X       Medium/Low         Medium                Medium
  BlueGene/L     High               High                  Medium/High
  Cray XD1       High               Medium/High           High
  Red Storm      High               Medium                Medium/High

A Case Study: ASCI Q
- We try to provide some insight into what we perceive to be the important problems in a large-scale supercomputer
- Our hands-on experience with ASCI Q shows that the system software and its global coordination are fundamental in a large-scale parallel machine

ASCI Q
- 2,048 ES45 AlphaServers, with 4 processors per node and 16 GB of memory per node; 8,192 processors in total
- 2 independent network rails, Quadrics Elan3; > 8,192 cables
- 20 TFlops peak, #2 in the Top 500 lists
- A complex human artifact

Dealing with the Complexity of a Real System
- In this section of the tutorial we provide insight into the methodology we used to substantially improve the performance of ASCI Q
- The methodology is based on an arsenal of analytical models, custom microbenchmarks, full applications, and discrete event simulators
- We deal both with the complexity of the machine and with the complexity of a real parallel application, SAGE, with > 150,000 lines of Fortran & MPI code

Overview of the Case Study
- Our performance expectations for ASCI Q, and the reality
- Identification of performance factors: application performance and its breakdown into components; detailed examination of system effects
- A methodology to identify operating system effects
- Effect of scaling, up to 2,000 nodes / 8,000 processors
- Quantification of the impact
- Towards the elimination of overheads: demonstrated over 2x performance improvement
- Generalization of our results: application resonance
- Bottom line: the importance of integrating the various system activities across nodes

Performance of SAGE on 1,024 Nodes
- Performance is consistent across QA and QB (the two segments of ASCI Q, with 1,024 nodes / 4,096 processors each)
- The measured cycle time is 2x greater than the model predicts at 4,096 PEs
[Plot: SAGE cycle time (s) vs. #PEs, measurements of Sep-21-02 and Nov-25-02 against the model; lower is better. There is a difference. Why?]

Using Fewer PEs per Node
- Test performance using 1, 2, 3 and 4 PEs per node
[Plot: SAGE on QB (timing.input), cycle time (s) vs. #PEs for 1, 2, 3 and 4 PEs per node; lower is better.]

Using Fewer PEs per Node (continued)
- Measurements match the model almost exactly for 1, 2 and 3 PEs per node!
[Plot: error (s), measured minus model, vs. #PEs for 1, 2, 3 and 4 PEs per node.]
- The performance issue occurs only when using 4 PEs per node

Mystery #1
SAGE performs significantly worse on ASCI Q than was predicted by our model.

SAGE Performance Components
- Look at SAGE in terms of its main components: put/get (point-to-point boundary exchange) and collectives (allreduce, broadcast, reduction)
[Plot: SAGE on QB breakdown (timing.input); time per cycle vs. #PEs for token_allreduce, token_bcast, token_get, token_put, token_reduction, and the total cyc_time.]
- The performance issue seems to occur only in the collective operations

Performance of the Collectives
- Measure collective performance separately
[Plot: allreduce latency (ms) vs. number of nodes, for 1, 2, 3 and 4 processes per node.]
- The collectives (e.g., allreduce and barrier) mirror the performance of the application

Identifying the Problem within SAGE
- SAGE -> simplify -> allreduce

Exposing the Problems with Simple Benchmarks
- Challenge: identify the simplest benchmark that exposes the problem
- Start from allreduce and add complexity until the benchmark reproduces the behavior; a minimal version is sketched below
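The simplest such benchmark is a tight loop of allreduces over a single value, timed and averaged. A minimal sketch, not the exact LANL benchmark:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, reps = 10000;
        double in = 1.0, out;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Warm up once, synchronize, then time a tight loop of
           8-byte allreduces. */
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++)
            MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t = (MPI_Wtime() - t0) / reps;

        if (rank == 0)
            printf("%d procs: mean allreduce latency %.1f us\n", nprocs, t * 1e6);
        MPI_Finalize();
        return 0;
    }

Running it with 1, 2, 3 and 4 processes per node at increasing node counts reproduces the sensitivity shown in the plot above: the latency is well-behaved until the fourth processor of each node is used.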
Interconnection Network and Communication Libraries
- The initial (obvious) suspects were the interconnection network and the MPI implementation
- We tested in depth the network, the low-level transmission protocols, and several allreduce algorithms
- We also implemented allreduce in the Network Interface Card
- By changing the synchronization mechanism we were able to reduce the latency of an allreduce benchmark by a factor of 7
- But we got only small improvements in SAGE (5%)

Mystery #2
Although SAGE spends half of its time in allreduce (at 4,096 processors), making allreduce 7 times faster leads to only a small performance improvement.

Computational Noise
- After ruling out the network and MPI, we focused our attention on the compute nodes
- Our hypothesis is that computational noise is generated inside the processing nodes
- This noise "freezes" a running process for a certain amount of time and generates a "computational hole"

Computational Noise: Intuition
- Running 4 processes on all 4 processors of an AlphaServer ES45: the computation of one process is interrupted by an external event (e.g., a system daemon or the kernel)

Computational Noise: 3 Processes on 3 Processors
- Running 3 processes on 3 processors of an AlphaServer ES45: the "noise" can run on the 4th, idle processor without interrupting the other 3 processes

Coarse-Grained Measurement
- We execute a computational loop for 1,000 seconds on all 4,096 processors of QB

Coarse-Grained Computational Overhead per Process
- The slowdown per process is small, between 1% and 2.5%
[Plot: per-process slowdown; lower is better.]

Mystery #3
Although the "noise" hypothesis could explain SAGE's suboptimal performance, the microbenchmarks of per-processor noise indicate that at most 2.5% of performance is lost to noise.

Fine-Grained Measurement
- We run the same benchmark for 1,000 seconds, but we measure the run time of every millisecond-long chunk (sketched below)
- This fine granularity is representative of many ASCI codes
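In outline, the microbenchmark is just a timed loop: perform a fixed chunk of pure computation sized to about 1 ms, measure how long each chunk actually took, and record the outliers. A sketch, assuming the WORK constant is calibrated per machine (the real benchmark recorded full per-process histograms):

    #include <stdio.h>
    #include <time.h>

    #define CHUNKS 1000000L     /* one million ~1 ms chunks, as in the slides */
    #define WORK   200000L      /* loop count: tune so one chunk takes ~1 ms  */

    static double now_ms(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e3 + ts.tv_nsec * 1e-6;
    }

    int main(void)
    {
        volatile double x = 1.0;
        double min = 1e30;
        long delayed = 0;

        for (long c = 0; c < CHUNKS; c++) {
            double t0 = now_ms();
            for (long i = 0; i < WORK; i++)
                x = x * 1.0000001 + 1e-9;        /* pure computation: no I/O, no MPI */
            double dt = now_ms() - t0;
            if (dt < min) min = dt;              /* track the noiseless chunk time   */
            else if (dt > 1.5 * min) delayed++;  /* outlier: external interference   */
        }
        printf("min chunk %.3f ms, %ld of %ld chunks delayed (%.2f%%)\n",
               min, delayed, CHUNKS, 100.0 * delayed / CHUNKS);
        return 0;
    }

One copy is launched on every processor; plotting each process's (or each node's) distribution of chunk times is what exposes the structure shown next.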
Fine-Grained Computational Overhead per Node
- We now compute the slowdown per node, rather than per process
- The noise has a clear, per-cluster structure
[Plot: per-node slowdown; the optimum is 0 (lower is better).]

Finding #1
Analyzing noise on a per-node basis reveals a regular structure across nodes.

Noise in a 32-Node Cluster
- The Q machine is organized in 32-node clusters (TruCluster)
- In each cluster there is a cluster manager (node 0), a quorum node (node 1), and the RMS data collection point (node 31)

Per-Node Noise Distribution
- Plot the distribution of one million 1 ms computational chunks
- In an ideal, noiseless machine the distribution is a single bar at 1 ms, with one million points per process (4 million per node)
- Every outlier identifies a computation that was delayed by external interference
- We show the distributions for the standard cluster nodes, and also for nodes 0, 1 and 31

Cluster Nodes (2-30)
- 10% of the time, the execution of a 1 ms chunk of computation is delayed
[Histogram of chunk durations.]

Node 0, Cluster Manager
- We can identify 4 main sources of noise
[Histogram of chunk durations.]

Node 1, Quorum Node
- One source of heavyweight noise (335 ms!)
[Histogram of chunk durations.]

Node 31
- Many fine-grained interruptions, between 6 and 8 milliseconds
[Histogram of chunk durations.]

The Effect of the Noise
- An application is usually a sequence of a computation followed by a synchronization (collective)
- But if an event happens on even a single node, it can delay all the other nodes

Effect of System Size
- The probability of a random event occurring during a given computation increases with the node count
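This amplification can be made precise with a simple independence model (our illustration; real noise is not independent). If each node suffers a delay during a given compute phase with probability p, a barrier-synchronized iteration over N nodes proceeds at full speed only if every node is clean:

    P_{\mathrm{delayed}} = 1 - (1 - p)^{N}

Even p = 0.001 per 1 ms chunk gives 1 - 0.999^4096 ≈ 0.98 on 4,096 processors: virtually every iteration is held up by some node, even though each individual processor loses almost nothing.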
Tolerating Noise: Buffered Coscheduling (BCS)
- We can tolerate the noise by coscheduling the activities of the system software on each node

Discrete Event Simulator Used to Model the Noise
- A DES is used to examine and identify the impact of the noise; it takes as input the harmonics that characterize the noise
- The noise model closely approximates the experimental data
- The primary bottleneck is the fine-grained noise generated by the compute nodes (Tru64)
[Plot: simulated vs. measured performance; lower is better.]
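The heart of such a simulator fits in a few lines: in each iteration every node computes for one granule plus whatever noise hits it, and the collective waits for the slowest node. The toy version below, with all noise probabilities and durations invented for illustration, already reproduces the qualitative result:

    #include <stdio.h>
    #include <stdlib.h>

    #define NODES 1024
    #define ITERS 10000
    #define GRAIN 1.0           /* ms of computation between collectives */

    /* Invented noise parameters, for illustration only:
       fine  = 10% chance of a +1 ms delay, on every node
       heavy = 0.01% chance of a +300 ms delay, on node 0 only */
    static double noise(int node, int fine, int heavy)
    {
        double d = 0.0;
        if (fine && rand() / (double)RAND_MAX < 0.10)
            d += 1.0;
        if (heavy && node == 0 && rand() / (double)RAND_MAX < 0.0001)
            d += 300.0;
        return d;
    }

    static double run(int fine, int heavy)
    {
        double total = 0.0;
        for (int it = 0; it < ITERS; it++) {
            double slowest = GRAIN;
            for (int n = 0; n < NODES; n++) {   /* the collective waits for the slowest node */
                double t = GRAIN + noise(n, fine, heavy);
                if (t > slowest) slowest = t;
            }
            total += slowest;
        }
        return total;
    }

    int main(void)
    {
        printf("no noise:   %8.0f ms\n", run(0, 0));
        printf("heavy only: %8.0f ms\n", run(0, 1));
        printf("fine only:  %8.0f ms\n", run(1, 0));
        return 0;
    }

With these parameters the rare 300 ms event on one node costs a few hundred milliseconds in total, while the frequent 1 ms noise on every node nearly doubles the run time, which is exactly the point of Finding #2 below.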
Finding #2
On fine-grained applications, more performance is lost to short but frequent noise on all nodes than to long but less frequent noise on just a few nodes.

Incremental Noise Reduction
1. Removed about 10 daemons from all nodes (including envmod, insightd, snmpd, lpd, niff)
2. Decreased the RMS monitoring frequency by a factor of 2 on each node (from an interval of 30 s to 60 s)
3. Moved several daemons from nodes 1 and 2 to node 0 on each cluster

Improvements in the Barrier Synchronization Latency
[Plot: barrier latency before and after the noise reduction.]

Resulting SAGE Performance
[Plot: SAGE cycle time (s) vs. #PEs, up to 8,192 PEs, for the model and the measurements of Sep-21-02, Nov-25-02, Jan-27-03 and May-01-03, plus the Jan-27-03 and May-01-03 minima.]
- Nodes 0 and 31 were also configured out in the optimization

Finding #3
We were able to double SAGE's performance by selectively removing noise caused by several types of system activities.

Generalizing Our Results: Application Resonance
- The computational granularity of a balanced bulk-synchronous application correlates with the type of noise that hurts it
- Intuition: any noise source has a negative impact, but a few noise sources tend to have a major impact on a given application
- Rule of thumb: the computational granularity of the application "enters into resonance" with noise of the same order of magnitude
- Performance can be enhanced by selectively removing sources of noise
- Knowing the computational granularity of a given application, we can provide a reasonable estimate of the performance improvement

Cumulative Noise Distribution, Sequence of Barriers with No Computation
- Most of the latency is generated by the fine-grained, high-frequency noise of the cluster nodes
[Plot: cumulative noise distribution.]

Conclusions
- A combination of measurement, simulation and modeling to identify and resolve performance issues on Q
- Used modeling to determine that a problem exists
- Developed computation kernels to quantify OS events: the effect increases with the number of nodes, and the impact is determined by the computational granularity of the application
- Application performance has significantly improved
- The method is also being applied to other large systems

About the Authors
Kei Davis is a team leader and technical staff member at Los Alamos National Laboratory (LANL), where he is currently working on system software solutions for the reliability and usability of large-scale parallel computers. Previous work at LANL includes computer system performance evaluation and modeling, large-scale computer system simulation, and parallel functional language implementation. His research interests are centered on parallel computing; more specifically, various aspects of operating systems, parallel programming, and programming language design and implementation. Kei received his PhD in Computing Science from Glasgow University and his MS in Computation from Oxford University. Before his appointment at LANL he was a research scientist at the Computing Research Laboratory at New Mexico State University.

Fabrizio Petrini is a member of the technical staff of the CCS-3 group at Los Alamos National Laboratory (LANL). He received his PhD in Computer Science from the University of Pisa in 1997. Before his appointment at LANL he was a research fellow of the Computing Laboratory of Oxford University (UK), a postdoctoral researcher at the University of California at Berkeley, and a member of the technical staff of the Hewlett-Packard Laboratories. His research interests include various aspects of supercomputers, including high-performance interconnection networks and network interfaces, job scheduling algorithms, parallel architectures, operating systems and parallel programming languages. He has received numerous awards from the NNSA for contributions to supercomputing projects, and from other organizations for scientific publications.