Scalable Fault Tolerance: Xen Virtualization for PGAS Models on High-Performance Networks

Daniele Scarpazza, Oreste Villa, Fabrizio Petrini, Jarek Nieplocha, Vinod Tipparaju, Manoj Krishnan (Pacific Northwest National Laboratory)
Radu Teodorescu, Jun Nakano, Josep Torrellas (University of Illinois)
Duncan Roweth (Quadrics)
In collaboration with Patrick Mullaney (Novell) and Wayne Augsburger (Mellanox)
FastOS, Santa Clara CA, June 18, 2007

Project Motivation

[Figure: MTBF (hours) as a function of system size (number of nodes), for component MTBFs of 10,000, 100,000, and 1,000,000 hours; TeraFLOPS- and PetaFLOPS-class systems are marked]

• Component count in high-end systems has been growing
• How do we utilize large (10^3 to 10^5 processor) systems for solving complex science problems?
• Fundamental problems:
  – Scalability to massive processor counts
  – Application performance on a single processor given the increasingly complex memory hierarchy
  – Hardware and software failures

Multiple FT Techniques

• Application drivers
  – The multidisciplinary, multiresolution, and multiscale nature of scientific problems drives the demand for high-end systems
  – Applications place increasingly different demands on system resources: disk, network, memory, and CPU
  – Some applications have natural fault resiliency and require very little support
• System drivers
  – Different I/O configurations, programmable or simple/commodity NICs, proprietary/custom/commodity operating systems
  – Tradeoffs between acceptable rates of failure and cost
  – Cost effectiveness is the main constraint in HPC
• Therefore, it is not cost-effective or practical to rely on a single fault-tolerance approach for all applications and systems

Key Elements of SFT

• BCS: Buffered Coscheduling provides global coordination of system activities and communication
• ReVive and ReVive I/O provide efficient checkpoint/restart (CR) capability for shared-memory servers (cluster nodes)
• IBA/QsNET: virtualization of high-performance network interfaces and protocols (focus of this talk)
• FT ARMCI: fault-tolerance module for the ARMCI runtime system
• Xen: hypervisor to enable virtualization of the compute-node environment, including the OS (external dependency)

Transparent System-Level CR of PGAS Applications on Infiniband and QsNET

• We explored a new approach to cluster fault tolerance by integrating Xen with the latest generations of Infiniband and Quadrics high-performance networks
• Focus on Partitioned Global Address Space (PGAS) programming models; most existing work has focused on MPI
• Design goals: low-overhead and transparent migration

Main Contributions

• Integration of Xen and Infiniband
  – Enhanced Xen's kernel modules to fully support user-level Infiniband protocols and IP over IB with minimal overhead
• Support for Partitioned Global Address Space (PGAS) programming models, with emphasis on ARMCI
• Automatic detection of a Global Recovery Line and coordinated migration
  – Perform a live migration without any change to user applications
• Experimental evaluation

Xen Hypervisor

• On each machine, Xen allows the creation of a privileged virtual machine (Dom0) and one or more non-privileged VMs (DomUs)
• Xen provides the ability to pause, un-pause, checkpoint, and resume DomUs
• Xen employs para-virtualization
  – Non-privileged domains run a modified operating system featuring guest device drivers
  – Their requests are forwarded to the native device driver in Dom0 using a split driver model
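To make the pause/checkpoint/resume capability concrete, here is a minimal sketch (not taken from the talk) of a coordinator process driving Xen 3.x's xm management tool; the domain name guest-node and the image path are hypothetical placeholders.

```c
/* Sketch only: shells out to the Xen 3.x "xm" tool to pause, checkpoint,
 * and resume a guest domain.  "guest-node" and the save path are examples. */
#include <stdio.h>
#include <stdlib.h>

static int run(const char *cmd)
{
    int rc = system(cmd);            /* invoke the xm management tool */
    if (rc != 0)
        fprintf(stderr, "command failed (%d): %s\n", rc, cmd);
    return rc;
}

int main(void)
{
    /* Pause and un-pause a guest, e.g. while a global recovery line is enforced. */
    run("xm pause guest-node");
    run("xm unpause guest-node");

    /* Checkpoint: save the DomU memory image to disk; the domain stops running. */
    run("xm save guest-node /var/lib/xen/save/guest-node.img");

    /* Resume: rebuild the domain from the saved image on this (or another) host. */
    run("xm restore /var/lib/xen/save/guest-node.img");
    return 0;
}
```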
Infiniband Device Driver

• The driver is implemented in two sections:
  – A paravirtualized section for slow-path control operations (e.g., queue pair creation)
  – A direct-access section for fast-path data operations (transmit/receive)
• Based on the Ohio State/IBM implementation
• The driver was extended to support additional CPU architectures and Infiniband adapters
• Added a proxy layer to allow subnet and connection management from guest VMs
• Suspend/resume is propagated to the applications, not only to kernel modules
• Several stability improvements

Software Stack

[Figure: software stack]

Xen/Infiniband Device Driver: Communication Performance

[Figures: communication performance of the Xen/Infiniband device driver]

Parallel Programming Models

• Single threaded
  – Data parallel, e.g. HPF
• Multiple processes
  – Partitioned-local data access: MPI
  – Uniform-global-shared data access: OpenMP
  – Partitioned-global-shared data access: Co-Array Fortran
  – Uniform-global-shared + partitioned data access: UPC, Global Arrays, X10

Fault Tolerance in PGAS Models

• Implementation considerations:
  – One-sided communication, perhaps some two-sided and collectives
  – Special considerations in the implementation of the global recovery line
  – Memory operations need to be synchronized for checkpoint/restart
• Memory is a combination of local and global (globally visible) memory
  – Global memory may be shared from the OS point of view
  – Pinned and registered with the network adapter

[Figure: SMP nodes 0 through n, each with processes (P), local memory (L), and global memory (G), connected by the network]

Xen-enabled ARMCI

• ARMCI: runtime system for one-sided communication
  – Used by Global Arrays, Rice Co-Array Fortran, GPSHMEM; IBM X10 port under way
  – Portable high-performance remote memory copy interface
• Fundamental communication models in HPC:
  – Message passing (2-sided model): P0 sends, P1 receives
  – Remote memory access, RMA (1-sided model): P0 puts data directly into P1's memory
  – Shared memory load/stores (0-sided model): A = B
• Asynchronous remote memory access (RMA)
• Fast collective operations
• Zero-copy protocols, explicit NIC support
• "Pure" non-blocking communication, 99.9% overlap
• Data locality: shared memory within an SMP node and RMA across nodes
• High performance delivered on a wide range of platforms
• Multi-protocol and multi-method implementation

[Figure: examples of data transfers optimized in ARMCI]

Global Recovery Lines (GRLs)

• A GRL is required before each checkpoint / migration
• A GRL is required for Infiniband networks because:
  – IBA does not allow location-independent layer 2 and 3 addresses
  – IBA hardware maintains stateful connections not accessible by software
• The protocol that enforces a GRL has:
  – A drain phase, which completes any ongoing communication
  – Followed by a global silence, during which it is possible to perform node migration
  – And a resume phase, in which the processing nodes acquire knowledge of the new network topology
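The drain / global-silence / resume sequence can be summarized in code. Below is a minimal sketch, assuming an already-initialized ARMCI job: ARMCI_AllFence() and ARMCI_Barrier() are real ARMCI calls, while the two ib_* helpers are hypothetical stand-ins for the Dom0 driver and hypervisor actions described above, not part of any published API.

```c
/* Sketch of the three GRL phases as seen from an ARMCI process.
 * Assumes ARMCI has been initialized; the ib_* helpers are placeholders. */
#include <stdio.h>
#include "armci.h"

static void ib_suspend_and_checkpoint(void)
{
    /* Placeholder: tear down stateful IB connections, then checkpoint or
     * live-migrate the guest domain while the network is silent. */
}

static void ib_rebuild_connections(void)
{
    /* Placeholder: re-register memory and re-create queue pairs against the
     * new topology learned after migration. */
}

void global_recovery_line(void)
{
    /* Drain phase: complete every outstanding one-sided operation issued
     * locally, then synchronize so nothing is left in flight anywhere. */
    ARMCI_AllFence();
    ARMCI_Barrier();

    /* Global silence: the network is idle, so nodes may be checkpointed or migrated. */
    ib_suspend_and_checkpoint();

    /* Resume phase: reconnect under the new topology before application
     * communication restarts. */
    ib_rebuild_connections();
    ARMCI_Barrier();
    printf("GRL complete: application traffic may continue\n");
}
```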
GRL and Resource Management

[Figure: GRL and resource management]

Experimental Evaluation

• Our experimental testbed is a cluster of 8 Dell PowerEdge 1950 nodes
• Each node has two dual-core Intel Xeon Woodcrest processors running at 3.0 GHz, with 8 GB of memory
• The cluster is interconnected with Mellanox InfiniHost III 4X HCA adapters
• SUSE Linux Enterprise Server 10.0
• Xen 3.0.2

Timing of a GRL

[Figure: timing of a GRL]

Scalability of Network Drain

[Figure: network drain scalability]

Scalability of Network Resume

[Figure: network resume scalability]

Save and Restore Latencies

[Figure: save and restore latencies]

Migration Latencies

[Figure: migration latencies]

Conclusion

• We have presented a novel software infrastructure that allows completely transparent checkpoint/restart
• We have implemented a device driver that enhances the existing Xen/Infiniband drivers
• Support for PGAS programming models
• Minimal overhead, on the order of tens of milliseconds; most of the time is spent saving/restoring the node image