FAST-OS BOF, SC '04
http://www.cs.unm.edu/~fastos (follow the link to subscribe to the mailing list)

Projects
• Colony (Terry Jones, LLNL)
• Config Framework (Ron Brightwell, SNL)
• DAiSES (Pat Teller, UTEP)
• K42 (Paul Hargrove, LBNL)
• MOLAR (Stephen Scott, ORNL)
• Peta-Scale SSI (Scott Studham, ORNL)
• Rightweight Kernels (Ron Minnich, LANL)
• Scalable FT (Jarek Nieplocha, PNNL)
• SmartApps (L. Rauchwerger, Texas A&M)
• ZeptoOS (Pete Beckman, ANL)

Colony: Services & Interfaces for Very Large Linux Clusters (www.HPC-Colony.org)
Terry Jones, LLNL, Coordinating PI; Laxmikant Kale, UIUC, PI; Jose Moreira, IBM, PI; Celso Mendes, UIUC; Derek Lieber, IBM

Colony Overview
Collaborators: Lawrence Livermore National Laboratory, University of Illinois at Urbana-Champaign, International Business Machines
Title: Services and Interfaces to Support Systems with Very Large Numbers of Processors
Topics:
• Parallel resource instrumentation framework
• Scalable load balancing
• OS mechanisms for migration
• Processor virtualization for fault tolerance
• Single system management space
• Parallel awareness and coordinated scheduling of services
• Linux OS for cellular architecture

Colony Motivation
• Parallel resource management: strategies for scheduling and load balancing must be improved. Difficulties in achieving a balanced partitioning and in dynamically scheduling workloads can limit scaling for complex problems on large machines.
• Global system management: system management is inadequate. Parallel jobs require common operating system services, such as process scheduling, event notification, and job management, to scale to large machines.

Colony Goals
• Develop infrastructure and strategies for automated parallel resource management.
– Today, application programmers must explicitly manage these resources. We address scaling and porting issues by delegating resource-management tasks to a sophisticated parallel OS.
– "Managing resources" includes balancing CPU time, network utilization, and memory usage across the entire machine.
• Develop a set of services that improve the OS's ability to support systems with very large numbers of processors.
– We will improve operating system awareness of the requirements of parallel applications.
– We will enhance operating system support for parallel execution by providing coordinated scheduling and improved management services for very large machines.

Colony Approach
• Top down: our work starts from an existing full-featured OS and removes excess baggage.
• Processor virtualization, one of our core techniques: the programmer divides the computation into a large number of entities, which an intelligent runtime system maps onto the available processors (see the sketch at the end of this section).
• Leverage the advantages of a full-featured OS and a single system image: applications on these extreme-scale systems will benefit from extensive services and interfaces, and managing these complex systems will require an improved "logical view".
• Utilize Blue Gene: a suitable platform for ideas intended for very large numbers of processors.
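The over-decomposition behind processor virtualization can be made concrete with a toy runtime. Below is a minimal C sketch, assuming only pthreads: the computation is split into many more chunks ("virtual processors") than OS threads, and a trivial shared-cursor scheduler maps chunks onto workers, so an idle worker naturally picks up the slack of a loaded one. Colony's actual runtime (built on the Charm++ line of work at UIUC) adds migration and measurement-based balancing; all names here are illustrative.

    /* Over-decomposition sketch: NCHUNK virtual processors are mapped onto
     * NWORKER OS threads by a trivial runtime.  Because work is expressed in
     * units much smaller than a physical processor, the runtime can balance
     * load by reassigning chunks instead of touching the application. */
    #include <pthread.h>
    #include <stdio.h>

    #define NCHUNK  64        /* virtual processors (application entities) */
    #define NWORKER  4        /* physical processors (OS threads)          */

    static int next_chunk;    /* shared cursor: the simplest load balancer */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void do_chunk(int id)         /* stand-in for one entity's work */
    {
        volatile double x = 0;
        for (int i = 0; i < 100000 * (1 + id % 7); i++)  /* uneven on purpose */
            x += i * 0.5;
    }

    static void *worker(void *arg)
    {
        long me = (long)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            int c = next_chunk < NCHUNK ? next_chunk++ : -1;
            pthread_mutex_unlock(&lock);
            if (c < 0)
                break;                   /* no chunks left */
            do_chunk(c);
            printf("worker %ld ran chunk %d\n", me, c);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NWORKER];
        for (long i = 0; i < NWORKER; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NWORKER; i++)
            pthread_join(t[i], NULL);
        return 0;
    }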
Configurable OS Framework
• Sandia, lead: Ron Brightwell, PI; Rolf Riesen
• Caltech: Thomas Sterling, PI
• UNM: Barney Maccabe, PI; Patrick Bridges

Issues
• Novel architectures
– Lots of execution environments
• Programming models
– MPI, UPC, separating processing from location
• Shared services
– File systems, shared WAN
• Usage model
– Dedicated, space shared, time shared

Approach
• Build application-specific OSes
– Driven by the architecture, programming model, shared resources, and usage model
• Develop a collection of micro services
– Compose and distribute
• Compose services
– Services may adapt
• Kinds of services
– Memory allocation, signal delivery, message receipt and handler activation

The Picture
(architecture figure omitted)

Challenges
• How to reason about combinations
• Dependencies among services
• Efficiency
– Overhead associated with transfers between micro services
• How many operating systems will we really need?

DAiSES: Dynamic Adaptability in Support of Extreme Scale

Goals
(figure: resource management moves from generalized to customized, and OS/runtime services from fixed to dynamically adaptable, yielding enhanced performance)

Challenges
Determining
• what to adapt
• when to adapt
• how to adapt
• how to measure the effects of adaptation

Deliverables
• Develop mechanisms to dynamically sense, analyze, and adjust to common performance metrics, fluctuating workload situations, and overall system environment conditions
• Demonstrate, via Linux prototypes and experiments, dynamic self-tuning/provisioning in HPC environments
• Develop a methodology for general-purpose OS adaptation

Methodology
(a toy version of this loop is sketched at the end of this section)
1. Identify adaptation targets: characterize workload resource-usage patterns to find potential adaptation targets (off line).
2. (Re)determine adaptation intervals (off line / run time).
3. Define/adapt heuristics to trigger adaptation.
4. Generate/adapt monitoring, triggering, and adaptation code, and attach it to the OS.
5. Monitor application execution, triggering adaptation as necessary.

KernInst
Dynamic instrumentation of the kernel.
(figure: on an IBM pSeries eServer 690, an instrumentation tool on a client machine drives the KernInst daemon, which patches the running Linux kernel through the KernInst device and the KernInst API)
• KernInst and Kperfmon provide the capability to perform dynamic monitoring and adaptation of commodity operating systems.
• The University of Wisconsin's KernInst and Kperfmon make the problem of run-time monitoring and adaptation more tractable.

Example Adaptations
Customization of:
• process scheduling parameters and algorithms, e.g., scheduling policy for different job types (prototype in progress)
• file system cache size and management
• disk cache management
• size of OS buffers and tables
• I/O, e.g., checkpoint/restart
• memory allocation and management parameters and algorithms

Partners
• University of Texas at El Paso, Department of Computer Science: Patricia J. Teller
• University of Wisconsin–Madison, Computer Sciences Department: Barton P. Miller
• International Business Machines, Inc., Linux Technology Center: Bill Buros
• U.S. Department of Energy, Office of Science: Fred Johnson
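To make the monitor/trigger/adapt cycle concrete, here is a toy user-space loop in C. It is only a sketch of the methodology: the metric (glibc's getloadavg) and the tunable (a hypothetical buffer-pool size) stand in for the kernel metrics and parameters that DAiSES instruments through KernInst.

    /* Toy monitor/trigger/adapt loop.  DAiSES splices such logic into a
     * running kernel via KernInst; here the "metric" is the 1-minute load
     * average and the "adaptation" resizes a hypothetical buffer pool. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int pool_pages = 256;                /* hypothetical tunable        */
        for (int iter = 0; iter < 10; iter++) {
            double load;
            if (getloadavg(&load, 1) != 1)   /* monitor: sample the metric  */
                break;
            if (load > 2.0 && pool_pages > 64)
                pool_pages /= 2;             /* trigger + adapt: shrink     */
            else if (load < 0.5 && pool_pages < 4096)
                pool_pages *= 2;             /* or grow when system is idle */
            printf("load=%.2f pool=%d pages\n", load, pool_pages);
            sleep(1);            /* the (re)determined adaptation interval */
        }
        return 0;
    }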
High End Computing with K42
Paul H. Hargrove and Katherine Yelick, Lawrence Berkeley National Lab; Angela Demke Brown and Michael Stumm, University of Toronto; Patrick Bridges, University of New Mexico; Orran Krieger and Dilma Da Silva, IBM

K42 Project Motivation
• The HECRTF and FastOS reports enumerate unmet needs in the area of operating systems for HEC, including:
– Availability of research frameworks
– Support for architectural innovation
– Performance visibility
– Ease of use
– Adaptability to application requirements
• This project uses the K42 operating system to address these five needs.

K42 Background
• K42 is a research OS from IBM
– API/ABI compatibility with Linux
– Designed for large 64-bit SMPs
– Extensible object-oriented design: each resource instance gets its own object, and the implementation/policy of an individual instance can be changed at runtime (see the sketch at the end of this section)
– Extensive performance monitoring
– Many traditional OS functions are performed in user-space libraries

What Work Remains? (1 of 2)
• Availability of research frameworks & support for architectural innovation: K42 is already a research platform, used by IBM in its PERCS project (DARPA HPCS) to support architectural innovation. Work remains to expand K42 from SMPs to clusters.
• Performance visibility: the existing facilities are quite extensive. Work remains to use runtime replacement of object implementations to monitor single objects for fine-grained control.

What Work Remains? (2 of 2)
• Ease of use: work remains to make K42 widely available and to bring HEC user environments (e.g., MPI, batch systems) to K42.
• Adaptability to application requirements: runtime replacement of object implementations provides extreme customizability. Work remains to provide implementations appropriate to HEC and to perform automatic dynamic adaptation.
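K42's per-instance, hot-swappable objects can be approximated in C with a per-instance operations table. The sketch below is an analogy, not K42's API (K42 is C++ and swaps implementations safely through its object-translation infrastructure, which this sketch omits): two resource instances share a policy until one of them is switched at runtime.

    /* Per-instance swappable policy, in the spirit of K42's object model. */
    #include <stdio.h>

    struct page_cache;                       /* one resource instance      */
    struct cache_ops {                       /* a swappable implementation */
        const char *name;
        void (*evict)(struct page_cache *self);
    };
    struct page_cache {
        const struct cache_ops *ops;         /* per-instance policy        */
        int pages;
    };

    /* Placeholder policy bodies; real ones would differ in behavior. */
    static void evict_lru(struct page_cache *c)  { c->pages--; }
    static void evict_rand(struct page_cache *c) { c->pages--; }

    static const struct cache_ops lru_ops  = { "lru",    evict_lru  };
    static const struct cache_ops rand_ops = { "random", evict_rand };

    int main(void)
    {
        struct page_cache a = { &lru_ops, 100 }, b = { &lru_ops, 100 };

        a.ops->evict(&a);          /* both instances start out as LRU     */
        b.ops = &rand_ops;         /* "hot swap" the policy for b only    */
        b.ops->evict(&b);          /* instance a is unaffected            */
        printf("a uses %s, b uses %s\n", a.ops->name, b.ops->name);
        return 0;
    }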
MOLAR: Modular Linux and Adaptive Runtime Support for High-End Computing Operating and Runtime Systems
Coordinating Principal Investigator: Stephen L. Scott, ORNL
Principal Investigators: J. Vetter, D.E. Bernholdt, C. Engelmann (ORNL); C. Leangsuksun (Louisiana Tech University); P. Sadayappan (Ohio State University); F. Mueller (North Carolina State University)
Collaborators: A.B. Maccabe (University of New Mexico); C. Nuss, D. Mason (Cray Inc.)

MOLAR Research Goals
• Create a modular and configurable Linux system that allows customized changes based on the requirements of the applications, runtime systems, and cluster management software.
• Build runtime systems that leverage the OS modularity and configurability to improve efficiency, reliability, scalability, and ease of use, and that support legacy and promising programming models.
• Advance computer reliability, availability, and serviceability (RAS) management systems to work cooperatively with the OS/R to identify and preemptively resolve system issues.
• Explore the use of advanced monitoring and adaptation to improve application performance and the predictability of system interruptions.

HEC OS Research Map
(figure: a modular, custom, lightweight HEC Linux OS, spanning kernel design; performance observation; communications and I/O; monitoring; extending/adapting the runtime/OS; root-cause analysis; RAS; high availability; and provided testbeds)

PROBLEM:
• Current OSs and runtime systems (OS/R) are unable to meet the various requirements for running large applications efficiently on future ultra-scale computers.
GOALS:
• Development of a modular and configurable Linux framework.
• Runtime systems that provide seamless coordination between system levels.
• Monitoring and adaptation of the operating system, runtime, and applications.
• Reliability, availability, and serviceability (RAS).
• Efficient system management tools.
IMPACT:
• Enhanced support for, and better understanding of, extremely scalable architectures.
• A proof-of-concept implementation open to community researchers.

MOLAR Crosscut Capability Deployed for RAS
• Monitoring core daemon (a minimal local analogue is sketched at the end of this section)
– service monitor
– resource monitor
– hardware health monitor
• Head nodes: active / hot standby
• Services: active / hot standby
• Modular Linux systems deployment & development

MOLAR Federated System Management (fSM)
• fSM emphasizes simplicity
– self-build
– self-configuration
– self-healing
– simplified operation
• Expand MOLAR support
– Investigate specialized architectures
– Investigate other environments & OSs
• Head nodes: active / active
• Services: active / active
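A minimal local analogue of a self-healing service monitor, assuming POSIX: a supervisor starts a service process, waits for it to die, and restarts it. MOLAR's RAS core is distributed, with active/hot-standby head nodes and hardware health checks; this sketch shows only the health-check/restart loop, and the monitored "service" is a stand-in sleep process.

    /* Toy self-healing supervisor: restart the service when it dies. */
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static pid_t start_service(void)
    {
        pid_t pid = fork();
        if (pid == 0) {                    /* child: the monitored service */
            execlp("sleep", "sleep", "5", (char *)NULL);   /* stand-in     */
            _exit(127);                    /* exec failed                  */
        }
        return pid;
    }

    int main(void)
    {
        for (int restarts = 0; restarts < 3; restarts++) {
            pid_t pid = start_service();
            printf("service running as pid %d\n", (int)pid);
            int status;
            waitpid(pid, &status, 0);      /* health check: block on death */
            fprintf(stderr, "service died (status %d), restarting\n", status);
        }
        return 0;
    }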
Peta-Scale Single-System Image
A framework for a single-system-image Linux environment for 100,000+ processors and multiple architectures.
Coordinating Investigator: R. Scott Studham, ORNL
Principal Investigators: Alan Cox, Rice University; Bruce Walker, HP
Investigators: Peter Druschel, Rice University; Scott Rixner, Rice University
Collaborators: Peter Braam, CFS; Steve Reinhardt, SGI; Stephen Wheat, Intel

Key Objectives
• OpenSSI to 10,000 nodes
• Integration of OpenSSI with nodes with high processor counts
• Scalability of a shared root filesystem to 10,000 nodes
• Scalable booting and monitoring mechanisms
• Research enhancements to OpenSSI's P2P communications
• Use of very large page sizes (superpages) for large address spaces (see the sketch at the end of this section)
• Determining the proper interconnect balance as it impacts the operating system (OS)
• Establishing system-wide tools and process management for a 100,000-processor environment
• OS noise (services that interrupt computation) effects
• Integrating a job scheduler with the OS
• Preemptive task migration

Reduce OS Noise and Increase Cluster Scalability via Efficient Compute Nodes
(figure: service-node and compute-node software stacks built from CLMS, LVS, Vproc, DLM, the Lustre client, ICS, and MPI, covering install and sysadmin, boot and init, devices, IPC, application monitoring and restart, HA, resource management and job scheduling, and process load leveling)
• Service nodes: single install; local boot (for HA); single IP (LVS); connection load balancing (LVS); single root with HA (Lustre); single file-system namespace (Lustre); single IPC namespace; single process space and process load leveling; application HA; strong/strict membership.
• Compute nodes: single install; network or local boot; not part of the single IP, and no connection load balancing; single root with caching (Lustre); single file-system namespace (Lustre); no single IPC namespace (optional); single process space but no process load leveling; no HA participation; scalable (relaxed) membership; inter-node communication channels on demand only.

Researching the Intersection of SSI and Large Kernels to Get to 100,000+ Processors
(figure: one axis runs from a stock Linux kernel on 1 CPU to a single Linux kernel on 2048 CPUs; the other from 1 node to software SSI clusters of 10,000 nodes)
1) Establish scalability baselines: continue SGI's work on single-kernel scalability and OpenSSI's work on typical SSI environments.
2) Enhance the scalability of both approaches.
3) Understand the intersection of both methods: test the combination of large kernels with software SSI to establish the sweet spot for a 100,000-processor Linux.
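On the superpages objective, here is a minimal sketch of requesting one huge page on Linux. A caveat: the MAP_HUGETLB flag shown entered the mainline kernel well after this 2004 survey; at the time, hugetlbfs mounts or research superpage kernels provided the equivalent, so treat this as a present-day illustration of the idea.

    /* Map one 2 MiB huge page so a large region costs a single TLB entry.
     * Requires huge pages reserved by the administrator, e.g.
     *     echo 64 > /proc/sys/vm/nr_hugepages                             */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 2UL * 1024 * 1024;      /* one x86-64 huge page       */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {               /* fails if none are reserved */
            perror("mmap(MAP_HUGETLB)");
            return 1;
        }
        memset(p, 0, len);                   /* touch the whole region     */
        printf("huge-page mapping at %p\n", p);
        munmap(p, len);
        return 0;
    }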
Right-Weight Kernels
The right kernel, in the right place, at the right time.

OS Effect on Parallel Applications
• Simple problem: if all processors save one arrive at a join, then all wait for the laggard [Mraz SC '94]
– Mraz resolved the problem for AIX, interestingly, with purely local scheduling decisions (i.e., no global scheduler)
– Sandia resolved it by getting rid of the OS entirely (i.e., the creation of the "light-weight kernel")
• AIX has more capability than many apps need
• The LWK has less capability than many apps want

Hence Right-Weight Kernels
• Customize the kernel to the app
• We're looking at two different approaches
– Customized, modular Linux: based on 2.6, with some scheduling enhancements
– "COTS" secure LWK: based, after some searching, on Plan 9, with some performance enhancements

Balancing Capability and Overhead
(figure: a spectrum running from full OSes such as AIX, Tru64, Solaris, and Linux, with increasing per-node capability, down to no OS at all, with decreasing OS impact on the app)
• We need to balance the capabilities a full OS gives the user against the overhead of providing such services.
• For a given app, we want to be as close to the "optimal" balance as possible.
• But how do we measure what that is?

Measuring What Is "Good"
• OS activity is periodic, so we need techniques such as time-series analysis to evaluate the measured data, and to use those data to figure out what is "good" and "bad".
• Caveat: you must practice good sampling hygiene [Sottile & Minnich, Cluster '04]
– You must follow the rules of statistical sampling
– Measuring work per unit of time yields statistically sound data (see the sketch at the end of this section)
– Measuring time per unit of work yields meaningless data

Conclusions
• Use sound statistical measurement techniques to figure out what is "good"
• Configure compute nodes on a per-app basis (Right-Weight Kernel)
• Rinse and repeat!
• Collaborators
– Sung-Eun Choi, Matt Sottile, Erik Hendriks (LANL)
– Eric Grosse, Jim McKie, Vic Zandy (Bell Labs)
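The work-per-unit-of-time rule can be illustrated with a fixed-time-quantum probe in the spirit of Sottile and Minnich's FTQ benchmark (an illustrative sketch, not the published code): count how much unit work completes in each fixed interval, then look for periodic dips, which expose OS interference.

    /* Fixed-time-quantum sketch: count completed work units per interval.
     * Sampling work-per-time keeps the samples statistically sound;
     * periodic dips in 'count' reveal OS noise (daemons, ticks, ...). */
    #include <stdio.h>
    #include <time.h>

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        const double quantum = 0.001;          /* 1 ms sampling interval   */
        double t = now();
        for (int sample = 0; sample < 1000; sample++) {
            double end = t + quantum;
            long count = 0;
            while (now() < end) {
                volatile long x = 0;           /* one unit of work         */
                for (int i = 0; i < 100; i++)
                    x += i;
                count++;
            }
            printf("%d %ld\n", sample, count); /* plot count vs. sample    */
            t = end;
        }
        return 0;
    }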
SFT: Scalable Fault-Tolerant Runtime and Operating Systems
Pacific Northwest National Laboratory, Los Alamos National Laboratory, University of Illinois, Quadrics

SFT Team
• Jarek Nieplocha, PNNL
• Fabrizio Petrini and Kei Davis, LANL
• Josep Torrellas and Yuanyuan Zhou, UIUC
• David Addison, Quadrics
• Industrial partner: Stephen Wheat, Intel

SFT Motivation
• With the massive number of components comprising the forthcoming petascale computer systems, hardware failures will be routinely encountered during the execution of large-scale applications.
• Application drivers: the multidisciplinary, multiresolution, and multiscale nature of scientific problems drives the demand for high-end systems, and applications place increasingly differing demands on system resources: disk, network, memory, and CPU.
• Therefore, it will not be cost-effective or practical to rely on a single fault-tolerance approach for all applications.

SFT Goals
• Develop scalable and practical techniques for addressing fault tolerance at the operating system and runtime levels
– Design based on the requirements of DOE applications
– Minimal impact on application performance

SFT Petaflop Architecture
(figure: tightly coupled nodes, each containing processors and memories, joined by an interconnection network; memory is globally addressable but non-coherent between nodes)

SFT Scope
• We will investigate, develop, and evaluate a comprehensive range of techniques for fault tolerance (the dirty-page mechanism behind incremental checkpointing is sketched at the end of this section):
– System-level incremental checkpointing: based on Buffered CoScheduling; temporal and spatial hybrid checkpointing; in-memory checkpointing and efficient handling of I/O
– Fault awareness in communication libraries (MPI, ARMCI), while exploiting high-performance network communication and preserving scalability
– Feasibility analysis of incremental checkpointing

SFT Buffered CoScheduling
(figure omitted)
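Incremental checkpointing rests on finding the pages dirtied since the last checkpoint. The classic user-level technique, write-protecting memory and catching the first write to each page in a signal handler, is sketched below; it only illustrates the "incremental" idea and is not SFT's system-level implementation.

    /* Dirty-page tracking for incremental checkpoints: write-protect the
     * region, record each page on its first write, then save only those
     * pages at checkpoint time.  Assumes 4 KiB pages for brevity. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define PAGES 16
    #define PGSZ  4096

    static char *region;
    static int dirty[PAGES];

    static void on_write(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        size_t pg = (size_t)((char *)si->si_addr - region) / PGSZ;
        dirty[pg] = 1;                       /* remember: page modified    */
        mprotect(region + pg * PGSZ, PGSZ, PROT_READ | PROT_WRITE);
    }                                        /* faulting write is retried  */

    int main(void)
    {
        region = mmap(NULL, PAGES * PGSZ, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct sigaction sa = { .sa_sigaction = on_write,
                                .sa_flags = SA_SIGINFO };
        sigaction(SIGSEGV, &sa, NULL);

        mprotect(region, PAGES * PGSZ, PROT_READ);  /* start a new epoch   */
        region[3 * PGSZ] = 1;                /* the app touches two pages  */
        region[9 * PGSZ] = 2;

        for (int i = 0; i < PAGES; i++)      /* checkpoint: dirty pages only */
            if (dirty[i])
                printf("would save page %d\n", i);
        return 0;
    }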
SmartApps: Middleware for Adaptive Applications on Reconfigurable Platforms
Lawrence Rauchwerger, http://parasol.tamu.edu/~rwerger/, Parasol Lab, Dept. of Computer Science, Texas A&M

Today: System-Centric Computing
Classic avenues to performance: parallel algorithms, static compiler optimization, OS support, and good architecture. The application (algorithm) goes through development, analysis & optimization, then a static compiler, then the execution system (OS & arch) with its input data.
What's missing?
• Compilers are conservative: no global optimization (intractable in the general case)
• The OS offers generic services
• No matching between application, OS, and HW
• The architecture is generic

Our Approach: SmartApps, Application-Centric Computing
The application controls the whole stack: a compiler combining static and run-time techniques, a run-time system for execution, analysis & optimization, a modular OS, and a reconfigurable architecture. Instance-specific optimization: compiler + OS + architecture + data + feedback.

SmartApps Architecture
(figure: a STAPL application, augmented with runtime techniques by the static STAPL compiler using a database and a predictor & optimizer, becomes compiled code with runtime hooks, i.e., a smart application. At run time it gets runtime information (sample input, system information, etc.), computes an optimal application and RTS + OS configuration, and executes while a predictor & evaluator continuously monitors performance and adapts as necessary: small adaptations tune at run time without recompiling; large adaptations (failure, phase change) recompute the application configuration and/or reconfigure the RTS + OS through a toolbox of adaptive software and an adaptive RTS + OS)

SmartApps Written in STAPL
• STAPL (Standard Template Adaptive Parallel Library):
– A collection of generic parallel algorithms, distributed containers, and a run-time system (RTS)
– Interoperable with sequential programs
– Extensible and composable by the end user
– Shared-object view: no explicit communication
– Distributed objects: no replication/coherence
– A high-productivity environment

The STAPL Programming Environment
(figure: user code sits atop pAlgorithms, pContainers, and pRanges, which run on the RTS + communication library (ARMI) over an interface to the OS (K42) via OpenMP/MPI/pthreads/native primitives)

SmartApps to RTS to OS
• Specialized services from generic OS services: the OS offers one-size-fits-all services; IBM's K42 offers customizable services; we want customized services, but we do not want to write them.
• Interface between SmartApps (RTS) and the OS (K42): vertical integration of scheduling and memory management.

Collaborative Effort
• STAPL (Amato/Rauchwerger)
• STAPL compiler (Rauchwerger/Stroustrup/Quinlan)
• RTS: K42 interface & optimizations (Krieger/Rauchwerger)
• Applications (Amato/Adams/others)
• Validation on DOE extreme hardware: Blue Gene (Moreira), possibly PERCS (Krieger/Sarkar)
• Texas A&M (Parasol, NE) + IBM + LLNL

ZeptoOS: Studying Petascale Operating Systems with Linux
Argonne National Laboratory: Pete Beckman, Bill Gropp, Rusty Lusk, Susan Coghlan, Suravee Suthikulpanit
University of Oregon: Al Malony, Sameer Shende

Observations
• Extremely large systems run an "OS suite": BG/L and Red Storm both have at least 4 different operating-system flavors.
• The trend toward functional decomposition lends itself to a customized, optimized point-solution OS.
• Hierarchical organization requires software to manage topology, call forwarding, and collective operations.

ZeptoOS
• Investigating 4 key areas:
– Linux as an ultra-lightweight kernel: memory management, scheduling efficiency, network
– Collective OS calls: explicit collective behavior may be key (DLLs?)
– OS performance monitoring for hierarchical systems
– Fault tolerance

Linux as a Lightweight Kernel
• What does an OS steal from a selfish CPU application?
• Purpose: a micro-benchmark measuring the CPU cycles actually delivered to the benchmark application
• Helps in understanding the "MPI-reduce problem" and gang-scheduling issues

Collective OS Calls
• Collective message-passing calls have been implemented very efficiently on many architectures
• Collective I/O calls permit scalable, efficient (non-POSIX) file I/O
• Collective OS calls, such as dynamically loading libraries, may provide scalable OS functionality (a sketch of collective library loading closes this document)

Scalable OS Performance Monitoring (U of Oregon)
• TAU provides a framework for scalable performance analysis
• Integrating TAU into hierarchical systems such as BG/L will allow us to explore:
– Instrumentation of lightweight kernels (call forwarding, memory, etc.)
– Intermediate, parallel aggregation of performance data at I/O nodes
– Integration of data from the OS suite

Exploring Faults: Faulty Towers
• "Dial-a-Disaster": modify Linux so we can selectively and predictably break things (memory, MPI/net, kernel, disk, middleware)
• Run user code, middleware, etc. at ultra scale, with faults
• Explore metrics for codes with good "survivability"
• It's not a bug, it's a feature!

Simple Counts
• OSes (4): Linux (6.5), K42 (2), custom (1), Plan 9 (0.5)
• Labs (7): ANL, LANL, ORNL, LBNL, LLNL, PNNL, SNL
• Universities: Caltech, Louisiana Tech, NCSU, Rice, Ohio State, Texas A&M, Toronto, UIUC, UTEP, UNM, U of Chicago, U of Oregon, U of Wisconsin
• Industry: Bell Labs, Cray, HP, IBM, Intel, CFS (Lustre), Quadrics, SGI

Apple Pie
• Open source
• Partnerships: labs, universities, and industry
• Scope: basic research, applied research, development, prototypes, testbed systems, and deployment
• Structure: "don't choose a winner too early"
– Current or near-term problems: commonly used, open-source OSes (e.g., Linux or FreeBSD)
– Prototyping work in K42 and Plan 9
– At least one wacko project (explore novel ideas that don't fit into an existing framework)

A Bit More Interesting
• Virtualization: Colony
• Adaptability: DAiSES, K42, MOLAR, SmartApps; Config, RWK
• Usage model & system management (OS suites): Colony, Config, MOLAR, Peta-Scale SSI, Zepto
• Metrics & measurement: HPC Challenge (http://icl.cs.utk.edu/hpcc/); DAiSES, K42, MOLAR, RWK, Zepto
• Fault handling: Colony, MOLAR, Scalable FT, Zepto
• Managing the memory hierarchy
• Security
• Common API: K42, Linux
• Single system image: Peta-Scale SSI
• Collective runtime: Zepto
• I/O: Peta-Scale SSI
• OS noise: Colony, Peta-Scale SSI, RWK, Zepto

Application Driven
• Meet the application developers
– OS presentations
– Apps people panic: "what are you doing to my machine?"
– OS people tell 'em what we heard
– Apps people tell us what we didn't understand
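As promised in the ZeptoOS section above, a sketch of a collective OS call: collective dynamic library loading, where rank 0 reads the shared object once and broadcasts it so that every node dlopen()s a node-local copy instead of hammering the shared file system. This only illustrates the idea under stated assumptions (an MPI environment, a hypothetical library path, node-local /tmp); ZeptoOS's actual mechanism may differ.

    /* Collective dlopen() sketch: one reader plus a scalable broadcast,
     * then an ordinary dlopen() against node-local storage. */
    #include <dlfcn.h>
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        long len = 0;
        char *buf = NULL;
        if (rank == 0) {                     /* one reader, not N readers  */
            FILE *f = fopen("/shared/libphys.so", "rb");  /* hypothetical  */
            fseek(f, 0, SEEK_END);
            len = ftell(f);
            rewind(f);
            buf = malloc(len);
            fread(buf, 1, len, f);
            fclose(f);
        }
        MPI_Bcast(&len, 1, MPI_LONG, 0, MPI_COMM_WORLD);
        if (rank != 0)
            buf = malloc(len);
        MPI_Bcast(buf, len, MPI_BYTE, 0, MPI_COMM_WORLD); /* scalable fan-out */

        char local[64];                      /* stage to node-local storage */
        snprintf(local, sizeof local, "/tmp/libphys.%d.so", rank);
        FILE *out = fopen(local, "wb");
        fwrite(buf, 1, len, out);
        fclose(out);

        void *h = dlopen(local, RTLD_NOW);   /* ordinary dlopen, local I/O  */
        printf("rank %d: dlopen %s\n", rank, h ? "ok" : dlerror());

        MPI_Finalize();
        return 0;
    }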