Toward Highly Available, Self-Healing, Adaptable, Grid

Cluster Operating System Support for Parallel Autonomic Computing

Andrzej M. Goscinski, J. Silcock, M. Hobbs
School of Information Technology, Deakin University, Geelong, Vic 3217, Australia
COSET’2004, June 2004

A Need for More than Execution Performance
• Performance is a critical assessment criterion
• Security, reliability, and ease of programming are neglected
• Furthermore
  – Parallel computers are seen as user-unfriendly
  – Parallel processing is not used on a daily basis
  – Ordinary users have to be involved in programming activities that are of an operating system nature
  – Ordinary engineers, managers, etc. do not have, and should not have, the specialized knowledge needed to program operating-system-oriented activities

Aim of Our Research
• IBM has launched a comprehensive program
  – “to re-examine an obsession with faster, smaller, and more powerful”
  – “to look at the evolution of computing from a more holistic perspective”
• IBM’s Autonomic Computing is one of the Grand Challenges
• Parallel processing on non-dedicated clusters could benefit from the Autonomic Computing vision
• Aim: to show a general design of services, and an initial implementation of a system, that moves parallel processing on clusters into the computing mainstream using the Autonomic Computing vision

IBM’s Autonomic Computing
• The name “autonomic” has not caught on everywhere, if only because it’s IBM’s
  – Microsoft prefers “trustworthy”
  – Others prefer the more generic “self-managing”
• Many see “autonomic computing” as one of the basic parts of a revolutionary technology that
  – Will start the new .com boom
  – Will move parallel computing on clusters to the computing mainstream
• Characteristics of autonomic computing systems. An autonomic system
  – knows itself
  – configures and reconfigures itself under varying and unpredictable conditions
  – optimizes its working
  – performs something akin to healing
  – provides self-protection
  – knows its surrounding environment
  – exists in an open (non-hermetic) environment
  – anticipates the optimized resources needed while keeping its complexity hidden

Related Work
• A number of projects related to Autonomic Computing are mentioned on the IBM website
• While many of the reported projects engage in some aspects of Autonomic Computing, none engages in research to develop a system that has all eight of the required characteristics
• None of the projects addresses parallel processing, in particular parallel processing on non-dedicated clusters

Design of Autonomic Elements (Services) Providing Autonomic Computing on Non-dedicated Clusters
• We have proposed and designed a set of autonomic elements that must be provided to develop an autonomic computing environment on a non-dedicated cluster
• Three component levels
  – Services
  – Computers
  – Non-dedicated cluster
• Note: we have not addressed
  – Hardware aspects
  – Administration aspects

Cluster Knows Itself
• A need for resource discovery
• This autonomic element runs on each computer
• Activities
  – Acquires knowledge of static parameters of computers
    • processor type (e.g., speed)
    • memory size
    • available software
  – Acquires knowledge of dynamic parameters of clusters
    • computers’ load
    • available memory
    • communication pattern and volume

Resource Discovery Service Design
[Figure: Computer i and Computer j each run a Resource Discovery element alongside a CPU, Main Memory, and Computation elements 1 and 2. Labels: Computational Load & Parameters; Communication Pattern & Load; Local Communication Load; Remote Communication Load.]

Cluster Configures and Reconfigures Itself under Varying and Unpredictable Conditions
• In a non-dedicated cluster there are times when
  – Some computers are lightly loaded or idle
  – Some computers cannot be used because
    • their owners removed them from the shared pool of resources
    • they are heavily loaded
• To offer high availability, i.e., to configure and reconfigure itself, the system
  – Forms parallel virtual clusters adaptively and dynamically
  – Bases their formation on load and changing resources

Availability Service Design
[Figure: the Availability Services sit above a set of computers, each labeled RD, and group them into a virtual parallel cluster whose membership changes over time: Virtual Parallel Cluster (t0), (t1), (t2), (t3), where t0 < t1 < t2 < t3.]

Cluster Should Optimize Its Working
• Application computation elements should be placed optimally
• To improve performance, placement needs to take into account
  – Computation load
  – Available memory
  – Communication costs
• To optimize the cluster’s working there is
  – Static allocation and load balancing
  – Ability to change performance indices that reflect user objectives
  – Computation element migration, creation and duplication
  – Setting of computation priorities of applications

High Performance Service Design
[Figure: the Global Scheduler works with the Availability Services over a Virtual Parallel Cluster of computers C1, C2, C3, ..., Cn hosting processes P1, P2, Pi, Pj.
  Static Allocation {where: P1 → C1, P2 → C2, ..., {Pi, Pj} → Cn}
  Load Balancing {where, which, when: Pi: Cn → C3}, shown as a Migration of Pi from Cn to C3.]

Cluster Should Perform Something Akin To Healing
• Hardware and software faults can occur
• Failures lead to the termination of computations
• To provide something akin to healing
  – Faults are identified and reported
  – Checkpointing of the parallel computation elements of applications is provided
  – Recovery from failures is employed
  – Migration of applications from faulty computers to healthy computers is carried out automatically
  – Redundant/replicated services are provided

Self-Healing Service Design
[Figure: computers C1, C2, Cj, and Ck. Labels: Checkpointing (coordinated); Computation Element i; Checkpoint for Computation Element i (shown on computers and on Disk); Recovery; Computation Element i after crash recovery.]
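
The checkpoint-and-recover idea in the self-healing design can be illustrated with a minimal Python sketch. It is not the Holos implementation: `ComputationElement`, `CheckpointStore`, and the in-memory snapshot are hypothetical names chosen for illustration, and a real coordinated checkpoint would also have to quiesce in-flight messages and write to disk or to another computer.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ComputationElement:
    """Hypothetical stand-in for one computation element of a parallel application."""
    name: str
    computer: str
    state: dict = field(default_factory=dict)


class CheckpointStore:
    """Holds the most recent checkpoint round (in memory here, for illustration only)."""

    def __init__(self):
        self._snapshot = {}

    def checkpoint_all(self, elements):
        # Coordinated in the simplest sense: every element is saved in the
        # same round, so the saved states belong to one consistent cut.
        self._snapshot = {e.name: copy.deepcopy(e.state) for e in elements}

    def recover(self, name, healthy_computer):
        # Recreate a crashed element from its last checkpoint on another computer.
        saved = copy.deepcopy(self._snapshot[name])
        return ComputationElement(name, healthy_computer, saved)


elements = [
    ComputationElement("elem_i", "C1", {"step": 10}),
    ComputationElement("elem_j", "C2", {"step": 10}),
]
store = CheckpointStore()
store.checkpoint_all(elements)

# Work done after the checkpoint is lost when C1 crashes.
elements[0].state["step"] = 13
restored = store.recover("elem_i", "Ck")
print(restored)
```

The restored element resumes from the checkpointed state on the healthy computer, which is the combination of recovery and migration that the healing bullets describe.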
Clusters Should Provide Self-Protection
• Computation elements of parallel applications are distributed
• Computation elements communicate using messages
• They are subject to passive and active attacks
• To provide self-protection
  – Virus detection and recovery must be offered
  – Resource protection should be a mandatory service
  – Encryption, as a countermeasure against passive attacks, should be used
  – Authentication, as a countermeasure against active attacks, should be used

To Allow a System to Know Its Surrounding Environment and to Prevent a System From Existing in a Hermetic Environment
• There are applications that require
  – More computation power
  – Specialized software
  – Unique peripheral devices, etc.
• Many owners cannot afford such resources
• Some owners can offer their services and resources to appropriate users
• To benefit from existing unique resources
  – Resource discovery of other clusters is provided
  – Advertising of services is in place
  – Systems are able to cooperate
  – Negotiation is in use
  – Brokerage of resources and services is used
  – Resources are shared in a distributed manner
  – “The move toward a grid” should be in place

Grid-like Service Design
[Figure: Cluster 1, Cluster 2, Cluster 3, ..., Cluster n, each with Brokerage Services. Labels: Advertisement; Exporting Services; Withdrawal Services; Import Requests; Computational Services; Storage/Memory Services; Printer Services; Information Services.]

A Cluster Should Anticipate the Optimized Resources Needed While Keeping Its Complexity Hidden
• The scarcity of software to assist ordinary programmers limits the harnessing of the computing power of non-dedicated clusters
• This implies
  – A programming environment that is simple to use
  – No need for knowledge of resource distribution
  – Message passing and shared memory programming supported transparently

Easy Programming Service Design
[Figure: layered design. Labels: Programming Environment, offering Message Passing or PVM / MPI (Communication Primitives) and Shared Memory (DSM); System Services of an Operating System; Kernel Services of an Operating System.]

The Holos Services for Autonomic Computing Clusters
• Holos is built to demonstrate that it is possible to develop an autonomic non-dedicated cluster that
  – could be routinely employed by ordinary engineers, managers, etc.
  – is able to support next-generation application software executing on clusters
• We followed the recommendations of IBM’s vision regarding autonomic elements
• We decided to view autonomic elements as processes
  – Each computer is a multi-process system with its objectives
  – A cluster is a set of multi-process systems with its objectives

Holos
• Holos was developed based on the P2P and microkernel paradigms
• The microkernel provides services such as
  – local IPC
  – basic paging operations
  – interrupt handling
  – context switching
• Three groups of processes
  – kernel servers
  – system servers
  – application processes
• Kernel and system servers are stationary; application processes are mobile
• All processes communicate using messages
[Figure: layered architecture. Parallel Processes: MP / PVM / MPI Process, DSM Process. System Servers: Brokerage Server, Global Scheduler, Execution Server, Migration Server, Checkpoint Server, Resource Discovery Server, DSM Server. Kernel Servers: IPC Server, Process Manage Server, Space Manage Server. GENESIS Microkernel.]

System Servers Form a Basis of an Autonomic Operating System for Non-dedicated Clusters
• Resource Discovery Server – collects data about computation and communication load
• Availability Server – dynamically and adaptively forms a parallel virtual cluster for the application
• Global Scheduling Server – maps application processes onto the computers of the virtual parallel cluster using static allocation and dynamic load balancing
• Execution Server – coordinates the single, multiple and group creation and duplication of application processes on both local and remote computers
• Migration Server – coordinates moving application processes to other computers
• DSM Server – hides the distributed nature of the cluster’s memory and allows code to be written as though using physically shared memory
• Checkpoint Server – coordinates creation of checkpoints for an executing application
• Fault Recovery Server – recovers application processes / applications using checkpoints
• IAC Server – supports remote interprocess communication and group communication within sets of application processes
• Brokerage Server – supports advertising and sharing of services through service exporting, importing and revoking

Holos Possesses the Autonomic Computing Characteristics
Autonomic Computing Requirement: Cooperating Holos Servers (Relationships Among Autonomic Elements)
• To allow a system to know itself: Resource Discovery Server
• A system must configure and reconfigure itself under varying and unpredictable conditions: Resource Discovery Server, Global Scheduling Server, Migration Server, Execution Server, and Availability Server
• A system must optimize its working: Global Scheduling Server, Migration Server, and Execution Server
• A system must perform something akin to healing: Checkpoint Server, Recovery Server, Migration Server, Global Scheduling Server
• A system must provide self-protection: Capabilities in the form of System Names
• A system must know its surrounding environment: Resource Discovery Server and Brokerage Server
• A system cannot exist in a hermetic environment: Interprocess Communication Server and Brokerage Server
• A system must anticipate the optimized resources needed while keeping its complexity hidden (most critical for the user): DSM Server, Execution Server, DSM Programming Environment, Message Passing Programming Environment, PVM/MPI Programming Environment

Conclusion
• Autonomic computing has been shown to be a basic part of a revolutionary technology that
  – Could move parallel computing on non-dedicated clusters to the computing mainstream
  – (Whether it will start the new .com boom remains to be shown)
• The development of the Holos cluster operating system demonstrates that it is possible to build an autonomic non-dedicated cluster
• The Holos cluster operating system has been built from scratch
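
The exporting, importing and revoking of services attributed to the Brokerage Server can be pictured with a small Python sketch. This is an illustration under assumptions, not the Holos interface: `BrokerageRegistry` and its method names are hypothetical, and the single in-process dictionary stands in for what would be a distributed service spanning clusters.

```python
class BrokerageRegistry:
    """Hypothetical registry of services that clusters advertise to one another."""

    def __init__(self):
        # Maps a service name to the cluster that exported it.
        self._exports = {}

    def export_service(self, name, cluster):
        # Advertise a service so that other clusters can find it.
        self._exports[name] = cluster

    def import_service(self, name):
        # Look up which cluster offers the service; None if nobody does.
        return self._exports.get(name)

    def revoke_service(self, name, cluster):
        # Only the exporting cluster may withdraw its own service.
        if self._exports.get(name) == cluster:
            del self._exports[name]
            return True
        return False


registry = BrokerageRegistry()
registry.export_service("printer", "Cluster 1")
registry.export_service("storage", "Cluster 2")

provider = registry.import_service("printer")
denied = registry.revoke_service("printer", "Cluster 3")
withdrawn = registry.revoke_service("printer", "Cluster 1")
print(provider, denied, withdrawn, registry.import_service("printer"))
```

Restricting withdrawal to the exporter is a design choice made for this sketch; the slides do not say how Holos authorizes revocation.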
