Cluster Operating System Support for Parallel Autonomic Computing
Andrzej M. Goscinski, J. Silcock, M. Hobbs
School of Information Technology
Deakin University
Geelong, Vic 3217, Australia
A Need for More than Execution Performance
• Performance is a critical assessment criterion
• Security, reliability, and ease of programming are neglected
• Furthermore
  – Parallel computers are seen as user unfriendly
  – Parallel processing is not used on a daily basis
  – Ordinary users have to be involved in programming activities that are of an operating system nature
  – Ordinary engineers, managers, etc. do not have, and should not have, the specialized knowledge needed to program operating-system-oriented activities
June 2004
COSET’2004
Aim of Our Research
• IBM has launched a comprehensive program
  – “to re-examine an obsession with faster, smaller, and more powerful”
  – “to look at the evolution of computing from a more holistic perspective”
• IBM’s Autonomic Computing: one of the Grand Challenges
• Parallel processing on non-dedicated clusters could benefit from the Autonomic Computing vision
• Aim: to show a general design of services and an initial implementation of a system that moves parallel processing on clusters into the computing mainstream using the Autonomic Computing vision
IBM’s Autonomic Computing
• The name “autonomic” has not caught on everywhere, if only because it is IBM’s
  – Microsoft prefers “trustworthy”
  – Others prefer the more generic “self-managing”
• Many see “autonomic computing” as one of the basic parts of a revolutionary technology that
  – Will start the new .com boom
  – Will move parallel computing on clusters to the computing mainstream
IBM’s Autonomic Computing
• An autonomic computing system has the following characteristics: it
  – knows itself
  – configures and reconfigures itself under varying and unpredictable conditions
  – optimizes its working
  – performs something akin to healing
  – provides self-protection
  – knows its surrounding environment
  – exists in an open (non-hermetic) environment
  – anticipates the optimized resources needed while keeping its complexity hidden
Related Work
• A number of projects related to Autonomic Computing are listed on the IBM website
• While many of the reported projects address some aspects of Autonomic Computing, none aims to develop a system that has all eight of the required characteristics
• None of the projects addresses parallel processing, in particular parallel processing on non-dedicated clusters
Design of Autonomic Elements (Services) Providing Autonomic Computing on Non-dedicated Clusters
• We have proposed and designed a set of autonomic elements that must be provided to develop an autonomic computing environment on a non-dedicated cluster
• Three component levels
  – Services
  – Computers
  – Non-dedicated cluster
• Note: we have not addressed
  – Hardware aspects
  – Administration aspects
Cluster Knows Itself
• A need for resource discovery
• This autonomic element runs on each computer
• Activities
  – Acquires knowledge of static parameters of computers
    · processor type (e.g., speed)
    · memory size
    · available software
  – Acquires knowledge of dynamic parameters of the cluster
    · computers’ load
    · available memory
    · communication pattern and volume
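The static and dynamic parameters gathered by a resource discovery element on each computer can be illustrated with a minimal Python sketch. This is not the Holos implementation: the names `StaticInfo`, `DynamicInfo`, `collect_static`, and `collect_dynamic` are hypothetical, the probed program names are arbitrary examples, and the communication volume is a caller-supplied counter because the standard library has no portable way to measure it.

```python
import os
import platform
import shutil
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StaticInfo:
    processor: str               # machine type reported by the platform
    cpu_count: int               # number of logical CPUs
    memory_bytes: Optional[int]  # physical memory, None where it cannot be queried
    software: List[str]          # probed programs that were found on PATH


@dataclass
class DynamicInfo:
    load_1min: Optional[float]   # 1-minute load average, None where unsupported
    bytes_sent: int              # communication volume (hypothetical counter)


def physical_memory_bytes() -> Optional[int]:
    """Return physical memory size on systems that expose it through sysconf."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None


def collect_static(probe=("python3", "gcc", "mpirun")) -> StaticInfo:
    """Gather parameters that do not change while the computer is running."""
    found = [name for name in probe if shutil.which(name)]
    return StaticInfo(platform.machine(), os.cpu_count() or 1,
                      physical_memory_bytes(), found)


def collect_dynamic(bytes_sent: int = 0) -> DynamicInfo:
    """Gather parameters that must be sampled repeatedly."""
    try:
        load = os.getloadavg()[0]
    except (AttributeError, OSError):
        load = None
    return DynamicInfo(load, bytes_sent)
```

In such a design, `collect_static` would typically run once at start-up, while `collect_dynamic` would be called periodically so that other services see current load figures.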
Resource Discovery Service Design
[Figure: Computer i and Computer j each run a Resource Discovery element over their CPU, main memory, and computation elements 1 and 2. Labels: Computational Load & Parameters; Communication Pattern & Load; Local Communication Load; Remote Communication Load.]
Cluster Configures and Reconfigures Itself under Varying and Unpredictable Conditions
• In a non-dedicated cluster there are times when
  – Some computers are lightly loaded or idle
  – Some computers cannot be used because
    · their owners have removed them from the shared pool of resources
    · they are heavily loaded
• To offer high availability, i.e., to configure and reconfigure itself, the system
  – Forms parallel virtual clusters adaptively and dynamically
  – Bases their formation on load and changing resources
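Forming a virtual cluster from load and availability information can be sketched as a simple filter over the computers in the shared pool. The `Computer` record, the `form_virtual_cluster` function, and the `max_load` threshold are assumptions made for illustration; the slides do not state the selection policy that Holos uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Computer:
    name: str
    load: float       # current load, as reported by resource discovery
    available: bool   # False when the owner has withdrawn the machine


def form_virtual_cluster(computers, max_load=0.5, size=None):
    """Select available, lightly loaded computers, least loaded first."""
    usable = [c for c in computers if c.available and c.load <= max_load]
    usable.sort(key=lambda c: (c.load, c.name))
    return usable if size is None else usable[:size]


# The same pool observed at two times: the virtual cluster changes with it.
t0 = [Computer("A", 0.1, True), Computer("B", 0.9, True), Computer("C", 0.2, True)]
t1 = [Computer("A", 0.1, False), Computer("B", 0.3, True), Computer("C", 0.2, True)]
print([c.name for c in form_virtual_cluster(t0)])  # ['A', 'C']
print([c.name for c in form_virtual_cluster(t1)])  # ['C', 'B']
```

Re-running the selection whenever resource discovery reports a change gives the adaptive behaviour described above: computer A leaves the cluster when its owner withdraws it, and B joins once its load drops.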
Availability Service Design
[Figure: the Availability Services form Virtual Parallel Cluster (t0), (t1), (t2), and (t3) over computers that each run a Resource Discovery (RD) element, where times t0 < t1 < t2 < t3.]
Cluster Should Optimize Its Working
• Application computation elements should be placed optimally
• To improve performance, placement needs knowledge of
  – Computation load
  – Available memory
  – Communication costs
• To optimize the cluster’s working there is
  – Static allocation and load balancing
  – Ability to change performance indices that reflect user objectives
  – Computation element migration, creation and duplication
  – Setting of computation priorities of applications
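Static allocation followed by load balancing can be sketched with a greedy placement and a single migration step. Both heuristics are stand-ins chosen for brevity, not the scheduling policy of Holos, and the function names and the `threshold` parameter are hypothetical. The sketch considers computation load only and ignores memory and communication costs.

```python
def static_allocation(process_costs, computers):
    """Place the heaviest processes first, each on the least loaded computer."""
    placement = {c: [] for c in computers}
    load = {c: 0.0 for c in computers}
    ordered = sorted(process_costs.items(), key=lambda kv: (-kv[1], kv[0]))
    for proc, cost in ordered:
        target = min(computers, key=lambda c: load[c])
        placement[target].append(proc)
        load[target] += cost
    return placement, load


def balance_step(placement, load, process_costs, threshold=1.0):
    """Decide when, which, and where: migrate one process if loads differ too much.

    Returns (process, source, destination), or None when no migration helps.
    """
    src = max(load, key=load.get)
    dst = min(load, key=load.get)
    gap = load[src] - load[dst]
    if gap <= threshold:
        return None
    # Only a process cheaper than the gap reduces the imbalance when moved.
    candidates = [p for p in placement[src] if process_costs[p] < gap]
    if not candidates:
        return None
    # Prefer the process whose cost is closest to half the gap.
    proc = min(candidates, key=lambda p: abs(process_costs[p] - gap / 2))
    placement[src].remove(proc)
    placement[dst].append(proc)
    load[src] -= process_costs[proc]
    load[dst] += process_costs[proc]
    return proc, src, dst
```

A scheduler built this way would call `static_allocation` once when the application starts and `balance_step` whenever new load information arrives.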
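High Performance Service Design
[Figure: the Global Scheduler, together with the Availability Services, manages a Virtual Parallel Cluster of computers C1, C2, C3, and Cn running processes P1, P2, Pi, and Pj. Static Allocation: { where: P1 → C1, P2 → C2, ………, {Pi, Pj} → Cn }. Load Balancing: { where, which, when: Pi : Cn → C3 }, shown as the migration of Pi from Cn to C3.]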
Cluster Should Perform Something Akin to Healing
• Hardware and software faults can occur
• Failures lead to the termination of computations
• To provide something akin to healing
  – Faults are identified and reported
  – Checkpointing of the parallel computation elements of applications is provided
  – Recovery from failures is employed
  – Migration of applications from faulty computers to healthy computers is carried out automatically
  – Redundant/replicated services are provided
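Checkpointing and recovery of a single computation element can be sketched as follows. This is a simplified, single-process illustration: it does not coordinate checkpoints across several elements, it stores them in a local directory only, and the function names and file naming scheme are hypothetical.

```python
import os
import pickle
import tempfile


def checkpoint_path(directory, element_id):
    return os.path.join(directory, f"element-{element_id}.ckpt")


def save_checkpoint(directory, element_id, state):
    """Write to a temporary file, then rename, so a crash never leaves a partial checkpoint."""
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, checkpoint_path(directory, element_id))


def recover(directory, element_id, initial_state):
    """Return the last checkpointed state, or the initial state if none exists."""
    try:
        with open(checkpoint_path(directory, element_id), "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return initial_state


def run(directory, element_id, steps, crash_at=None):
    """Sum 1..steps, checkpointing after every step; optionally simulate a crash."""
    state = recover(directory, element_id, {"step": 0, "total": 0})
    while state["step"] < steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        step = state["step"] + 1
        state = {"step": step, "total": state["total"] + step}
        save_checkpoint(directory, element_id, state)
    return state["total"]
```

Calling `run` again after a simulated crash resumes from the last saved step instead of restarting the computation, which is the essence of recovery from a checkpoint.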
Self-Healing Service Design
[Figure: Checkpointing (coordinated) and Recovery. Labels: computers C1, C2, Cj, and Ck; Computation Element i; Checkpoint for Compute Elem i (shown twice); Checkpoint for Computation Element on Disk; Compute Elem i after crash recovery.]
Clusters Should Provide Self-Protection
• Computation elements of parallel applications are distributed
• Computation elements communicate using messages
• They are subject to passive and active attacks
• To provide self-protection:
  – Virus detection and recovery must be offered
  – Resource protection should be a mandatory service
  – Encryption, as a countermeasure against passive attacks, should be used
  – Authentication, as a countermeasure against active attacks, should be used
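Authentication of messages exchanged between computation elements can be illustrated with a keyed hash (HMAC) from the Python standard library. The sketch assumes that sender and receiver already share a secret key, and the names `seal` and `open_sealed` are hypothetical. It covers authentication only: encryption against passive attacks would need an additional cipher, which is not shown here, and the slides do not say which mechanisms Holos uses.

```python
import hashlib
import hmac

TAG_LEN = hashlib.sha256().digest_size  # length of the authentication tag in bytes


def seal(key: bytes, payload: bytes) -> bytes:
    """Prefix the payload with an HMAC-SHA256 tag computed under the shared key."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def open_sealed(key: bytes, message: bytes) -> bytes:
    """Verify the tag and return the payload; reject modified or forged messages."""
    tag, payload = message[:TAG_LEN], message[TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("message failed authentication")
    return payload
```

A receiver that calls `open_sealed` on every incoming message detects a payload altered in transit, because the recomputed tag no longer matches the one that was sent.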
To Allow a System to Know Its Surrounding Environment and to Prevent a System From Existing in a Hermetic Environment
• There are applications that require
  – More computation power
  – Specialized software
  – Unique peripheral devices, etc.
• Many owners cannot afford such resources
• Some owners can offer their services and resources to appropriate users
To Allow a System to Know Its Surrounding Environment and to Prevent a System From Existing in a Hermetic Environment
• To benefit from existing unique resources
  – Resource discovery of other clusters is provided
  – Advertising of services is in place
  – Systems are able to cooperate
  – Negotiation is in use
  – Brokerage of resources and services is used
  – Resources are shared in a distributed manner
  – “The move toward a grid” should be in place
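Advertising, importing, and withdrawing services between clusters can be sketched with one broker object per cluster. The `Broker` class and its method names are hypothetical, and negotiation between clusters is omitted; the sketch only shows how an exported service becomes visible to peer clusters and disappears again once withdrawn.

```python
class Broker:
    """One broker per cluster; peers are the brokers of other clusters."""

    def __init__(self, cluster_name):
        self.cluster_name = cluster_name
        self.exported = {}  # service name -> description advertised to peers
        self.peers = []

    def connect(self, other):
        """Make two brokers aware of each other."""
        if other not in self.peers:
            self.peers.append(other)
        if self not in other.peers:
            other.peers.append(self)

    def export(self, service, description):
        """Advertise a local service to peer clusters."""
        self.exported[service] = description

    def withdraw(self, service):
        """Stop offering a service; unknown names are ignored."""
        self.exported.pop(service, None)

    def import_service(self, service):
        """Return (cluster name, description) from the first peer offering the service."""
        for peer in self.peers:
            if service in peer.exported:
                return peer.cluster_name, peer.exported[service]
        return None


cluster1, cluster2 = Broker("Cluster 1"), Broker("Cluster 2")
cluster1.connect(cluster2)
cluster2.export("printer", "colour laser printer")
print(cluster1.import_service("printer"))  # ('Cluster 2', 'colour laser printer')
cluster2.withdraw("printer")
print(cluster1.import_service("printer"))  # None
```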
Grid-like Service Design
[Figure: Cluster 1, Cluster 2, Cluster 3, and Cluster n, each with Brokerage Services. Labels: Advertisement; Exporting Services; Withdrawal Services; Import Requests; Computational Services; Storage/Memory Services; Printer Services; Information Services.]
A Cluster Should Anticipate the Optimized Resources Needed While Keeping Its Complexity Hidden
• The scarcity of software to assist ordinary programmers limits the harnessing of the computing power of non-dedicated clusters
• This implies a need for
  – A programming environment that is simple to use
  – No required knowledge of resource distribution
  – Transparent support for message passing and shared memory programming
Easy Programming Service Design
[Figure: layered design. Labels: Programming Environment; Message Passing or PVM / MPI; Communication Primitives; Shared Memory; DSM; System Services of an Operating System; Kernel Services of an Operating System.]
The Holos Services for Autonomic Computing Clusters
• Holos is built to demonstrate that it is possible to develop an autonomic non-dedicated cluster that
  – could be routinely employed by ordinary engineers, managers, etc.
  – is able to support next-generation application software executing on clusters
• We followed the recommendations of IBM’s vision regarding autonomic elements
• We decided to view autonomic elements as processes
  – Each computer is a multi-process system with its own objectives
  – A cluster is a set of multi-process systems with its own objectives
Holos
• Holos was developed based on the P2P and microkernel paradigms
• The microkernel provides services such as
  – local IPC
  – basic paging operations
  – interrupt handling
  – context switching
• Three groups of processes:
  – kernel servers
  – system servers
  – application processes
• Kernel and system servers are stationary, application processes are mobile
• All processes communicate using messages
[Figure: Holos architecture. Labels: GENESIS Microkernel; Kernel Servers (IPC Server, Process Manage Server, Space Manage Server); System Servers (Global Scheduler, Execution Server, Migration Server, Checkpoint Server, Resource Discovery Server, DSM Server, Brokerage Server); Parallel Processes (MP / PVM / MPI Process, DSM Process).]
System Servers Form a Basis of an Autonomic Operating System for Non-dedicated Clusters
• Resource Discovery Server - collects data about computation and communication load
• Availability Server - dynamically and adaptively forms a parallel virtual cluster for the application
• Global Scheduling Server - maps application processes onto the computers of the virtual parallel cluster using static allocation and dynamic load balancing
System Servers Form a Basis of an Autonomic Operating System for Non-dedicated Clusters
• Execution Server - coordinates the single, multiple and group creation and duplication of application processes on both local and remote computers
• Migration Server - coordinates moving application processes to other computers
• DSM Server - hides the distributed nature of the cluster’s memory and allows code to be written as though it used physically shared memory
System Servers Form a Basis of an Autonomic Operating System for Non-dedicated Clusters
• Checkpoint Server - coordinates creation of checkpoints for an executing application
• Fault Recovery Server - recovers application processes / applications using checkpoints
• IAC Server - supports remote interprocess communication and group communication within sets of application processes
• Brokerage Server - supports advertising and sharing services through service exporting, importing and revoking
Holos Possesses the Autonomic Computing Characteristics
Autonomic Computing Requirement | Cooperating Holos Servers (Relationships Among Autonomic Elements)
To allow a system to know itself | Resource Discovery Server
A system must configure and reconfigure itself under varying and unpredictable conditions | Resource Discovery Server, Global Scheduling Server, Migration Server, Execution Server, and Availability Server
A system must optimize its working | Global Scheduling Server, Migration Server, and Execution Server
A system must perform something akin to healing | Checkpoint Server, Recovery Server, Migration Server, and Global Scheduling Server
A system must provide self-protection | Capabilities in the form of System Names
A system must know its surrounding environment | Resource Discovery Server and Brokerage Server
A system cannot exist in a hermetic environment | Interprocess Communication Server and Brokerage Server
A system must anticipate the optimized resources needed while keeping its complexity hidden (most critical for the user) | DSM Server, Execution Server, DSM Programming Environment, Message Passing Programming Environment, and PVM/MPI Programming Environment
Conclusion
• Autonomic computing has been shown to be a basic part of a revolutionary technology that
  – Could move parallel computing on non-dedicated clusters to the computing mainstream
  – (Whether it will start the new .com boom remains to be shown)
• The development of the Holos cluster operating system demonstrates that it is possible to build an autonomic non-dedicated cluster
• The Holos cluster operating system has been built from scratch