Download Cluster Computing

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Recursive InterNetwork Architecture (RINA) wikipedia , lookup

IEEE 802.1aq wikipedia , lookup

Distributed operating system wikipedia , lookup

Zero-configuration networking wikipedia , lookup

CAN bus wikipedia , lookup

Lag wikipedia , lookup

Remote Desktop Services wikipedia , lookup

Computer cluster wikipedia , lookup

Transcript
CS514: Intermediate
Course in Computer
Systems
Lecture 8: Sept. 24, 2003
Cluster Computing
Guest lecture from Werner Vogels
Today:
Cluster Computing
Werner Vogels
Dept. of Computer Science
Cornell University
1
Agenda
CS514
|
|
|
|
|
History of Cluster Computing
Major Cluster Systems
Technical Challenges
Software Architectures
Cluster Management Systems
z
z
|
MSCS
Galaxy
Student Research Projects in Cluster
Computing
What do I want you to
know?
|
|
|
|
|
CS514
Why and how clusters are used?
What is the difference between parallel and
enterprise cluster computing?
What are the major issues in hardware and
software?
What is a Cluster Management System?
What do I need to do to work on a cluster
computing project myself?
2
How to get more done …
CS514
|
|
|
Work Harder
Work Faster
Get Help
|
|
|
Processor Speed
Algorithms
Parallel processing
Some History
CS514
|
|
|
|
|
Von Neuman thought parallelism to be
impossible
ILLIAC IV – first massive parallel machine
(Illinois ’60)
Japan’s 5th Generation Project
USA – Grand Challenges
Commercial: NCR, IBM Fijustu, Intel SSD,
Gray, Convex
3
Traditional Users
CS514
Scientists – investigate the unknown
| Engineers – simulations
| Retailers – data mining
| Airlines – how to overbook
| Financial – gaining 0.1% advantage
| ……
|
The collapse of the
Supercomputing Industry
CS514
’97 the industry icon Cray Research
went bankrupt.
| Many reasons were given, among
which the end of the cold war
|
4
The Real Reasons
CS514
|
Microprocessors got fast, a lot faster
|
High Availability became a mass
market.
Some early examples
CS514
|
Vax-Clusters / OpenVMS Cluster
5
Tandem Himalaya
CS514
|
Traditional for the HA – market
IBM SYSPLEX
CS514
6
Cluster Definition
CS514
|
Consists of a collection of
interconnected whole computers
|
Is used as a single, unified computing
resource
Distinction from Other
Systems
|
|
|
|
CS514
Scaling: Adding a head or a
whole dog
Availability: what if a dog
breaks a leg?
System management:
walking the dog
Software licensing: dog tax
7
Technical Challenges - I
CS514
|
|
|
|
Cluster Hardware (NOW, rack&stack, NUMAs)
Cluster Communication (Interconnects,
Communication Protocols)
Cluster System Middleware (management, availability,
tools)
High-performance IO systems (storage, file systems,
data placement and movement)
Technical Challenges - II
CS514
|
|
|
|
|
Job and Resource Management
Programming Environments (Distr. Objects, Message
Passing)
Scalable Services.
Business frameworks (multi tiers, web based, decision
support)
Applications (Scientific, High-Availability, Scalable
performance)
8
Single System Image
CS514
|
From the perspective of
User
z Network
z Application
z Administrator
z
|
Key Issues:
Each SSI has a boundary
z SSI support can exist at different
levels
z
Single System Image - II
CS514
|
Boundary:
Inside a single machine
z Outside a collection of machines
z
|
SSI Levels
Application
z Middleware
z Operating System
z Hardware
z
9
Is Transparency a good
thing?
CS514
Yes, but achieving it is close to
impossible
| Many transparencies were introduced
in legacy code with disastrous side
effects
| User to cluster is possible
| Server side should be avoided
|
1997 M icrosoft.C om W eb site (4 farm s).
B uilding 11
Sta gin g S ervers
(7)
Intern al W W W
Router
SQ L Reportin g
Live SQ L S erver
ww w.m icros oft.com
(4)
prem
E uropean
Dium
ata.mCicr
e osoft.com
nter
ww w.m icros oft.com (1)
(3)
SQ LN et
SQ L SER VER S
F eeder LAN
SQ
L
C
ons
olidators
(2)
R outerD M Z Stagin g S er vers
F TP
Live SQ L S ervers
Dow nload S erver
(1)
M O SW est
Sw
itched
Adm in LAN
Ethern et
M OSW est
F TP S ervers
D ownload
Replic ation
sn.c om
register.m icros oft.com w ww .m icrosoft.com m sid.m
(1)
(2)
(4)
search.m icros oft.com
(3)
hom e.m icros oft.com
hom e.m icr os oft.com
F D DI R ing
(3)
(4)
(M IS2)
prem ium .m icros oft.com
(2)
activex.m icros oft.com
(2)
F D DI R ing
(M IS1)
cdm .m icr os oft.com
(1)
Router
m sid.m sn.com
(1)
Router
R outer
R outer
pr em ium .m icros oft.com Router
ww w .m icros oft.com (1)
R outer
(3)
F D DI R ing
hom e.m icros oft.com (M IS3) m sid.m sn.c om
(2)
(1)
R outer
R outer
F TP.m icros oft.com
(3)
Prim ary
G igaswitch
Sec on dary
G igaswitch
w ww .m icrosoft.com
(5)
0
register.m icr osoft.com
F DD I Ring
(2)
(M IS4)
register.m icr os oft.com
hom e.m icros oft.com
support.m icr os oft.com
(1)
(5)
register.m sn.com
(2)
(2)
support.m icr os oft.com
search.m icr os oft.com(1)
(3)
CS514
search.m icr os oft.com
(1)
m sid.m sn.c om
(1)
Router
Japan D ata Cww
e nter
SQ L SER VER S
w.m icros oft.com
prem ium .m icros oft.com(3)
(1)
(2)
m sid.m sn.c om
(1)
Sw itched
Ethern et
F TP
D ow nload S erver
(1)
search.m icr os oft.com
H T TP
Dow nload S ervers
(2)
(2)
R outer
2
OC 3
(45M b/Sec Eac h)
2
Ethernet
(100 M b/Sec Eac h)
13
D S3
(45 M b/Sec Eac h)
Internet
10
Two main Cluster Types
CS514
|
RACS – Reliable Array of Cloned Services
Shared Nothing Clones
|
Shared Disk Clones
RAPS – Reliable Array of Partioned Services
Partitions
Packed Partitions
Taxonomy
CS514
Farm
Partition
Clone
Pack
Shared
Nothing
Shared
Disk
Shared
Nothing
ActiveActive
ActivePassive
11
Sample E-Commerce Farm
The FARM: Clones and Packs of Partitions
CS514
Packed Partitions: Database Transparency
SQL Partition 3
Cloned
Packed
file
servers
Web File
SQL Partition 2
Web File StoreA
Web
Clients
SQL
SQLPartition1
Database
SQL Temp State
Load Balance
Windows NT Clusters MSCS
|
|
|
|
|
|
|
CS514
Group of independent systems that appear
as a single system
Managed as a single system
Common namespace
Services are “cluster-wide”
Ability to tolerate component failures
Components can be added transparently to
users
Existing client connectivity is not affected by
clustered applications
12
MSCS Features
CS514
|
Shared nothing
z
|
|
Remoteable tools
Windows NT manageability enhancements
z
|
|
Simplified hardware configuration
Never take a “cluster” down: shell game
rolling upgrade
Microsoft® BackOffice™ product support
Provide clustering solutions for all levels
of customer requirements
z
Eliminate cost and complexity barriers
Non-Features Of MSCS
CS514
|
|
Not lock-step/fault-tolerant
Not able to “move” running applications
z
|
“MSCS” restarts applications that are failed over to
other cluster members
Not able to recover shared state between client and
server (i.e., file position)
z
z
All client/server transactions should
be atomic
Standard client/server development
rules still apply
• Atomic Consistent Isolated Durable (ACID) always wins
13
Basic MSCS Terms
CS514
|
Quorum Resource
z
z
Usually (but not necessarily) a SCSI disk
Requirements:
• Arbitrates for a resource by supporting the
challenge/defense protocol
• Capable of storing cluster registry and logs
z
Used to Persist Configuration Change Logs
• Tracks changes to configuration database when
any defined member missing (not active)
• Prevents configuration partitions in time
• “Temporal Partitions”
“MSCS” Cluster
Client PCs
CS514
Server A
Server B
Heartbeat
Disk cabinet A
Cluster management
Disk cabinet B
14
“MSCS” Namespace
Cluster view
CS514
Cluster name
Node name
Node name
Virtual
server name
Virtual
server name
Virtual
server name
Virtual
server name
MSCS Namespace
Outside world view
CS514
Cluster
Node 1
Node 2
Virtual
server 1
Internet
Information
Server
SQL
IP address:
1.1.1.1
Network
name:
WHECCLUS
IP address:
1.1.1.2
Network
name:
WHECNode1
IP address:
1.1.1.3
Network
name:
WHECNode2
IP address:
1.1.1.4
Network
name:
WHECWHEC-VS1
Virtual
server 2
Virtual
server 3
MTS
“Falcon”
Microsoft
Exchange
IP address:
1.1.1.5
Network
name:
WHECWHEC-VS2
IP address:
1.1.1.6
Network
name:
WHECWHEC-VS3
15
MSCS Architecture
Cluster
API
Cluster administrator
Cluster API DLL
Cluster.ExeCS514
Cluster API DLL
Cluster API stub
Log
Manager
Database
Manager
Event
Processor
Checkpoint
Manager
Failover
Manager
Application
resource DLL
Global
Update
Manager
Object
Manager
Resource
Manager
Resource
monitors
Membership
Manager
Node
Manager
Resource
API
Physical
Logical
Application
resource DLL resource DLL resource DLL
Reliable Cluster
Transport + Heartbeat
Network
MSCS Architecture
|
Cluster service is comprised
of the following objects
z
z
z
z
z
z
z
z
z
CS514
Failover Manager (FM)
Resource Manager (RM)
Node Manager (NM)
Membership Manager (MM)
Event Processor (EVT)
Database Manager (DM)
Object Manager (OM)
Global Update Manager (GUM)
Checkpoint Manager (CM)
16
Cluster Communications
CS514
|
|
|
Most communication via RPC
UDP used for membership heartbeat messages
Standard (e.g., Ethernet) interconnects
Management
apps
RPC
Cluster
Service
RPC
RPC: admin
UDP: Heartbeat
Cluster
Service
RPC
RPC
Resource
Monitors
Resource
Monitors
Resource
Monitors
Resource
Monitors
MSCS Architecture
Failover Manager
CS514
|
Top tier provides
cluster abstractions
|
Middle tier provides
distributed operations
|
Bottom tier is
Windows NT
and drivers
Resource Monitor
Cluster Registry
Global Update
Quorum
Membership
Windows NT Server
Cluster
Disk Driver
Cluster
Net Drivers
17
Membership And Regroup
CS514
|
Membership:
z
|
Used for orderly addition and removal
from { active nodes }
Regroup:
Used for failure detection (via
heartbeat messages)
z Forceful eviction from
{ active nodes }
z
Membership
CS514
|
|
Defined cluster = all nodes
Active cluster:
z
z
Subset of defined cluster
Includes Quorum Resource
• Transitive Ownership
z
Stable (no regroup
in progress)
18
Challenge/Defense Protocol
CS514
|
|
|
|
SCSI-2 has reserve/release verbs
z Semaphore on disk controller
Owner gets lease on semaphore
Renews lease once every 3 seconds
To preempt ownership:
z Challenger clears semaphore (SCSI bus reset)
z Waits 10 seconds
•
•
z
z
3 seconds for renewal + 2 seconds bus settle time
x2 to give owner two chances to renew
If still clear, then former owner loses lease
Challenger issues reserve to acquire semaphore
Challenge/Defense Protocol:
Successful Defense
CS514
Defender Node
Reserve Reserve
0
1
2
3
4
Reserve Reserve Reserve
5
6
7
Bus Reset
8
9
10 11 12 13 14 15 16
Reservation
detected
Challenger Node
19
Challenge/Defense Protocol:
Successful Challenge
CS514
Defender Node
Reserve
0
1
2
3
4
5
6
7
8
9
10 11 12 13 14 15 16
Bus Reset
Challenger Node
No
reservation
detected
Reserve
Regroup
CS514
|
|
|
|
Invariant:
All members agree on {members}
Regroup re-computes {members}
Each node sends heartbeat message to a peer (default is
one per second)
Regroup if two lost heartbeat messages
z
z
|
Uses a 5-round protocol to agree
z
z
|
Suspicion that sender is dead
Failure detection in bounded time
Checks communication among nodes
Suspected missing node may survive
Upper levels (global update, etc.) informed of regroup event
20
Membership State Machine
Initialize
Sleeping
Start Cluster
Search Fails
Member
Search
Minority or
no Quorum
Found
Online
Member
Joining
Search or
Reserve Fails
Quorum
Disk Search
Acquire (reserve)
Quorum Disk
Regroup
NonNon-Minority
and Quorum
Lost
Heartbeat
Join
Succeeds
CS514
Forming
Synchronize
Succeeds
Online
Joining A Cluster
CS514
|
|
|
When a node starts up, it mounts and configures only
local, non-cluster devices
Starts Cluster Service which
z Looks in local (stale) registry for members
z Asks each member in turn to sponsor new node’s
membership. (Stop when sponsor found.)
Sponsor (any active member)
z Sponsor authenticates applicant
z Broadcasts applicant to cluster members
z Sponsor sends updated registry to applicant
z Applicant becomes a cluster member
21
Forming A Cluster
(When Joining Fails)
|
|
|
Use registry to find quorum resource
Attach to (arbitrate for) quorum resource
Update cluster registry
from quorum resource
E.g., if we were down when it was in use
z
|
|
|
Form new one-node cluster
Bring other cluster resources online
Let others join your cluster
Leaving A Cluster
(Gracefully)
|
|
CS514
CS514
Pause:
z Move all groups off this member.
z Change to paused state (remains a cluster member)
Offline:
z Move all groups off this member.
z Sends ClusterExit message all cluster members
•
•
Prevents regroup
Prevents stalls during departure transitions
Close Cluster connections
(now not an active cluster member)
z Cluster service stops on node
Evict: remove node from defined member list
z
|
22
Leaving A Cluster
(Node Failure)
|
|
z
z
|
Minority group OR no quorum device:
• Group does NOT survive
Non-minority group AND quorum device:
• Group DOES survive
Non-Minority rule:
z
z
|
CS514
Node (or communication) failure triggers Regroup
If after regroup:
Number of new members >= 1/2 old active cluster
Prevents minority from seizing quorum device at the expense
of a larger potentially surviving cluster
Quorum guarantees correctness
z
Prevents “split-brain”
• E.g., with newly forming cluster containing a single node
Global Update
CS514
|
|
|
|
|
|
Propagates updates to all nodes in cluster
Used to maintain replicated cluster registry
Updates are atomic
and totally ordered
Tolerates all benign failures.
Depends on membership
z All are up
z All can communicate
R. Carr, Tandem Systems Review. V1.2 1985,
sketches regroup and global
update protocol
23
Global Update Algorithm
|
CS514
ac
k
|
L
Cluster has locker node that
regulates updates
z Oldest active node in cluster
Send Update to locker node
Update other (active) nodes
z In seniority order (e.g. locker first)
z This includes the updating node
Failure of all updated nodes: S
z Update never happened
z Updated nodes will roll back on recovery
Survival of any updated nodes:
z New locker is oldest and so has
update if any do
z New locker restarts update
X=
10
0!
|
|
|
Cluster Registry
CS514
|
|
Separate from local Windows NT Registry
Maintains cluster configuration
z
|
|
Members, resources, restart parameters,
etc.
Stable storage
Replicated at
each member
z
z
Global Update protocol
Windows NT Registry keeps local copy
24
Cluster Registry Bootstrapping
CS514
|
Membership uses Cluster Registry for
list of nodes
z
|
…Circular dependency
Solution:
z
z
z
Membership uses stale local cluster registry
Refresh after joining or forming cluster
Master is either
•
•
Quorum device, or
Active members
Resource Monitor
CS514
|
Polls resources:
z
|
Detects failures
z
z
|
IsAlive and LooksAlive
polling failure
failure event
from resource
Higher levels tell it
z
z
Online, Offline
Restart
25
Failover Manager
Failover Manager
CS514
|
Assigns groups to nodes based on
z
z
z
Resource Monitor
Failover parameters Cluster Registry
Possible nodes
Global Update
for each resource
Membership
in group
Regroup
Preferred nodes for resource group
Windows NT Server
Cluster
Disk Driver
Cluster
Net Drivers
Resource
CS514
|
Fails over (moves) from one machine to
another
z
z
z
z
|
|
|
Logical disk
IP address
Server application
Database
Online at only one machine
May depend on another resource
Well-defined properties controlling its
behavior
26
Resource Properties
CS514
|
|
Resource type
Poll intervals
z
z
|
Looksalive
Isalive
Private resource
data
z
z
Group
membership
| Possible nodes
| Restart policy
| Dependencies
|
Unique identifier
Hardware
binding
Time
CS514
|
|
|
|
Time must
increase monotonically
z Otherwise applications
get confused
z E.g., make/nmake/build
Time is maintained within failover resolution
z Not hard, since failover on order of seconds
Time is a resource, so one node
owns time resource
Other nodes periodically correct
drift from owner’s time
27
Win 2000 MSCS Cluster
Client PCs
Server B
Server A
Heartbeat
Disk cabinet 1
Cluster management
Disk cabinet 2
Clients
Clients
Server C
Server D
Application Evolution
MSCS Version NT 4.0
CS514
Application
Microsoft SQL Server
Node 1 Node 2
;
;
Microsoft Transaction
Server (MTS)
Internet Information
Server (IIS)
Microsoft Exchange
Server
;
;
28
Application Evolution
MSCS Version Windows 2000
CS514
Application
Node 1 Node 2 Node 3 Node 4
;
;
;
;
;
;
;
;
Internet Information
Server (IIS)
;
;
;
;
Microsoft Exchange
Server
;
;
;
;
Microsoft SQL Server
Microsoft Transaction
Server (MTS)
29