Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CS514: Intermediate Course in Computer Systems Lecture 8: Sept. 24, 2003 Cluster Computing Guest lecture from Werner Vogels Today: Cluster Computing Werner Vogels Dept. of Computer Science Cornell University 1 Agenda CS514 | | | | | History of Cluster Computing Major Cluster Systems Technical Challenges Software Architectures Cluster Management Systems z z | MSCS Galaxy Student Research Projects in Cluster Computing What do I want you to know? | | | | | CS514 Why and how clusters are used? What is the difference between parallel and enterprise cluster computing? What are the major issues in hardware and software? What is a Cluster Management System? What do I need to do to work on a cluster computing project myself? 2 How to get more done … CS514 | | | Work Harder Work Faster Get Help | | | Processor Speed Algorithms Parallel processing Some History CS514 | | | | | Von Neuman thought parallelism to be impossible ILLIAC IV – first massive parallel machine (Illinois ’60) Japan’s 5th Generation Project USA – Grand Challenges Commercial: NCR, IBM Fijustu, Intel SSD, Gray, Convex 3 Traditional Users CS514 Scientists – investigate the unknown | Engineers – simulations | Retailers – data mining | Airlines – how to overbook | Financial – gaining 0.1% advantage | …… | The collapse of the Supercomputing Industry CS514 ’97 the industry icon Cray Research went bankrupt. | Many reasons were given, among which the end of the cold war | 4 The Real Reasons CS514 | Microprocessors got fast, a lot faster | High Availability became a mass market. Some early examples CS514 | Vax-Clusters / OpenVMS Cluster 5 Tandem Himalaya CS514 | Traditional for the HA – market IBM SYSPLEX CS514 6 Cluster Definition CS514 | Consists of a collection of interconnected whole computers | Is used as a single, unified computing resource Distinction from Other Systems | | | | CS514 Scaling: Adding a head or a whole dog Availability: what if a dog breaks a leg? System management: walking the dog Software licensing: dog tax 7 Technical Challenges - I CS514 | | | | Cluster Hardware (NOW, rack&stack, NUMAs) Cluster Communication (Interconnects, Communication Protocols) Cluster System Middleware (management, availability, tools) High-performance IO systems (storage, file systems, data placement and movement) Technical Challenges - II CS514 | | | | | Job and Resource Management Programming Environments (Distr. Objects, Message Passing) Scalable Services. Business frameworks (multi tiers, web based, decision support) Applications (Scientific, High-Availability, Scalable performance) 8 Single System Image CS514 | From the perspective of User z Network z Application z Administrator z | Key Issues: Each SSI has a boundary z SSI support can exist at different levels z Single System Image - II CS514 | Boundary: Inside a single machine z Outside a collection of machines z | SSI Levels Application z Middleware z Operating System z Hardware z 9 Is Transparency a good thing? CS514 Yes, but achieving it is close to impossible | Many transparencies were introduced in legacy code with disastrous side effects | User to cluster is possible | Server side should be avoided | 1997 M icrosoft.C om W eb site (4 farm s). B uilding 11 Sta gin g S ervers (7) Intern al W W W Router SQ L Reportin g Live SQ L S erver ww w.m icros oft.com (4) prem E uropean Dium ata.mCicr e osoft.com nter ww w.m icros oft.com (1) (3) SQ LN et SQ L SER VER S F eeder LAN SQ L C ons olidators (2) R outerD M Z Stagin g S er vers F TP Live SQ L S ervers Dow nload S erver (1) M O SW est Sw itched Adm in LAN Ethern et M OSW est F TP S ervers D ownload Replic ation sn.c om register.m icros oft.com w ww .m icrosoft.com m sid.m (1) (2) (4) search.m icros oft.com (3) hom e.m icros oft.com hom e.m icr os oft.com F D DI R ing (3) (4) (M IS2) prem ium .m icros oft.com (2) activex.m icros oft.com (2) F D DI R ing (M IS1) cdm .m icr os oft.com (1) Router m sid.m sn.com (1) Router R outer R outer pr em ium .m icros oft.com Router ww w .m icros oft.com (1) R outer (3) F D DI R ing hom e.m icros oft.com (M IS3) m sid.m sn.c om (2) (1) R outer R outer F TP.m icros oft.com (3) Prim ary G igaswitch Sec on dary G igaswitch w ww .m icrosoft.com (5) 0 register.m icr osoft.com F DD I Ring (2) (M IS4) register.m icr os oft.com hom e.m icros oft.com support.m icr os oft.com (1) (5) register.m sn.com (2) (2) support.m icr os oft.com search.m icr os oft.com(1) (3) CS514 search.m icr os oft.com (1) m sid.m sn.c om (1) Router Japan D ata Cww e nter SQ L SER VER S w.m icros oft.com prem ium .m icros oft.com(3) (1) (2) m sid.m sn.c om (1) Sw itched Ethern et F TP D ow nload S erver (1) search.m icr os oft.com H T TP Dow nload S ervers (2) (2) R outer 2 OC 3 (45M b/Sec Eac h) 2 Ethernet (100 M b/Sec Eac h) 13 D S3 (45 M b/Sec Eac h) Internet 10 Two main Cluster Types CS514 | RACS – Reliable Array of Cloned Services Shared Nothing Clones | Shared Disk Clones RAPS – Reliable Array of Partioned Services Partitions Packed Partitions Taxonomy CS514 Farm Partition Clone Pack Shared Nothing Shared Disk Shared Nothing ActiveActive ActivePassive 11 Sample E-Commerce Farm The FARM: Clones and Packs of Partitions CS514 Packed Partitions: Database Transparency SQL Partition 3 Cloned Packed file servers Web File SQL Partition 2 Web File StoreA Web Clients SQL SQLPartition1 Database SQL Temp State Load Balance Windows NT Clusters MSCS | | | | | | | CS514 Group of independent systems that appear as a single system Managed as a single system Common namespace Services are “cluster-wide” Ability to tolerate component failures Components can be added transparently to users Existing client connectivity is not affected by clustered applications 12 MSCS Features CS514 | Shared nothing z | | Remoteable tools Windows NT manageability enhancements z | | Simplified hardware configuration Never take a “cluster” down: shell game rolling upgrade Microsoft® BackOffice™ product support Provide clustering solutions for all levels of customer requirements z Eliminate cost and complexity barriers Non-Features Of MSCS CS514 | | Not lock-step/fault-tolerant Not able to “move” running applications z | “MSCS” restarts applications that are failed over to other cluster members Not able to recover shared state between client and server (i.e., file position) z z All client/server transactions should be atomic Standard client/server development rules still apply • Atomic Consistent Isolated Durable (ACID) always wins 13 Basic MSCS Terms CS514 | Quorum Resource z z Usually (but not necessarily) a SCSI disk Requirements: • Arbitrates for a resource by supporting the challenge/defense protocol • Capable of storing cluster registry and logs z Used to Persist Configuration Change Logs • Tracks changes to configuration database when any defined member missing (not active) • Prevents configuration partitions in time • “Temporal Partitions” “MSCS” Cluster Client PCs CS514 Server A Server B Heartbeat Disk cabinet A Cluster management Disk cabinet B 14 “MSCS” Namespace Cluster view CS514 Cluster name Node name Node name Virtual server name Virtual server name Virtual server name Virtual server name MSCS Namespace Outside world view CS514 Cluster Node 1 Node 2 Virtual server 1 Internet Information Server SQL IP address: 1.1.1.1 Network name: WHECCLUS IP address: 1.1.1.2 Network name: WHECNode1 IP address: 1.1.1.3 Network name: WHECNode2 IP address: 1.1.1.4 Network name: WHECWHEC-VS1 Virtual server 2 Virtual server 3 MTS “Falcon” Microsoft Exchange IP address: 1.1.1.5 Network name: WHECWHEC-VS2 IP address: 1.1.1.6 Network name: WHECWHEC-VS3 15 MSCS Architecture Cluster API Cluster administrator Cluster API DLL Cluster.ExeCS514 Cluster API DLL Cluster API stub Log Manager Database Manager Event Processor Checkpoint Manager Failover Manager Application resource DLL Global Update Manager Object Manager Resource Manager Resource monitors Membership Manager Node Manager Resource API Physical Logical Application resource DLL resource DLL resource DLL Reliable Cluster Transport + Heartbeat Network MSCS Architecture | Cluster service is comprised of the following objects z z z z z z z z z CS514 Failover Manager (FM) Resource Manager (RM) Node Manager (NM) Membership Manager (MM) Event Processor (EVT) Database Manager (DM) Object Manager (OM) Global Update Manager (GUM) Checkpoint Manager (CM) 16 Cluster Communications CS514 | | | Most communication via RPC UDP used for membership heartbeat messages Standard (e.g., Ethernet) interconnects Management apps RPC Cluster Service RPC RPC: admin UDP: Heartbeat Cluster Service RPC RPC Resource Monitors Resource Monitors Resource Monitors Resource Monitors MSCS Architecture Failover Manager CS514 | Top tier provides cluster abstractions | Middle tier provides distributed operations | Bottom tier is Windows NT and drivers Resource Monitor Cluster Registry Global Update Quorum Membership Windows NT Server Cluster Disk Driver Cluster Net Drivers 17 Membership And Regroup CS514 | Membership: z | Used for orderly addition and removal from { active nodes } Regroup: Used for failure detection (via heartbeat messages) z Forceful eviction from { active nodes } z Membership CS514 | | Defined cluster = all nodes Active cluster: z z Subset of defined cluster Includes Quorum Resource • Transitive Ownership z Stable (no regroup in progress) 18 Challenge/Defense Protocol CS514 | | | | SCSI-2 has reserve/release verbs z Semaphore on disk controller Owner gets lease on semaphore Renews lease once every 3 seconds To preempt ownership: z Challenger clears semaphore (SCSI bus reset) z Waits 10 seconds • • z z 3 seconds for renewal + 2 seconds bus settle time x2 to give owner two chances to renew If still clear, then former owner loses lease Challenger issues reserve to acquire semaphore Challenge/Defense Protocol: Successful Defense CS514 Defender Node Reserve Reserve 0 1 2 3 4 Reserve Reserve Reserve 5 6 7 Bus Reset 8 9 10 11 12 13 14 15 16 Reservation detected Challenger Node 19 Challenge/Defense Protocol: Successful Challenge CS514 Defender Node Reserve 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Bus Reset Challenger Node No reservation detected Reserve Regroup CS514 | | | | Invariant: All members agree on {members} Regroup re-computes {members} Each node sends heartbeat message to a peer (default is one per second) Regroup if two lost heartbeat messages z z | Uses a 5-round protocol to agree z z | Suspicion that sender is dead Failure detection in bounded time Checks communication among nodes Suspected missing node may survive Upper levels (global update, etc.) informed of regroup event 20 Membership State Machine Initialize Sleeping Start Cluster Search Fails Member Search Minority or no Quorum Found Online Member Joining Search or Reserve Fails Quorum Disk Search Acquire (reserve) Quorum Disk Regroup NonNon-Minority and Quorum Lost Heartbeat Join Succeeds CS514 Forming Synchronize Succeeds Online Joining A Cluster CS514 | | | When a node starts up, it mounts and configures only local, non-cluster devices Starts Cluster Service which z Looks in local (stale) registry for members z Asks each member in turn to sponsor new node’s membership. (Stop when sponsor found.) Sponsor (any active member) z Sponsor authenticates applicant z Broadcasts applicant to cluster members z Sponsor sends updated registry to applicant z Applicant becomes a cluster member 21 Forming A Cluster (When Joining Fails) | | | Use registry to find quorum resource Attach to (arbitrate for) quorum resource Update cluster registry from quorum resource E.g., if we were down when it was in use z | | | Form new one-node cluster Bring other cluster resources online Let others join your cluster Leaving A Cluster (Gracefully) | | CS514 CS514 Pause: z Move all groups off this member. z Change to paused state (remains a cluster member) Offline: z Move all groups off this member. z Sends ClusterExit message all cluster members • • Prevents regroup Prevents stalls during departure transitions Close Cluster connections (now not an active cluster member) z Cluster service stops on node Evict: remove node from defined member list z | 22 Leaving A Cluster (Node Failure) | | z z | Minority group OR no quorum device: • Group does NOT survive Non-minority group AND quorum device: • Group DOES survive Non-Minority rule: z z | CS514 Node (or communication) failure triggers Regroup If after regroup: Number of new members >= 1/2 old active cluster Prevents minority from seizing quorum device at the expense of a larger potentially surviving cluster Quorum guarantees correctness z Prevents “split-brain” • E.g., with newly forming cluster containing a single node Global Update CS514 | | | | | | Propagates updates to all nodes in cluster Used to maintain replicated cluster registry Updates are atomic and totally ordered Tolerates all benign failures. Depends on membership z All are up z All can communicate R. Carr, Tandem Systems Review. V1.2 1985, sketches regroup and global update protocol 23 Global Update Algorithm | CS514 ac k | L Cluster has locker node that regulates updates z Oldest active node in cluster Send Update to locker node Update other (active) nodes z In seniority order (e.g. locker first) z This includes the updating node Failure of all updated nodes: S z Update never happened z Updated nodes will roll back on recovery Survival of any updated nodes: z New locker is oldest and so has update if any do z New locker restarts update X= 10 0! | | | Cluster Registry CS514 | | Separate from local Windows NT Registry Maintains cluster configuration z | | Members, resources, restart parameters, etc. Stable storage Replicated at each member z z Global Update protocol Windows NT Registry keeps local copy 24 Cluster Registry Bootstrapping CS514 | Membership uses Cluster Registry for list of nodes z | …Circular dependency Solution: z z z Membership uses stale local cluster registry Refresh after joining or forming cluster Master is either • • Quorum device, or Active members Resource Monitor CS514 | Polls resources: z | Detects failures z z | IsAlive and LooksAlive polling failure failure event from resource Higher levels tell it z z Online, Offline Restart 25 Failover Manager Failover Manager CS514 | Assigns groups to nodes based on z z z Resource Monitor Failover parameters Cluster Registry Possible nodes Global Update for each resource Membership in group Regroup Preferred nodes for resource group Windows NT Server Cluster Disk Driver Cluster Net Drivers Resource CS514 | Fails over (moves) from one machine to another z z z z | | | Logical disk IP address Server application Database Online at only one machine May depend on another resource Well-defined properties controlling its behavior 26 Resource Properties CS514 | | Resource type Poll intervals z z | Looksalive Isalive Private resource data z z Group membership | Possible nodes | Restart policy | Dependencies | Unique identifier Hardware binding Time CS514 | | | | Time must increase monotonically z Otherwise applications get confused z E.g., make/nmake/build Time is maintained within failover resolution z Not hard, since failover on order of seconds Time is a resource, so one node owns time resource Other nodes periodically correct drift from owner’s time 27 Win 2000 MSCS Cluster Client PCs Server B Server A Heartbeat Disk cabinet 1 Cluster management Disk cabinet 2 Clients Clients Server C Server D Application Evolution MSCS Version NT 4.0 CS514 Application Microsoft SQL Server Node 1 Node 2 ; ; Microsoft Transaction Server (MTS) Internet Information Server (IIS) Microsoft Exchange Server ; ; 28 Application Evolution MSCS Version Windows 2000 CS514 Application Node 1 Node 2 Node 3 Node 4 ; ; ; ; ; ; ; ; Internet Information Server (IIS) ; ; ; ; Microsoft Exchange Server ; ; ; ; Microsoft SQL Server Microsoft Transaction Server (MTS) 29