P00603 Lectures 2
Distributed Systems
C Cox

Topics: distributed systems evolution, principles, pervasive computing, transparency, distributed file systems, distributed operating systems, recovery and fault tolerance.

Reading
• A.S. Tanenbaum and M. van Steen (2003) Distributed Systems: Principles and Paradigms, Prentice Hall. **
• G. Coulouris, J. Dollimore and T. Kindberg (2005) Distributed Systems: Concepts and Design, 4th Edition, Addison Wesley. **
• B. Wilkinson and M. Allen (2004) Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Prentice Hall. *
• Middleware Architecture with Patterns and Frameworks, http://proton.inrialpes.fr/~krakowia/MW-Book/Chapters/Intro/intro.html

What are distributed computer systems? Compare:

Centralised Systems
• One system with non-autonomous parts
• System shared by users all the time
• All resources accessible
• Software runs in a single process
• (often) single physical location
• Single point of control (and management)
• Single point of failure

Distributed Computer Systems - 1
• Multiple autonomous components
• Components shared by users
• Resources may not be accessible
• Software can run in concurrent processes on different processors
• (often) multiple physical locations
• Multiple points of control
• Multiple points of failure
• No global time
• No shared memory

Distributed Computer Systems - 3
Tanenbaum definition: a distributed system is "a collection of independent computers that appears to its users as a single coherent system" (the idea of a virtual computer):
• autonomous computers
• connected by a network
• specifically designed to provide an integrated computing environment

Evolution of Distributed Computer Systems
• 1960 centralized – mainframe: happy system manager, centralized control, sad users, single point of failure
• 1970 localized – minicomputers: localized control
• 1980 de-centralized – PCs: sad system manager, user control, happy users
• 1985 networked – PCs on LAN and WAN: client-server
• 1990 distributed – distributed systems: happy system manager, distributed management, happy users, distributed applications, middleware, virtual computing
• 2000 – internet computing, grid, web services, cluster and cloud computing
• 2010 – mobile, ubiquitous and pervasive computing

Notable Developments
• 1970s: email, electronic data interchange (EDI), Ethernet
• 1980s: PC, client-server computing, RPC
• 1990s: WWW, PVM, MPI, CORBA, XML, GRID
• 2000s: .NET, Web Services, SOAP, pervasive computing, grid distributed computing

Computer processor performance evolution (see http://www.top500.org/) [performance chart]

Motivation for Distributed Computer Systems
• High cost of a powerful single processor – it is cheaper (£/MIP) to buy many small machines and network them than to buy a single large machine. Since 1980 computer performance has increased by a factor of about 1.5 per year
• Share resources
• Distributed applications and mobility of users
• Efficient low cost networks
• Availability and reliability – if one component fails the system will continue
• Scalability – easier to upgrade the system by adding more machines than to replace the only system
• Computational speedup
• Service provision – need for resource and data sharing and remote services
• Need for communication

More complex distributed computing examples - 1
Computing dominated problems (distributed processing)
• Computational Fluid Dynamics (CFD) and structural dynamics (using the Finite Element Method)
• Environmental and biological modeling – human genome project, pollution and disease control, traffic simulation, weather and climate modeling, global climate model
• Economic and financial modeling
• Graphics rendering for visualization
• Network simulation – telecommunications, power grid

More complex distributed computing examples - 2
Storage dominated problems (distributed data)
• Distributed databases: Google BigTable, Amazon Dynamo, Windows Azure
• Peer network data stores: BitTorrent, Chord, GNUnet
• Applications: data mining, image processing, seismic data analysis, insurance analysis

More complex distributed computing examples - 3
Communications dominated problems
• Transaction processing – banks, credit cards, EFTPOS
• Video on demand (also text, image and simulation on demand)
• Electronic banking, electronic shopping
• Search engines, e.g. Google

Ubiquitous Computing (or pervasive computing)
Concerned with the increased integration (and transparency) of computing devices [Weiser, 1991]: computers everywhere, but hidden (embedded into the environment).
• First wave – mainframe: 1 processor, N users
• Second wave – PC: 1 processor, 1 user
• ... transition ... distributed computing: N users, N processors
• Third wave – ubiquitous computing: 1 user, N processors

Pervasive computing - concept
Computers and computational devices are everywhere
• But they are not always obvious
• They become part of the environment
[Chart: Google Trends, “Pervasive Computing”]

Computing: The Trend
[Trend chart]

Distributed Computer System Metrics
• Latency – network delay before any data is sent
• Bandwidth – maximum channel capacity (analogue communication: Hz; digital communication: bps); both latency and bandwidth can be estimated empirically, as in the sketch after this list
• Granularity – relative size of the units of processing required. Distributed systems operate best with coarse granularity because, in general, communication is slow compared to processing
• Processor speed – MIPS, FLOPS
• Reliability – ability to continue operating correctly for a given time
• Fault tolerance – resilience to partial system failure
• Security – policy to deal with threats to the communication or processing of data in the system
• Administrative/management domains – issues concerning the ownership of, and access to, distributed system components
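The latency and bandwidth metrics above can be estimated with a simple probe. The following is a minimal Python sketch, not a calibrated benchmark: it assumes some TCP echo service is reachable at the hypothetical HOST:PORT, times a tiny message (approximating round-trip latency) and a 1 MB message, and derives a crude bandwidth figure from the difference.

```python
# Rough latency/bandwidth probe: a minimal sketch only. HOST and PORT are
# hypothetical; any server that echoes back whatever it receives will do.
import socket, time

HOST, PORT = "127.0.0.1", 9000   # hypothetical echo server

def round_trip(payload: bytes) -> float:
    """Send payload, read the echo back, return elapsed seconds."""
    with socket.create_connection((HOST, PORT)) as s:
        start = time.perf_counter()
        s.sendall(payload)
        received = 0
        while received < len(payload):      # wait for the full echo
            chunk = s.recv(65536)
            if not chunk:
                break
            received += len(chunk)
        return time.perf_counter() - start

if __name__ == "__main__":
    latency = round_trip(b"x")              # tiny message ~ pure round-trip delay
    t_big = round_trip(b"x" * 1_000_000)    # 1 MB transfer
    # Subtract the fixed delay, then bytes/second gives a crude bandwidth estimate.
    bandwidth = 1_000_000 / max(t_big - latency, 1e-9)
    print(f"latency ~ {latency*1000:.2f} ms, bandwidth ~ {bandwidth/1e6:.1f} MB/s")
```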
Distributed Computer System Architectures
• Flynn's (1966, 1972) classification of computer systems in terms of instruction and data stream organization
• Based on the von Neumann model (separate processor and memory units)
• 4 machine organizations:
• SISD - Single Instruction, Single Data
• SIMD - Single Instruction, Multiple Data
• MISD - Multiple Instruction, Single Data
• MIMD - Multiple Instruction, Multiple Data
[Classification tree: computers divide into SISD (e.g. Pentium), SIMD (vector/parallel machines), MISD (??) and MIMD; MIMD further divides into SM and DM]
• Distributed computers are essentially all MIMD machines
• SM – shared memory, multiprocessor, e.g. SUN SPARC
• DM – distributed memory, multicomputer, e.g. Cray, LAN cluster

Flynn Architectures II
[Block diagrams: SISD – serial processor (one CU feeding one PU, one data stream); SIMD – array processor (one CU driving many PUs and data streams); MISD – no real examples, possibly some pipeline architectures; MIMD – multiprocessor and multicomputer (many CUs and PUs). Key: CU – control unit, PU – processor unit, I – instruction stream, D – data stream]

Network Computing and Supercomputing
[Diagram relating supercomputing, distributed computing and network computing]
• Supercomputing: high performance computing, e.g. the driver/worker parallel computing model using PVM or MPI
• Network computing: typically client-server computing with sockets

Cluster Computing
[Diagram: cluster computing lies between supercomputing and distributed computing, based on LAN technology]

GRID (Meta)Computing
[Diagram: GRID computing lies between supercomputing and distributed computing, based on WAN technology]

Operation Transparency in a Distributed System
• Access – hide differences in data representation and in how a resource is accessed, e.g. NFS, SQL
• Location – hide where a resource is physically located, e.g. URL, tables in a distributed database
• Migration (or mobility) – hide that a resource may move to another location, e.g. mobile phone
• Relocation – hide that a resource may be moved to another location while in use
• Replication – if a resource is replicated among several locations, it should appear to the user as a single resource, e.g. distributed database, mirrored web site
(contd.)

Operation Transparency in a Distributed System (contd.)
• Scaling (or scalability) – users are unaware when the system size or specification changes (increase or decrease), e.g. the world-wide web, or as applications change – except for a change in quality of service
• Performance – users are unaware that the system is reconfigured to allow processing to be distributed automatically among the available processors
• Concurrency – hide that a resource may be shared by several competing users
• Failure – hide the failure and recovery of a resource, e.g. email
• Persistence – hide whether a (software) resource is in memory or on disk

Design Issues
• Placement of processes on processors: considers the required or optimal placement of processes, applications or components onto processors, then the interrelation or communication between processes or components
• No global clock: therefore synchronization strategies are required
• Flexibility: important that systems operate efficiently during the intended lifetime and that future changes can be made easily
• Failure handling, reliability and fault tolerance: independent failures; detection, masking, fault tolerance and recovery. Series reliability: R_series = R1 × R2 × R3. Parallel reliability: R_parallel = 1 − (1 − R1)(1 − R2)(1 − R3). The basis of reliability enhancement is replication for redundancy (see the sketch after this list)
• Performance, in terms of metrics: throughput (jobs per hour), system and network utilization; the ability to utilize concurrent operation; must include communication and synchronization delays; must apply to all applications / overall system operation; must apply to a range of job (grain) sizes; need to consider quality of service, especially as a service provider
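The series and parallel reliability expressions above are easy to evaluate. Below is a small illustrative Python helper; the three-component reliability values are made-up example figures, not data from the lecture.

```python
# Series vs. parallel (replicated) reliability, as in the formulas above.
from math import prod

def series_reliability(rs):
    """All components must work: R_series = R1 * R2 * ... * Rn."""
    return prod(rs)

def parallel_reliability(rs):
    """At least one replica must work: R_parallel = 1 - (1-R1)(1-R2)...(1-Rn)."""
    return 1 - prod(1 - r for r in rs)

if __name__ == "__main__":
    rs = [0.95, 0.95, 0.95]          # made-up per-component reliabilities
    print(series_reliability(rs))    # ~0.857: chaining components hurts reliability
    print(parallel_reliability(rs))  # ~0.999875: replication for redundancy helps
```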
Distributed Computing Paradigms
• Network programming – the client-server model using TCP or UDP, sockets and message passing (a minimal socket sketch follows this list)
• Concurrent programming – UNIX fork and threads; parallel programming using clusters of multicomputers and message passing libraries, e.g. PVM, MPI
• Object based systems – Java RMI, CORBA etc. (ref. Tanenbaum Ch. 9)
• Distributed file systems – e.g. NFS (ref. Tanenbaum Ch. 10)
• Document based systems – e.g. Lotus Notes and the WWW (ref. Tanenbaum Ch. 11)
• Distributed coordination based systems – e.g. Linda, TIB, Java Jini (ref. Tanenbaum Ch. 12)
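As an illustration of the first paradigm (client-server with sockets), here is a minimal TCP echo server and client sketch in Python. The port number is an arbitrary choice and error handling is omitted; it is not a production pattern.

```python
# Minimal client-server sketch with TCP sockets (first paradigm above).
# Run with "server" or "client" as the only argument; port 9000 is arbitrary.
import socket, sys

HOST, PORT = "127.0.0.1", 9000

def server():
    with socket.create_server((HOST, PORT)) as srv:
        print(f"echo server listening on {HOST}:{PORT}")
        while True:
            conn, _ = srv.accept()
            with conn:
                while chunk := conn.recv(65536):   # echo until the client closes
                    conn.sendall(chunk)

def client(message: str):
    with socket.create_connection((HOST, PORT)) as s:
        s.sendall(message.encode())
        print("reply:", s.recv(4096).decode())

if __name__ == "__main__":
    if sys.argv[1:] == ["server"]:
        server()
    else:
        client("hello distributed world")
```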
Distributed File Systems
• A file system that allows files to be accessed by hosts sharing a network
• Clients do not have control of the file storage but access it using a communication protocol
• Transparency is an essential functional requirement
• Distributed file systems may offer replication and hence fault tolerance
• Distributed file systems usually use a LAN; distributed file stores often use a WAN
Microsoft DFS implementation:
• Standalone – allows a DFS root on the local computer, e.g. Windows NT
• Domain based – stores the DFS configuration within an Active Directory

Hadoop File System (Apache Foundation)
• Hadoop Distributed File System (HDFS)
• Open source, derived from the Google GFS
• High fault tolerance, high throughput
• Client-server architecture
• Multiple servers, each storing part of the file system data
• Fault detection and quick automatic recovery
• Good scalability (from a single server to thousands of machines)
• An HDFS cluster has one Namenode (to manage the file system name space, regulate access by clients and maintain metadata) and one Datanode per cluster node
• Uses replication: a file is sequenced into blocks, all the same size except the last; the default replication factor is 3. This provides fault tolerance

Hadoop (more . . .)
• The HDFS communication protocol is layered on top of TCP/IP and uses RPC
• A Java API is available, with provision also for C and Python
• Download from Apache
• Runs on Windows, Unix, OS X
• Map and reduce – a mechanism to fracture work across blocks and recombine the results (illustrated in the sketch below)
• Hadoop clusters are used commercially, e.g. Amazon, Facebook, AOL, Google, eBay
• Easiest implementation is write once, read many
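The map/reduce idea can be illustrated without Hadoop itself. The sketch below is plain Python, not Hadoop's Java API: the input is split into blocks (standing in for HDFS blocks), each block is mapped independently, and the partial results are reduced into one answer.

```python
# Conceptual map/reduce sketch in plain Python (not Hadoop's actual API):
# split the data into blocks, map each block independently, reduce the results.
from collections import Counter
from functools import reduce

def map_block(block: str) -> Counter:
    """Map phase: count words within one block of the file."""
    return Counter(block.split())

def reduce_counts(a: Counter, b: Counter) -> Counter:
    """Reduce phase: merge two partial word counts."""
    return a + b

if __name__ == "__main__":
    blocks = ["the quick brown fox", "the lazy dog", "the fox again"]  # stand-in for HDFS blocks
    partial = [map_block(b) for b in blocks]        # independent, so parallelizable
    total = reduce(reduce_counts, partial, Counter())
    print(total.most_common(3))                     # [('the', 3), ('fox', 2), ...]
```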
Distributed Operating Systems - 1
Evolution:
Centralized (uniprocessor) operating system
• based on centralized systems
• resource management, process management, multitasking
• IPC, I/O, interrupt handling etc.
• e.g. MS-DOS, VAX VMS
Network operating system
• an extension of centralized operating systems that offers local services to remote clients
• each processor has its own operating system
• a user owns a machine, but can access others (e.g. rlogin, telnet)
• no global naming of resources
• the system has little fault tolerance
• e.g. UNIX, Windows NT, 2000 etc.

Uniprocessor Operating Systems
• Separating applications from operating system code through a microkernel [Figure 1.11]

Network Operating System - structure
• General structure of a network operating system [Figure 1-19]

Distributed (Multicomputer) Operating Systems - 2
Distributed operating systems:
• allow the resources of a multiprocessor or multicomputer network to be integrated as a single system image
• hide and manage hardware and software resources
• provide transparency support
• provide heterogeneity support
• control the network in the most effective way
• consist of low level commands + local operating systems + distributed features:
• inter-process communication (IPC)
• remote file and device access
• global addressing and naming
• trading and naming services
• synchronization and deadlock avoidance
• resource allocation and protection
• global resource sharing
• communication security
• no examples in general use, but many research systems: Amoeba, Chorus etc. (see Google “distributed systems research”)
• a network operating system with middleware is popularly considered a distributed operating system

Multicomputer Operating Systems (1)
• General structure of a multicomputer operating system [Figure 1.14]

Middleware acting as a distributed operating system
• A middleware layer on top of a NOS implementing general-purpose services to provide distribution transparency [Figure 1.1]

The Amoeba Distributed Operating System
• Developed by A.S. Tanenbaum (1983 onwards) as a research tool; it uses a large number of CPUs
• Communication is via RPC; it provides relatively good distribution transparency and security with efficient communication, but suffers from a lack of user control
• The aim was to develop a transparent distributed operating system for heterogeneous workstation and/or processor pool networks
• A personal multiprocessor (rather than networked multicomputer) concept
• Individual processors are not owned; users log onto the system anywhere
• The system allocates processors as needed, dynamically, so the system looks like a single virtual machine; Amoeba takes each user command and determines which CPU to execute it on
• Written in C but has its own parallel and distributed programming language, Orca

Replication of Data - maintaining copies on multiple computers (e.g. a distributed database)
Requirements
• Replication transparency – clients are unaware of the multiple copies
• Consistency of copies
Benefits
• Performance enhancement, e.g. replicate heavily loaded servers
• Reliability enhancement
• Data closer to the client
• Shared workload
• Increased availability: with n replicas, each unavailable with probability p, availability is 1 − p^n
• Increased fault tolerance
Constraints
• How to keep the data consistent (need to ensure a satisfactorily consistent image for clients)
• Where to place replicas and how updates are propagated
• Scalability

Data Centric Consistency Models
Distributed data, e.g. DFS, distributed database, distributed shared memory
• Operate on a single (virtual) data store
• With processes on different processors, the lack of a global clock makes absolute synchronisation difficult
Consistency models
• Strict – a read must return the value of the most recent write (so generally impossible in practice)
• Linearizable and sequential – maintain causality using synchronised clocks
• Causal and FIFO – weaker consistency: writes from the same processor are seen in the same order; the ordering of writes from different processors is not guaranteed

Fault Tolerant Services
• Improve availability/fault tolerance using replication
• Provide a service with correct behaviour despite n process/server failures, as if there were only one copy of the data
• Use of replicated services
• Operations need to be linearizable and sequentially consistent when dealing with distributed read and write operations (see Coulouris)
Fault tolerant system architectures: Client (C); Front End (FE) = client interface; Replica Manager (RM) = service provider

Passive Replication - single primary replica manager
[Diagram: clients and front ends communicating with a primary RM backed by backup RMs]
• All client requests (via front end processes) are directed to a nominated primary replica manager (RM)
• A single primary RM together with one or more secondary replica managers (operating as backups)
• The single primary RM is responsible for all front end communication and for updating the backup RMs
• Distributed applications communicate with the primary replica manager, which sends copies of the up-to-date data
• Requests for data update from the client interface to the primary RM are distributed to each backup RM
• If the primary replica manager fails, a secondary replica manager observes this and is promoted (elected) to act as the primary RM
• To tolerate n process failures, n + 1 RMs are needed
• Passive replication cannot tolerate Byzantine failures

Passive Replication - how it works (sketched below)
• An FE request is issued to the primary RM, each request with a unique id
• The primary RM receives the request
• It checks the request id, in case the request has already been executed
• If the request is an update, the primary RM sends the updated state and the unique request id to all backup RMs
• Each backup RM sends an acknowledgment to the primary RM
• When acknowledgments have been received from all backup RMs, the primary RM sends a request acknowledgment to the front end (client interface)
• All requests to the primary RM are processed in the order of receipt
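A minimal sketch of the passive (primary-backup) scheme just described, in Python. The class names and the in-process method calls are illustrative assumptions; a real system would propagate updates over RPC or sockets.

```python
# Passive (primary-backup) replication sketch: one primary RM applies updates,
# pushes the new state to its backups, and only then acknowledges the front end.
class ReplicaManager:
    def __init__(self):
        self.state = {}          # the replicated data
        self.seen = set()        # request ids already executed (duplicate filter)

    def apply(self, req_id, key, value):
        if req_id in self.seen:  # request may have been retransmitted
            return
        self.state[key] = value
        self.seen.add(req_id)

class PrimaryRM(ReplicaManager):
    def __init__(self, backups):
        super().__init__()
        self.backups = backups

    def update(self, req_id, key, value):
        self.apply(req_id, key, value)
        for b in self.backups:          # propagate state before acknowledging
            b.apply(req_id, key, value)
        return "ack"                    # the FE gets the ack only after all backups

if __name__ == "__main__":
    backups = [ReplicaManager(), ReplicaManager()]
    primary = PrimaryRM(backups)
    print(primary.update("req-1", "x", 42))   # ack
    print(backups[0].state)                   # {'x': 42} - backups track the primary
```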
Active Replication - multiple RMs operating, where the majority output is taken
[Diagram: clients and front ends multicasting to a group of RMs]
• Multiple (group) replica managers (RM), each with an equivalent role
• The RMs operate as a group
• Each front end (client interface) multicasts requests to the group of RMs
• Requests are processed by all RMs independently (and identically)
• The client interface compares all the replies received
• Can tolerate N failures out of 2N + 1 RMs, i.e. consensus when N + 1 identical responses are received
• Can tolerate Byzantine failure

Active Replication - how it works (sketched below)
• The client request is sent to the group of RMs using totally ordered reliable multicast, each request with a unique request id
• Each RM processes the request and sends the response/result back to the front end
• The front end collects (gathers) the responses from each RM
• Fault tolerance: individual RM failures have little effect on performance; to tolerate n process failures, 2n + 1 RMs are needed (to leave a majority of n + 1 operating)
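A minimal active-replication sketch in Python: every RM executes the same request, the front end gathers all replies and takes the majority answer. The class names are illustrative, the multicast is simulated by direct calls, and the "faulty" RM is a contrived stand-in for a Byzantine (arbitrarily wrong) replica.

```python
# Active replication sketch: all RMs process the request; the FE takes a majority vote.
from collections import Counter

class RM:
    def __init__(self, faulty=False):
        self.state, self.faulty = {}, faulty

    def handle(self, req_id, key, value):
        self.state[key] = value
        return "garbage" if self.faulty else f"stored {key}={value}"

class FrontEnd:
    def __init__(self, group):
        self.group = group                       # 2N+1 replica managers

    def request(self, req_id, key, value):
        replies = [rm.handle(req_id, key, value) for rm in self.group]  # "multicast"
        answer, votes = Counter(replies).most_common(1)[0]
        n = (len(self.group) - 1) // 2
        assert votes >= n + 1, "no majority - too many failures"
        return answer

if __name__ == "__main__":
    fe = FrontEnd([RM(), RM(), RM(faulty=True)])   # tolerates N=1 faulty out of 2N+1=3
    print(fe.request("req-1", "x", 42))            # the majority answer wins
```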
The Gossip Architecture - 1
Concept: replicate data close to the points where clients need it first. You exchange news with neighbours; if you receive news that is new to you, you want to know it; you gossip the news on to your neighbours. The aim is to provide high availability at the expense of weaker data consistency.
• A framework for providing highly available services through the use of replication
• RMs exchange (or gossip) updates in the background from time to time
• Multiple replica managers (RM), single front end (FE), which sends a query or update to any (one) RM
• A given RM may be unavailable, but the system is to guarantee a service

Gossip Service - 2 (how it works)
• The front end sends a time-stamped request to a single replica manager
• If the request is a query, the front end (client) blocks waiting for a reply, since the data should be at every RM
• If the request is an update, the update is carried out immediately by the local RM, which replies to the client; the update is then propagated to the other RMs in a lazy fashion using gossip messages

Reliable Group Communication - 1
[Diagram: a process group with join, leave, send/receive and stop/crash/become-faulty transitions]
Problem: provide a guarantee that all members of a process group receive a message.
• networks and protocols are geared toward point-to-point process communication
• for small groups, just use multiple point-to-point connections (recall pvm_mcast)
Problems with larger groups:
• with such complex communication schemes the probability of an error is increased
• a process may join, or leave, a group
• a process may become faulty, i.e. remain a member of a group but be unable to participate

Reliable Group Communication: simple case, where the members of the group are known and fixed
• The sender assigns a message sequence number to each message, so that a receiver can detect a missing message
• The sender retains each message (in a history buffer) until all receivers have acknowledged receipt
• A receiver can request a missing message (reactive), or the sender can resend if an acknowledgement is not received within a certain time (proactive)
• It is important to minimize the number of messages, so combine the acknowledgement with the next message
[Project idea: reliable multicast process communication using message buffering – see the sketch below]
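A sketch of the fixed-membership reliable multicast just described: the sender numbers every message, keeps it in a history buffer until all receivers have acknowledged it, and resends from the buffer when a receiver reports a gap. The classes and the direct method calls are illustrative assumptions; a real implementation would deliver over the network and handle loss and timeouts.

```python
# Reliable multicast sketch: sequence numbers, positive/negative acks, history buffer.
class Receiver:
    def __init__(self):
        self.expected, self.delivered = 0, []

    def deliver(self, seq, msg):
        if seq != self.expected:            # sequence number reveals a missing message
            return ("nack", self.expected)  # negative ack: first missing sequence number
        self.delivered.append(msg)
        self.expected += 1
        return ("ack", seq)

class Sender:
    def __init__(self, receivers):
        self.receivers, self.history, self.next_seq = receivers, {}, 0

    def multicast(self, msg):
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        self.history[seq] = msg             # retained until every receiver has acknowledged
        for r in self.receivers:
            kind, missing = r.deliver(seq, msg)
            if kind == "nack":              # resend the gap (and this message) from the buffer
                for s in range(missing, seq + 1):
                    r.deliver(s, self.history[s])
        del self.history[seq]               # everyone has acknowledged, so drop it

if __name__ == "__main__":
    receivers = [Receiver(), Receiver()]
    sender = Sender(receivers)
    for m in ["m0", "m1", "m2"]:
        sender.multicast(m)
    print(receivers[0].delivered)           # ['m0', 'm1', 'm2'] at every receiver
```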
Scalability in Reliable Multicasting
• Simple with small groups
• Problem with large groups: with N receivers, the sender needs to accept N acknowledgments (positive acknowledgement)
• Receivers could instead acknowledge only when a message is missing (negative acknowledgement) – this gives some communication performance optimization

Distributed Atomic Multicast Problem
• How to guarantee that a message from the sender is either received by all processes or by none
• A replicated database has one process per replica; if a replica crashes during an update it needs to schedule the update for later (see Tanenbaum)
• When a crashed replica is restarted it needs to rejoin the group, and provision must be made to bring the replica to the same state as the others
• Project ideas!

Recovery
• Once a failure has occurred, in many cases it is important to recover critical processes to a known state in order to resume processing
• The problem is compounded in distributed systems
Two approaches:
• Backward recovery, by use of checkpointing (a global snapshot of the distributed system status) to record the system state; but checkpointing is costly (performance degradation)
• Forward recovery, an attempt to bring the system to a new stable state from which it is possible to proceed (applied in situations where the nature of the errors is known and a reset can be applied)

Backward Recovery - most extensively used in distributed systems and generally the safest
• can be incorporated into middleware layers
• complicated in the case of process, machine or network failure
• no guarantee that the same fault will not occur again (deterministic view – affects failure transparency properties)
• cannot be applied to irreversible (non-idempotent) operations, e.g. an ATM withdrawal or UNIX rm *

Recovery - Combine Checkpointing with Message Logging (1)
• after each checkpoint all messages are logged
• sender based or receiver based logging
• provides additional information to check that rollback recovery is possible
• checkpointing can be expensive to incorporate
[Diagram: timeline of processes P1 and P2 showing checkpoints and the recovery line]

Recovery - Combine Checkpointing with Message Logging (2)
[Diagram: timeline of P1 and P2 with checkpoints and the recovery line]
• If either process P1 or P2 crashes, we need to recover to the most recent checkpoint that was not complicated by message passing activity
• An unfortunate sequence of messages (in which message activity is high compared to the checkpoint frequency) can lead to cascaded rollback, or the domino effect (one rollback leads to another)
• A possible solution is to use globally coordinated checkpointing, which requires global time synchronization, rather than independent (per processor) checkpointing
• Implementing rollback with independent checkpointing requires process dependencies to be considered
• Coordinated checkpointing requires synchronization to provide global storage of the system state; the saved states must be globally consistent, e.g. using a two-phase blocking protocol

Backward Recovery - Using a Two Phase Blocking Protocol (a coordinator sketch appears at the end of these notes)
Phase 1
• the coordinator broadcasts a ‘checkpoint request’ to all processes
• each process receives the message and saves a local checkpoint, then queues subsequent messages received and acknowledges back to the coordinator
Phase 2
• when the coordinator has received all the replies it multicasts ‘checkpoint done’, so the processes can continue
• the algorithm can be improved by multicasting only to processes that depend on the recovery of the coordinator

Distributed System Design - for message passing program design
• Start with a sequential program
• Determine dependencies, then identify code that can execute concurrently
• need to understand the algorithm
• exploit inherent parallelism
• may require some algorithm restructuring
• Decompose the problem using control (functional) or data parallelism
• Consider the available machine architecture
• Choose a programming paradigm
• Determine communication and add message passing code
• Compile and test
• Optimise performance:
• measure performance
• locate bottlenecks
• minimise message passing
• load balance

Problems with Distributed Computing
• Few standards (particularly for WAN meta-computing)
• Lack of expertise of users, developers and systems support staff
• High cost of good commercial software
• Immature software development tools
• Applications are difficult and time consuming to develop
• Portability problems in a heterogeneous environment
• Difficult to tune and optimise across all platforms
• Scheduling and load balancing
• Distributed fault tolerance is difficult, especially over a WAN
• Trade-off between performance when using remotely mounted files and the disk space requirements of using local disk
• How to handle legacy systems – rewrite, add an interface, or use wrappers?
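Finally, a sketch of the two-phase blocking checkpoint protocol described under "Backward Recovery - Using a Two Phase Blocking Protocol" above. The Coordinator and Process classes and their method names are illustrative assumptions; everything runs in one Python process, whereas a real implementation would broadcast the requests and replies over the network.

```python
# Two-phase blocking checkpoint sketch: request/save/queue, then done/resume.
class Process:
    def __init__(self, name):
        self.name, self.state, self.queue, self.blocked = name, {"work": 0}, [], False

    def on_checkpoint_request(self):
        checkpoint = dict(self.state)   # Phase 1: save a local checkpoint ...
        self.blocked = True             # ... then queue (do not process) new messages
        return checkpoint               # acknowledgement back to the coordinator

    def on_checkpoint_done(self):
        self.blocked = False            # Phase 2: resume and drain the queued messages
        pending, self.queue = self.queue, []
        return pending

class Coordinator:
    def __init__(self, processes):
        self.processes = processes

    def take_global_checkpoint(self):
        # Phase 1: broadcast 'checkpoint request' and gather every acknowledgement.
        saved = {p.name: p.on_checkpoint_request() for p in self.processes}
        # Phase 2: all replies are in, so the saved states are mutually consistent;
        # broadcast 'checkpoint done' and let the processes continue.
        for p in self.processes:
            p.on_checkpoint_done()
        return saved                    # a globally consistent recovery line

if __name__ == "__main__":
    procs = [Process("P1"), Process("P2")]
    print(Coordinator(procs).take_global_checkpoint())
```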