Building Blocks for High-Performance, Fault-Tolerant Distributed Systems
Nancy Lynch
Theory of Distributed Systems, MIT
AFOSR project review, Cornell University, May 2001

Project Participants
• Leaders: Nancy Lynch, Idit Keidar, Alex Shvartsman, Steve Garland
• PhD students: Victor Luchangco, Roger Khazan, Carl Livadas, Josh Tauber, Ziv Bar-Joseph, Rui Fan
• MEng: Rui Fan, Kyle Ingols, Igor Taraschanskiy, Andrej Bogdanov, Michael Tsai, Laura Dean
• Collaborators: Roberto De Prisco, Jeremy Sussman, Keith Marzullo, Danny Dolev, Alan Fekete, Gregory Chockler, Roman Vitenberg

Project Scope
• Define services to support high-performance distributed computing in dynamic environments: failures, changing participants.
• Design algorithms to implement the services.
• Analyze algorithms: correctness, performance, fault-tolerance.
• Develop necessary mathematical foundations: state machine models, analysis methods.
• Develop supporting languages and tools: IOA.

Talk Outline
I. View-oriented group communication services
II. Non-view-oriented group communication
III. Mathematical foundations
IV. IOA language and tools
V. Memory consistency models
VI. Plans

I. View-Oriented Group Communication Services

View-Oriented Group Communication Services
• Cope with changing participants using abstract groups of client processes with changing membership sets.
• Processes communicate with group members by sending messages to the group as a whole.
• GC services support management of groups:
  – Maintain membership information, form views.
  – Manage communication.
  – Make guarantees about ordering and reliability of message delivery.
• Examples: Isis, Transis, Totem, Ensemble, …

Using View-Oriented GC Services
• Advantages:
  – High-level programming abstraction.
  – Hides the complexity of coping with changes.
• Disadvantages:
  – Can be costly, especially when forming new views.
  – May have problems scaling to large networks.
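To make the view abstraction above concrete, here is a minimal sketch in Python (all names are hypothetical, chosen for illustration; this is not any of the systems named above): a view pairs an identifier with a membership set, and a multicast to the group is delivered to every member of the current view.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class View:
    """A group view: an identifier plus a membership set."""
    view_id: int
    members: frozenset

class Group:
    """Toy view-oriented group: multicasts go to all members of the current view."""
    def __init__(self, members):
        self.view = View(0, frozenset(members))
        self.inboxes = {p: [] for p in members}

    def install_view(self, members):
        """Membership change: form and install a new view."""
        self.view = View(self.view.view_id + 1, frozenset(members))
        for p in members:
            self.inboxes.setdefault(p, [])

    def mcast(self, sender, msg):
        """Deliver msg, tagged with the current view, to every member."""
        for p in self.view.members:
            self.inboxes[p].append((self.view.view_id, sender, msg))

g = Group({"i", "j", "k"})
g.mcast("i", "hello")
g.install_view({"i", "j"})   # k departs; the service forms a new view
g.mcast("j", "world")
```

The point of the abstraction is visible even in the toy: senders address the group, not individual processes, and membership changes are reported as views rather than handled by the application.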
• Applications:
  – Managing replicated data.
  – Distributed interactive games.
  – Multimedia conferencing, collaborative work.

Our Approach
• Mathematical, using state machines (I/O automata).
• Model everything:
  – Applications.
  – Service specifications.
  – Implementations of the services.
• Prove correctness.
• Analyze performance and fault-tolerance.
[Figure: applications interacting with a service; the service implemented by an algorithm.]

Our Earlier Work: VS [Fekete, Lynch, Shvartsman 97, 01]
• Defined automaton models for:
  – VS, a partitionable GC service, based on Transis.
  – TO, a non-view-oriented totally ordered broadcast service.
  – VStoTO, an algorithm based on [Amir, Dolev, Keidar, Melliar-Smith, Moser].
• Proved correctness.
• Analyzed performance and fault-tolerance: conditional performance analysis.
[Figure: TO service (bcast/brcv) implemented by VStoTO automata over VS (gpsnd/gprcv/newview).]

Conditional Performance Analysis
• Assume VS satisfies:
  – If a network component C stabilizes, then soon thereafter, views known within C become consistent, and messages sent in the final view are delivered everywhere in C, within bounded time.
• And VStoTO satisfies:
  – Simple timing and fault-tolerance assumptions.
• Then TO satisfies:
  – If C stabilizes, then soon thereafter, any message sent or delivered anywhere in C is delivered everywhere in C, within bounded time.

Ensemble [Hickey, Lynch, van Renesse 99]
• Ensemble system [Birman, Hayden], layered design.
• Worked with developers, following VS.
• Developed global specs for key layers.
• Modeled the Ensemble algorithm spanning between layers.
• Attempted a proof; found an algorithmic error.
• Modeled and analyzed the repaired system.
• The same error was found in Horus.

More Recent Progress
1. GC with unique primary views
2. Scalable GC
3. Optimistic Virtual Synchrony
4. GC service specifications

1. GC With Unique Primary Views

GC With Unique Primaries
• Dynamic View Service [De Prisco, Fekete, Lynch, Shvartsman 98]
  – Produces unique primary views.
  – Copes with long-term changes.
• Dynamic Configuration Service [DFLS 99]
  – Adds quorums.
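The state-machine approach described above can be illustrated with a minimal I/O-automaton-style sketch in Python (a toy model for illustration, not the IOA toolset): an automaton has state, input actions that are always enabled, and output actions guarded by preconditions.

```python
class ChannelAutomaton:
    """I/O-automaton-style sketch of a reliable FIFO channel:
    'send' is an input action (always enabled), 'receive' is an
    output action with a precondition and an effect."""
    def __init__(self):
        self.queue = []          # state: messages in transit

    def send(self, m):           # input action: effect appends to the queue
        self.queue.append(m)

    def receive_enabled(self):   # precondition for the output action
        return bool(self.queue)

    def receive(self):           # output action: effect removes the head
        assert self.receive_enabled()
        return self.queue.pop(0)

ch = ChannelAutomaton()
ch.send("a")
ch.send("b")
out = []
while ch.receive_enabled():
    out.append(ch.receive())
```

Service specifications, algorithms, and applications are all written in this one style, which is what makes composition and refinement proofs between them possible.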
  – Copes with long-term and transient changes.
• Dynamic Leader Configuration Service [D 99], [DL 01]
  – Adds leaders.

GC With Unique Primaries
• Algorithms to implement the services:
  – Based on the dynamic voting algorithm of [Yeger-Lotem, Keidar, Dolev 97].
  – Each primary needs a majority of all possible previous primaries.
  – Models, proofs, …
• Applications:
  – Consistent total order and consistent replicated data algorithms that tolerate both long-term and transient changes.
  – Models, proofs, …

Availability of Unique Primary Algorithms [Ingols, Keidar 01]
• Simulation study comparing unique primary algorithms:
  – [Yeger-Lotem, Keidar, Dolev], [DFLS]
  – 1-pending, like [Jajodia, Mutchler]
  – Majority-resilient 1-pending, like [Lamport], [Keidar, Dolev]
• Simulate repeated view changes, interrupting other view changes.
• Availability shown to depend heavily on:
  – Number of processes from the previous view needed to form a new view.
  – Number of message rounds needed to form a view.
• [YKD] and [DFLS] have the highest availability.

2. Scalable Group Communication [Keidar, Khazan 00]

Group Communication Service
• Manages group membership, current view.
• Multicast communication among group members, with ordering and reliability guarantees.
• Virtual Synchrony [Birman, Joseph 87]
  – Integrates group membership and group communication.
  – Processes that move together from one view to another deliver the same messages in the first view.
  – Useful for replicated data management.
  – Before announcing a new view, processes must synchronize and exchange messages.

Example: Virtual Synchrony
[Figure: processes i, j, k in view 3: {i,j,k}; i multicasts m; k fails; i and j move to view 4: {i,j}, and the VS algorithm supplies the missing m.]

Group Communication in WANs
• Difficulties:
  – High message latency; message exchanges are expensive.
  – Frequent connectivity changes.
• New, scalable GC algorithm:
  – Uses the scalable GM service of [Keidar, Sussman, et al. 00], implemented on a small set of membership servers.
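The majority rule stated above ("each primary needs a majority of all possible previous primaries") can be sketched as a one-line check in Python. This is a deliberate simplification of dynamic voting: a single known predecessor stands in for the set of all possible previous primaries, and the full [Yeger-Lotem, Keidar, Dolev 97] protocol does considerably more bookkeeping.

```python
def may_become_primary(new_view, last_primary):
    """Simplified dynamic-voting check: a candidate view may become
    primary only if it contains a majority of the last primary's
    members, so two disjoint candidates can never both qualify."""
    quorum = len(last_primary & new_view)
    return quorum * 2 > len(last_primary)

last = {"a", "b", "c", "d", "e"}
ok = may_become_primary({"a", "b", "c"}, last)   # 3 of 5: majority
bad = may_become_primary({"d", "e"}, last)       # 2 of 5: no majority
```

The uniqueness guarantee falls out of the arithmetic: any two views each holding a strict majority of the previous primary must intersect, so at most one of them can form.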
  – GC (with virtual synchrony) implemented on the clients.
[Figure: layered architecture: VSGC on clients, over the GM service and the network (Net).]

Group Communication in WANs
• Try to minimize the time from when the network stabilizes until GC delivers new views to clients.
• After stabilization: GM forms a view, then the VSGC algorithm synchronizes.
[Figure: timeline: network event, GM algorithm, VSGC algorithm, view(v) delivered.]
• Existing systems (LANs):
  – GM and VSGC use several message exchange rounds.
  – Continue in spite of new network events.
• Inappropriate for WANs.

New Algorithm
• VSGC uses one message exchange round, in parallel with GM's agreement on views.
• GM usually delivers views in one message exchange.
[Figure: timeline: network event; GM algorithm and VSGC algorithm run in parallel; view(v) delivered.]
• Responds to new network events during reconfiguration:
  – GM produces new membership sets.
  – VSGC responds to membership changes.
• Distributed implementation [Tarashchanskiy 00].

Correctness Proofs
• Models, proofs (safety and liveness).
• Developed new incremental modeling and proof methods [Keidar, Khazan, Lynch, Shvartsman 00].
• Proof Extension Theorem.
[Figure: specification S and its extension S'; algorithm A and its extension A'.]
• Used the new methods for the safety proofs.

Performance Analysis
• Analyze the time from when the network stabilizes until GC delivers new views to clients.
• The system is a composition:
  – Network service, GM services, VSGC processes.
• Compositional analysis:
  – Analyze the VSGC algorithm alone, in terms of its inputs and timing assumptions.
  – State reasonable performance guarantees for GM and the Network.
  – Combine to get conditional performance properties for the system as a whole.

Analysis of the VSGC Algorithm
• Assume component C stabilizes:
  – GM delivers the same views to the VSGC processes.
  – Net provides reliable communication with latency δ.
• Let:
  – T[start], T[view] be the times of the last GM events for C.
  – Δ be an upper bound on local step time.
• Then VSGC outputs new views by time max(T[start] + Δ + xδ, T[view]) + δ, where δ bounds the network latency, Δ bounds local step time, and x is a constant of the analysis.

Analysis of the VSGC Algorithm
[Figure: timeline: network event, GM start and view(v) events, VSGC view(v) output; the VSGC delay of Δ + xδ runs from T[start].]

Assumed Bounds for GM
[Figure: timeline bounding the GM start and view(v) events, T[start] and T[view], after the network event.]
• Bounds for the "fast path" of [Keidar, et al. 00], observed empirically in almost all cases.

Combining VSGC and GM Bounds
• Bounds for the whole system, conditional on the GM bounds.
[Figure: combined timeline from the network event through GM's start and view(v) events to the VSGC view(v) output.]

3. Optimistic Virtual Synchrony [Sussman, Keidar, Marzullo 00]
• Most GC algorithms block sending during reconfiguration.
• The OVS service provides:
  – Optimistic view proposal, before reconfiguration.
  – Optimistic sends after the proposal, during reconfiguration.
  – Delivery of optimistic messages in the next view, subject to application policy.
• Useful for applications:
  – Replicated data management.
  – State transfer.
  – Sending vectors of data.

4. GC Service Specifications [Chockler, Keidar, Vitenberg 01]
• Comprehensive set of specifications for properties guaranteed by GC services.
• Unifying framework.
• Safety properties:
  – Membership: view order, partitionable, primary component.
  – Multicast: sending view delivery, virtual synchrony.
  – Safe notifications.
  – Ordering, reliability: FIFO, causal, totally ordered, atomic.
• Liveness properties:
  – For eventually stable components: view stability, multicast delivery, safe notification liveness.
  – For eventually stable pairs.

II. Non-View-Oriented Group Communication

1. Totally Ordered Multicast with QoS [Bar-Joseph, Keidar, Anker, Lynch 00, 01]

Totally Ordered Multicast with QoS
• Multicast to a dynamic group, subject to joins, leaves, and failures.
• Global total ordering of messages.
• QoS: message delivery latency.
• Built on a reliable network with latency guarantees.
• Add ordering guarantees, preserve latency bounds.
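A global total order over messages from a dynamic group can be sketched with the classic rule this style of algorithm builds on: tag each message with a (slot, sender-index) pair and deliver in lexicographic order. The Python below is a toy illustration with assumed field names, not the actual TO-QoS algorithm.

```python
def total_order(messages):
    """Deliver messages in one global order: lexicographic on
    (slot, sender index). Any process applying this rule to the
    same message set obtains the same delivery sequence."""
    return sorted(messages, key=lambda m: (m["slot"], m["sender"]))

msgs = [
    {"slot": 2, "sender": 1, "body": "c"},
    {"slot": 1, "sender": 2, "body": "b"},
    {"slot": 1, "sender": 1, "body": "a"},
]
order = [m["body"] for m in total_order(msgs)]
```

The hard part of the real algorithm is not the sort but deciding, within the latency bound, which senders belong to each slot when processes join, leave, and fail; that is what the membership component in the next slides handles.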
• Applications:
  – State machine replication.
  – Distributed games.
  – Shared editing.

Two Algorithms
• Algorithm 1: Basic Totally Ordered Multicast
  – Sends and receives are consistent with a total ordering of messages.
  – Non-failing processes agree on messages from non-failing processes.
  – Latency: constant, even with joins, leaves, and failures.
• Algorithm 2: Atomic Multicast
  – Non-failing processes agree on all messages.
  – Latency:
    • Joins and leaves only: constant.
    • With failures: linear in f.
[Figure: TOM layered over Net, with fail_i, fail_j events.]

Local Node Process
[Figure: per-node architecture: FrontEnd_i handles join, leave, mcast(m), rcv(m); Memb_i produces joiners(s,J), leavers(s,J), members(s,J) and sends mcast(join), mcast(leave); Ord_i consumes end-slot(s), progress(s,j); Sniffer_i receives rcv(m) from Net.]

Local Algorithm Operation
• FrontEnd divides time into slots and tags messages with slots.
• Ord delivers messages by slot, in order of process indices.
• Memb determines slot membership:
  – Join, leave messages.
  – Failures:
    • Algorithm 1 uses a local failure detector.
    • Algorithm 2 uses consensus on failures.
      – Requires a new, dynamic version of consensus.
• Timing-dependent.

Architecture for Algorithm 2
[Figure: TO-QoS layered over GM and Net.]

2. Scalable Reliable Multicast Services [Livadas 01]

SRM [Floyd, et al.]
• Reliable multicast to a dynamic group.
• Built over IP multicast.
• Based on requests (NACKs) and retransmissions.
• Limits duplicate requests/retransmissions using:
  – Deterministic suppression: ancestors suppress descendants, by scheduling requests/replies based on distance to the source.
  – Probabilistic suppression: siblings suppress each other, by spreading out requests/replies.

SRM Architecture
[Figure: SRM layered over IP multicast.]

New Protocol
• Inspired by SRM.
• Assumes future losses occur on the same link (locality).
• Uses deterministic suppression for siblings.
• Elects and caches the best requestor and retransmitter:
  – Chooses the requestor closest to the source.
  – Chooses the retransmitter closest to the requestor.
  – Breaks ties with processor ids.
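SRM's two suppression mechanisms can be sketched as one timer rule: a receiver that detects a loss schedules its request at a delay drawn uniformly from an interval that scales with its distance to the source. The distance-proportional lower end gives deterministic suppression (nearer hosts always fire first); the random spread gives probabilistic suppression among equidistant hosts. The Python below is a sketch in the style of the [Floyd, et al.] timers; the constant values are illustrative assumptions.

```python
import random

def request_delay(dist_to_source, c1=2.0, c2=2.0, rng=random):
    """SRM-style request timer: uniform in [c1*d, (c1+c2)*d],
    where d estimates the one-way distance to the source.
    The c1*d floor makes farther hosts wait longer (deterministic
    suppression); the c2*d random spread staggers equidistant
    hosts (probabilistic suppression)."""
    d = dist_to_source
    return c1 * d + c2 * d * rng.random()

rng = random.Random(0)
near = request_delay(1.0, rng=rng)   # always in [2, 4]
far = request_delay(5.0, rng=rng)    # always in [10, 20]
```

Whichever host's timer fires first multicasts the request; hosts still waiting hear it and cancel, so in the good case one NACK covers the whole group.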
Best Requestor and Retransmitter
[Figure: source S, the elected requestor, and the elected retransmitter in the multicast tree.]

Performance Analysis
• Metrics:
  – Loss recovery latency: time from detection of a packet loss to receipt of the first retransmission.
  – Loss recovery overhead: number of messages multicast to recover from a message loss.
• Protocol performance benefits:
  – Removes delays caused by probabilistic suppression.
  – Following election of the requestor and retransmitter:
    • Reduces latency by using the best requestor and retransmitter.
    • Reduces overhead by using a single requestor and retransmitter.

III. Mathematical Foundations
• Incremental modeling and proof methods [Keidar, Khazan, Lynch, Shvartsman 00]
  – Proof Extension Theorem.
  – Arose in the Scalable GC work [Keidar, Khazan 00].
• Hybrid Input/Output Automata [Lynch, Segala, Vaandrager 01]
  – Model for continuous and discrete system behavior.
  – Useful for mobile computing?
• Conditional performance analysis methods
  – For analyzing communication protocols.
  – AFOSR MURI project (Berkeley).

IV. IOA Language and Tools

IOA Language and Tools
• Language for describing I/O automata: Garland, Lynch
  – Used to describe services and algorithms.
• Front end: Garland
  – Translates to Java objects.
  – Completely rewritten this year.
  – Still needs support for composition.
• Theorem-prover connection: Garland, Bogdanov
  – Connection with LP.
  – Seeking connections: SAL, Isabelle, STeP, NuPRL.

IOA Language and Tools
• Simulator: Chefter, Ramirez, Dean
  – Has support for paired simulation.
  – Needs additions.
  – Being instrumented for invariant discovery using Ernst's Daikon tool.
• Code generator: Tauber, Tsai
  – Local code generation (translation to Java) running.
  – Needs composition, communication service calls, correctness proof.
• Challenge examples.

V. Multiprocessor Memory Models [Luchangco 01]

Memory Models
• Establishes a general mathematical framework for specifying and reasoning about multiprocessor memories and the programs that use them.
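As a toy illustration of the kind of property such a framework makes precise, here is a Python check, simplified to a single given interleaving of operations (real sequential consistency quantifies over all orderings consistent with per-processor program order), that every read returns the latest value written to its location:

```python
def legal_sequential(ops):
    """Check that one interleaving of operations is legal:
    each ('read', proc, loc, val) must return the latest value
    written to loc by an earlier ('write', proc, loc, val),
    or 0 if the location was never written."""
    mem = {}
    for kind, proc, loc, val in ops:
        if kind == "write":
            mem[loc] = val
        elif mem.get(loc, 0) != val:   # read saw a stale or wrong value
            return False
    return True

# P1 writes x=1; P2 reads x=1, then writes y=2; P1 reads y=2.
trace = [("write", "P1", "x", 1), ("read", "P2", "x", 1),
         ("write", "P2", "y", 2), ("read", "P1", "y", 2)]
ok = legal_sequential(trace)
```

Weaker models such as processor consistency or release consistency admit executions for which no single legal interleaving exists, which is exactly what makes a common framework for comparing them useful.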
[Figure: processors P1, P2, …, Pn issuing reads and writes to a shared Memory.]
• Also applies to distributed shared memory.

Memory Models
• Sequentially consistent memory:
  – Operations appear to happen in some sequential order.
  – A read operation returns the latest value written to the location.
• Processor consistent memory:
  – Reads overtake writes to other locations.
  – SPARC TSO, IBM 370.
• Coherent memory.
• Memory with synchronization commands:
  – Fences, barriers, acquire/release, …
  – Release consistency, weak ordering, locking.
• Transactional memory.

Programming restrictions:
• Data-race-free (for use with weak ordering).
• Properly labelled (for use with release consistency).
• Two-phase locking.

Formal Modeling Framework
• Computation DAG:
  – Partial order model for individual executions.
  – Describes dependencies between operations.
  – Doesn't model entire programs.
[Figure: example DAG over the operations read(x), write(y,1), read(y), write(x,3), write(x,2), read(x).]
• Memory = set of computations with return values.

Results
• Results I: Uses the framework to model, classify, and compare memories and programming disciplines.
• Results II: Programming discipline + memory model ⇒ stronger memory model:
  – Completely race-free + any memory ⇒ sequential consistency.
  – Race-free under locking + locking memory ⇒ sequential consistency.
  – Two-phase locking + locking memory ⇒ serializable transactions.
• Results III: Extend the results to automaton models for programs and memory.

VI. Plans
• Finish Scalable GC, TO Mcast with QoS, SRM.
• More dynamic services:
  – Resource allocation, consensus, communication, distributed data management, location services, …
  – Services for mobile computing systems.
  – Theory: algorithms and lower bounds.
• Foundations:
  – Hybrid automata; add control theory and probability.
  – Conditional performance analysis methods.
• IOA:
  – Solidify the front end, simulator, and theorem-prover connections.
  – Finish code generation.