Programming Model and Protocols for Reconfigurable Distributed Systems
COSMIN IONEL ARAD
Doctoral Thesis Defense, 5th June 2013
KTH Royal Institute of Technology
https://www.kth.se/profile/icarad/page/doctoral-thesis/

Presentation Overview
• Context, motivation, and thesis goals
• Kompics introduction & design philosophy
• Distributed abstractions & P2P framework
• Component execution & scheduling
• Distributed systems experimentation
  – Development cycle: build, test, debug, deploy
• CATS: a scalable & consistent key-value store
  – System architecture and testing using Kompics
  – Scalability, elasticity, and performance evaluation
• Conclusions

Trend 1: Computer systems are increasingly distributed
• For fault-tolerance
  – E.g.: replicated state machines
• For scalability
  – E.g.: distributed databases
• Due to inherent geographic distribution
  – E.g.: content distribution networks

Trend 2: Distributed systems are increasingly complex
connection management, location and routing, failure detection, recovery, data persistence, load balancing, scheduling, self-optimization, access control, monitoring, garbage collection, encryption, compression, concurrency control, topology maintenance, bootstrapping, ...

Trend 3: Modern hardware is increasingly parallel
• Multi-core and many-core processors
• Concurrent/parallel software is needed to leverage hardware parallelism
• Major software concurrency models
  – Message-passing concurrency
    • Data-flow concurrency viewed as a special case
  – Shared-state concurrency

Distributed systems are still hard…
• … to implement, test, and debug
• Sequential sorting is easy
  – Even for a first-year computer science student
• Distributed consensus is hard
  – Even for an experienced practitioner having all the necessary expertise

Experience from building Chubby, Google's lock service, using Paxos
"The fault-tolerance computing community has not developed the tools to make it easy to implement their algorithms. The fault-tolerance computing community has not paid enough attention to testing, a key ingredient for building fault-tolerant systems." [Paxos Made Live]
– Tushar Deepak Chandra, Edsger W. Dijkstra Prize in Distributed Computing 2010

A call to action
"It appears that the fault-tolerant distributed computing community has not developed the tools and know-how to close the gaps between theory and practice with the same vigor as for instance the compiler community. Our experience suggests that these gaps are non-trivial and that they merit attention by the research community." [Paxos Made Live]
– Tushar Deepak Chandra, Edsger W. Dijkstra Prize in Distributed Computing 2010
Thesis Goals
• Raise the level of abstraction in programming distributed systems
• Make it easy to implement, test, debug, and evaluate distributed systems
• Attempt to bridge the gap between the theory and the practice of fault-tolerant distributed computing

We want to build distributed systems by composing distributed protocols, implemented as reactive, concurrent components, with asynchronous communication and message-passing concurrency.
[Figures: every node in the system is modeled as a stack of components (Application, Consensus, Broadcast, Failure Detector, Network, Timer), and many such nodes are composed into one system.]

Design principles
• Tackle increasing system complexity through abstraction and hierarchical composition
• Decouple components from each other
  – publish-subscribe component interaction
  – dynamic reconfiguration for always-on systems
• Decouple component code from its executor
  – same code executed in different modes: production deployment, interactive stress testing, deterministic simulation for replay debugging

Nested hierarchical composition
• Model entire sub-systems as first-class composite components
  – Richer architectural patterns
• Tackle system complexity
  – Hiding implementation details
  – Isolation
• Natural fit for developing distributed systems
  – Virtual nodes
  – Model the entire system: each node as a component

Message-passing concurrency
• Compositional concurrency
  – Free from the idiosyncrasies of locks and threads
• Easy to reason about
  – Many concurrency formalisms: the Actor model (1973), CSP (1978), CCS (1980), π-calculus (1992)
• Easy to program
  – See the success of Erlang, Go, Rust, Akka, ...
• Scales well on multi-core hardware
  – Almost all modern hardware

Loose coupling
• "Where ignorance is bliss, 'tis folly to be wise."
  – Thomas Gray, Ode on a Distant Prospect of Eton College (1742)
• Communication integrity
  – Law of Demeter
• Publish-subscribe communication
• Dynamic reconfiguration

Design philosophy
1. Nested hierarchical composition
2. Message-passing concurrency (see the sketch below)
3. Loose coupling
4. Multiple execution modes
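To make the message-passing component style above concrete, here is a minimal, self-contained Java sketch of a component with its own mailbox, publish-subscribe handler subscriptions, and asynchronous event triggering. This only illustrates the concepts; it is not the Kompics API, and all names (Component, subscribe, trigger, Ping, Pong) are invented for the sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Marker type for events exchanged between components.
interface Event {}

// A component owns a private mailbox and a set of event handlers; it shares no state.
abstract class Component implements Runnable {
    private final BlockingQueue<Event> mailbox = new LinkedBlockingQueue<>();
    private final Map<Class<?>, Consumer<Event>> handlers = new HashMap<>();

    // Subscribe a handler for a given event type (publish-subscribe style).
    protected <E extends Event> void subscribe(Class<E> type, Consumer<E> handler) {
        handlers.put(type, e -> handler.accept(type.cast(e)));
    }

    // Asynchronously deliver an event to this component's mailbox.
    public void trigger(Event e) {
        mailbox.add(e);
    }

    // Event loop: handle one event at a time, so handlers never need locks.
    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                Event e = mailbox.take();
                Consumer<Event> h = handlers.get(e.getClass());
                if (h != null) h.accept(e);
            }
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
    }
}

// Example events and components: a Ping is answered with a Pong.
final class Ping implements Event { final Component replyTo; Ping(Component replyTo) { this.replyTo = replyTo; } }
final class Pong implements Event {}

class Ponger extends Component {
    Ponger() { subscribe(Ping.class, ping -> ping.replyTo.trigger(new Pong())); }
}

class Pinger extends Component {
    Pinger(Component ponger) {
        subscribe(Pong.class, pong -> System.out.println("got pong"));
        ponger.trigger(new Ping(this));
    }
}

public class Demo {
    // Wire up a Ponger and a Pinger, each on its own thread; runs until interrupted.
    public static void main(String[] args) {
        Ponger ponger = new Ponger();
        new Thread(ponger).start();
        new Thread(new Pinger(ponger)).start();
    }
}
```

Because each component processes one event at a time from its own mailbox, handlers are free of locks, and independent components can be scheduled across however many cores are available, which is what the design principles above rely on.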
Component Model
• Event
• Port
• Component
• Channel
• Handler
• Subscription
• Publication / event trigger
[Figures: components publish (trigger) events on ports; channels connect ports; handlers subscribed to a port are executed when matching events arrive on it.]

A simple distributed system
[Figure: two processes, Process1 and Process2, each composed of an Application component on top of a Network component. The Application handlers exchange Ping and Pong events, which the Network components deliver as Messages.]

A Failure Detector Abstraction using a Network and a Timer Abstraction
[Figure: a Ping Failure Detector component provides the Eventually Perfect Failure Detector port (events: Suspect, Restore, StartMonitoring, StopMonitoring) and requires Network and Timer ports, provided by MyNetwork and MyTimer.]
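The failure-detector slide above composes a Ping Failure Detector from a Network and a Timer abstraction and exposes Suspect and Restore indications. The framework-free Java sketch below shows one plausible shape of that logic (ping periodically, suspect on a missed deadline, restore and increase the timeout on a late reply); the Network and Listener interfaces and all names here are assumptions made for the sketch, not the thesis code.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.*;

// Minimal sketch of a ping-based, eventually perfect failure detector:
// periodically ping monitored nodes, suspect those that miss their deadline,
// and restore them (with a larger timeout) if a late pong arrives.
class PingFailureDetector {
    interface Network { void sendPing(String node); }                       // assumed Network abstraction
    interface Listener { void suspect(String node); void restore(String node); }

    private final Network net;
    private final Listener listener;
    private final Map<String, Long> timeoutMs = new ConcurrentHashMap<>();
    private final Set<String> alive = ConcurrentHashMap.newKeySet();
    private final Set<String> suspected = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    PingFailureDetector(Network net, Listener listener) {
        this.net = net;
        this.listener = listener;
    }

    // StartMonitoring: begin the periodic ping/check cycle for a node.
    void startMonitoring(String node, long initialTimeoutMs) {
        timeoutMs.put(node, initialTimeoutMs);
        schedule(node);
    }

    private void schedule(String node) {
        alive.remove(node);                     // must be confirmed again by a pong
        net.sendPing(node);
        timer.schedule(() -> check(node), timeoutMs.get(node), TimeUnit.MILLISECONDS);
    }

    // Called by the network layer when a pong from 'node' is delivered.
    void onPong(String node) {
        alive.add(node);
        if (suspected.remove(node)) {
            timeoutMs.computeIfPresent(node, (n, t) -> t * 2);  // false suspicion: be more patient
            listener.restore(node);                             // Restore indication
        }
    }

    private void check(String node) {
        if (!alive.contains(node) && suspected.add(node)) {
            listener.suspect(node);             // Suspect indication
        }
        schedule(node);                         // keep monitoring
    }
}
```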
A Leader Election Abstraction using a Failure Detector Abstraction
[Figure: an Ω Leader Elector component provides the Leader Election port (event: Leader) and requires the Eventually Perfect Failure Detector port, provided by the Ping Failure Detector.]

A Reliable Broadcast Abstraction using a Best-Effort Broadcast Abstraction
[Figure: a Reliable Broadcast component provides a Broadcast port (events: RbBroadcast, RbDeliver) and requires the Broadcast port of a Best-Effort Broadcast component (events: BebBroadcast, BebDeliver), which in turn requires Network.]

A Consensus Abstraction using a Broadcast, a Network, and a Leader Election Abstraction
[Figure: a Paxos Consensus component provides the Consensus port and requires Broadcast, Network, and Leader Election ports, provided by Best-Effort Broadcast, MyNetwork, and the Ω Leader Elector.]

A Shared Memory Abstraction
[Figure: an ABD component provides the Atomic Register port (events: ReadRequest, ReadResponse, WriteRequest, WriteResponse) and requires Broadcast and Network ports, provided by Best-Effort Broadcast and MyNetwork.]

A Replicated State Machine using a Total-Order Broadcast Abstraction
[Figure: a State Machine Replication component provides the Replicated State Machine port (events: Execute, Output) and requires Total-Order Broadcast (events: TobBroadcast, TobDeliver), provided by Uniform Total-Order Broadcast, which requires Consensus (events: Propose, Decide), provided by Paxos Consensus.]

Probabilistic Broadcast and Topology Maintenance Abstractions using a Peer Sampling Abstraction
[Figure: Epidemic Dissemination provides Probabilistic Broadcast and T-Man provides Topology; both require Network and Peer Sampling, the latter provided by a Cyclon Random Overlay that requires Network and Timer.]

A Structured Overlay Network implements a Distributed Hash Table
[Figure: a Structured Overlay Network composite provides the Distributed Hash Table and Overlay Router ports; inside, a Consistent Hashing Ring Topology (Chord Periodic Stabilization) and a One-Hop Router build on Peer Sampling (Cyclon Random Overlay), a Failure Detector (Ping Failure Detector), Network, and Timer.]

A Video on Demand Service using a Content Distribution Network and a Gradient Topology Overlay
[Figure: a Video On-Demand composite uses a Content Distribution Network (BitTorrent with Tracker, Peer Exchange, Distributed Tracker, Centralized Tracker Client, Distributed Hash Table, Peer Sampling) and a Gradient Topology (Gradient Overlay over Peer Sampling), on top of Network and Timer.]

Generic Bootstrap and Monitoring Services provided by the Kompics Peer-to-Peer Protocol Framework
[Figure: PeerMain, BootstrapServerMain, and MonitorServerMain each compose a MyWebServer (providing Web) with a Peer, BootstrapServer, or MonitorServer component, respectively, on top of MyNetwork (Network) and MyTimer (Timer).]

Whole-System Repeatable Simulation
[Figure: the entire system executes under a Deterministic Simulation Scheduler, driven by a Network Model and an Experiment Scenario.]

Experiment scenario DSL
• Define parameterized scenario events
  – Node failures, joins, system requests, operations
• Define "stochastic processes"
  – Finite sequence of scenario events
  – Specify distribution of event inter-arrival times
  – Specify type and number of events in sequence
  – Specify distribution of each event parameter value
• Scenario: composition of "stochastic processes"
  – Sequential, parallel (see the sketch below)
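As a rough illustration of what such a scenario composition expresses, the hypothetical Java sketch below generates two "stochastic processes" composed sequentially: a churn phase of node joins followed by a burst of put operations, each with its own event count and exponential inter-arrival distribution. None of these classes belong to the Kompics simulation DSL; they only mirror the concepts listed above, and the fixed seed stands in for the repeatability of a simulated run.

```java
import java.util.Random;
import java.util.function.LongSupplier;

// Hypothetical stand-in for a scenario definition: it prints the schedule of
// events that the two composed "stochastic processes" would inject.
class ScenarioSketch {
    static final Random rng = new Random(42);          // fixed seed => repeatable schedule

    // Exponentially distributed inter-arrival times with the given mean (ms).
    static LongSupplier exponential(double meanMs) {
        return () -> (long) (-meanMs * Math.log(1.0 - rng.nextDouble()));
    }

    // One "stochastic process": 'count' events of one type with a given
    // inter-arrival distribution, starting at startMs; returns its end time.
    static long emit(String eventType, int count, LongSupplier interArrivalMs, long startMs) {
        long now = startMs;
        for (int i = 0; i < count; i++) {
            now += interArrivalMs.getAsLong();
            System.out.printf("t=%7d ms  %s #%d%n", now, eventType, i);
        }
        return now;
    }

    public static void main(String[] args) {
        // Process 1: 100 node joins, arriving on average every 500 ms.
        long endOfChurn = emit("join", 100, exponential(500), 0);
        // Process 2, composed sequentially: 1000 puts, on average every 10 ms.
        emit("put", 1000, exponential(10), endOfChurn);
    }
}
```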
Local Interactive Stress Testing
[Figure: the same system components run under a Work-Stealing Multi-Core Scheduler, driven by a Network Model and an Experiment Scenario.]

Kompics execution profiles
• Distributed production deployment
  – One distributed system node per OS process
  – Multi-core component scheduler (work stealing)
• Local / distributed stress testing
  – Entire distributed system in one OS process
  – Interactive stress testing, multi-core scheduler
• Local repeatable whole-system simulation
  – Deterministic simulation component scheduler
  – Correctness testing, stepped / replay debugging

Incremental Development & Testing
• Define emulated network topologies
  – processes and their addresses: <id, IP, port>
  – properties of links between processes: latency (ms), loss rate (%)
• Define small-scale execution scenarios
  – the sequence of service requests initiated by each process in the distributed system
• Experiment with various topologies / scenarios
  – Launch all processes locally on one machine

Distributed System Launcher
[Screenshot: the script of service requests for each process is shown in one pane; after the Application completes the script, it can process further commands entered interactively.]

Programming in the Large
• Events and ports are interfaces
  – service abstractions, packaged together as libraries
• Components are implementations
  – provide or require interfaces
  – dependencies on provided / required interfaces are expressed as library dependencies [Apache Maven]
  – multiple implementations of an interface live in separate libraries
  – deploy-time composition

Kompics Scala, by Lars Kroll

Kompics Python, by Niklas Ekström

Case study: CATS, a Scalable, Self-Managing Key-Value Store with Atomic Consistency and Partition Tolerance

Key-Value Store?
• Store.Put(key, value) returns OK   [write]
• Store.Get(key) returns value   [read]
• Example: Put("www.sics.se", "193.10.64.51") returns OK; Get("www.sics.se") returns "193.10.64.51"

Consistent Hashing
• Incremental scalability
• Self-organization
• Simplicity
• Used by Dynamo and Project Voldemort
[Figure: keys and nodes are hashed onto a ring; each node is responsible for a range of keys.]

Single client, single server
[Figure: the client's Put(X, 1) is acknowledged with Ack(X), and a subsequent Get(X) returns 1; the server's state goes from X=0 to X=1.]

Multiple clients, multiple servers
[Figure: Client 1 issues Put(X, 1) and receives Ack(X); Client 2's Get(X) operations return 0 from one replica and 1 from another, because the replicas apply the update at different times.]

Atomic Consistency, informally
• put/get operations appear to occur instantaneously
• Once a put(key, newValue) completes
  – the new value is immediately visible to all readers
  – each get returns the value of the last completed put
• Once a get(key) returns a new value
  – no other get may return an older, stale value

CATS Node architecture
[Figure: the CATS Node is a composite component providing Web and Distributed Hash Table ports. Internally it composes a CATS Web Application, Load Balancer, Aggregation, Status Monitor, Overlay Router / One-Hop Router, Consistent Hashing Ring / Ring Topology, Epidemic Dissemination (Broadcast), Peer Sampling (Cyclon Random Overlay), Ping Failure Detector, Bootstrap Client, Reconfiguration Coordinator, Operation Coordinator, Replication / Group Member, Bulk Data Transfer, Garbage Collector, and a Local Store with Persistent Storage, on top of Network and Timer.]

Simulation and Stress Testing
[Figure: many CATS Node instances are hosted either in CATS Simulation Main (CATS Simulator under a Simulation Scheduler, with a Discrete-Event Simulator, Network Model, and Experiment Scenario) or in CATS Stress Testing Main (CATS Simulator under a Multi-core Scheduler, with a Generic Orchestrator, Network Model, and Experiment Scenario).]

Example Experiment Scenario
[Screenshot: an example experiment scenario definition.]

Reconfiguration Protocols Testing and Debugging
• Use whole-system repeatable simulation
• Protocol correctness testing
  – Each experiment scenario is a unit test
• Regression test suite
  – Covered all "types" of churn scenarios
  – Tested each scenario for 1 million RNG seeds
• Debugging
  – Global state snapshot on every change
  – Traverse state snapshots forward and backward in time
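The regression-testing approach above relies on a whole simulated run being a deterministic function of a single RNG seed. The toy Java sketch below (not the CATS test harness; all names are invented) shows the seed-sweep idea: run the same simulated scenario under many seeds, check an invariant after each run, and report any failing seed so that exact run can be replayed deterministically for debugging.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Random;

// Illustrative seed sweep: each scenario run doubles as a unit test.
class SeedSweep {
    record Delivery(long time, int msgId) {}

    // Toy whole-system simulation: deliver 100 messages with random delays drawn
    // only from 'rng', and check a (trivial) safety invariant over the schedule.
    static boolean runScenario(long seed) {
        Random rng = new Random(seed);                       // sole source of nondeterminism
        PriorityQueue<Delivery> eventQueue =
                new PriorityQueue<>(Comparator.comparingLong(Delivery::time));
        for (int id = 0; id < 100; id++) {
            eventQueue.add(new Delivery(rng.nextInt(1000), id));
        }
        long lastTime = 0;
        while (!eventQueue.isEmpty()) {
            Delivery d = eventQueue.poll();
            if (d.time() < lastTime) return false;           // simulated time must not go backwards
            lastTime = d.time();
        }
        return true;
    }

    public static void main(String[] args) {
        for (long seed = 0; seed < 1_000_000; seed++) {
            if (!runScenario(seed)) {
                System.out.println("invariant violated; replay deterministically with seed " + seed);
                return;
            }
        }
        System.out.println("invariant held for all seeds");
    }
}
```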
Global State Snapshot
[Screenshot: a global state snapshot taken after node 25 has joined the ring.]

Snapshot During Reconfiguration
[Screenshot: a global state snapshot taken while a replication-group reconfiguration is in progress.]

Reconfiguration Completed OK
Distributed systems debugging done right!

CATS Architecture for Distributed Production Deployment

Demo: SICS Cluster Deployment
[Screenshots: an interactive Put operation and an interactive Get operation issued against the deployed cluster.]

CATS Architecture for Production Deployment and Performance Evaluation
[Figure: CATS Peer Main runs a CATS Node behind a Jetty Web Server (Web port) and exposes the Distributed Hash Table port; CATS Client Main runs the YCSB Benchmark application against a CATS Client; Bootstrap Server Main runs the CATS Bootstrap Server; all use Grizzly Network and MyTimer for the Network and Timer ports.]

Experimental Setup
• 128 Rackspace Cloud virtual machines
  – 16 GB RAM, 4 virtual cores
  – 1 client for every 3 servers
• Yahoo! Cloud Serving Benchmark (YCSB)
  – Read-intensive workload: 95% reads, 5% writes
  – Write-intensive workload: 50% reads, 50% writes
• CATS nodes equally distanced on the ring
  – Avoids load imbalance

Performance (50% reads, 50% writes)
Performance (95% reads, 5% writes)
Scalability (50% reads, 50% writes)
Scalability (95% reads, 5% writes)
Elasticity (read-only workload)
  * Experiment ran on SICS cloud machines [1 YCSB client, 32 threads]
Overheads (50% reads, 50% writes): 24%
Overheads (95% reads, 5% writes): 4%
CATS vs Cassandra (50% reads, 50% writes)
CATS vs Cassandra (95% reads, 5% writes)
[Throughput/latency graphs for each of the above.]

Summary
• CATS provides atomic data consistency, scalability, elasticity, decentralization, network partition tolerance, self-organization, and fault tolerance
• Atomic data consistency is affordable!

Related work
• Dynamo [SOSP'07], Cassandra, Riak, Voldemort (key-value stores)
  – scalable, not consistent
• Chubby [OSDI'06], ZooKeeper (meta-data stores)
  – consistent, not scalable, not auto-reconfigurable
• RAMBO [DISC'02], RAMBO II [DSN'03], SMART [EuroSys'06], RDS [JPDC'09], DynaStore [JACM'11] (replication systems)
  – reconfigurable, consistent, not scalable
• Scatter [SOSP'11]
  – scalable and linearizable DHT
  – reconfiguration needs distributed transactions

Kompics is practical
• scalable key-value stores
• structured overlay networks
• gossip-based protocols
• peer-to-peer media streaming
• video-on-demand systems
• NAT-aware peer-sampling services
• Teaching: broadcast, concurrent objects, consensus, replicated state machines, etc.

Related work
• Component models and ADLs: Fractal, OpenCom, ArchJava, ComponentJ, …
  – blocking interface calls vs. message passing
• Protocol composition frameworks: x-Kernel, Ensemble, Horus, Appia, Bast, Live Objects, …
  – static, layered vs. dynamic, hierarchical composition
• Actor models: Erlang, Kilim, Scala, Unix pipes
  – flat / stacked vs. hierarchical architecture
• Process calculi: π-calculus, CCS, CSP, Oz/K
  – synchronous vs. asynchronous message passing

Summary
• Message-passing, hierarchical component model facilitating concurrent programming
• Good for distributed abstractions and systems
• Multi-core hardware exploited for free
• Hot upgrades by dynamic reconfiguration
• Same code used in production deployment, deterministic simulation, and local execution
• DSL to specify complex simulation scenarios
• Battle-tested in many distributed systems
Acknowledgements
• Seif Haridi
• Jim Dowling
• Tallat M. Shafaat
• Muhammad Ehsan ul Haque
• Frej Drejhammar
• Lars Kroll
• Niklas Ekström
• Alexandru Ormenișan
• Hamidreza Afzali

http://kompics.sics.se/

BACKUP SLIDES

Sequential consistency
• A concurrent execution is sequentially consistent if there is a sequential way to reorder the client operations such that:
  – (1) it respects the semantics of the objects, as defined by their sequential specification
  – (2) it respects the order of operations at the client that issued the operations

Linearizability
• A concurrent execution is linearizable if there is a sequential way to reorder the client operations such that:
  – (1) it respects the semantics of the objects, as defined by their sequential specification
  – (2) it respects the order of non-overlapping operations among all clients

Consistency: naïve solution
• Replicas act as a distributed shared-memory register
[Figure: keys 20, 30, 35, 40, 45 stored on a ring of replicas r1, r2, r3.]

The problem: asynchrony
• It is impossible to accurately detect process failures

Incorrect failure detection can lead to non-intersecting quorums
[Figure: a ring segment with nodes 42, 48, 50, 52, 60, key 45, and their successor/predecessor pointers.]
• 50 thinks 48 has failed
• 48 thinks the replication group for (42, 48] is {48, 50, 52}
• 50 thinks the replication group for (42, 48] is {50, 52, 60}
• PUT(45) may contact majority quorum {48, 50}
• GET(45) may contact majority quorum {52, 60}
• The two quorums do not intersect, so the GET may miss the PUT

Solution: Consistent Quorums
• A consistent quorum is a quorum of nodes that are in the same view when the quorum is assembled
  – Maintain a consistent view of replication group membership
  – Modified Paxos using consistent quorums
  – Essentially a reconfigurable RSM (state == view)
• Modified ABD using consistent quorums
  – Dynamic linearizable read-write register

Guarantees
• Concurrent reconfigurations are applied at every node in a total order
• For every replication group, any two consistent quorums always intersect
  – whether in the same view, consecutive views, or non-consecutive views
• In a partially synchronous system, reconfigurations and operations terminate once network partitions cease
• Consistent mapping from key ranges to replication groups
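As a rough illustration of the consistent-quorum idea above, the Java sketch below (illustrative only, not the CATS implementation; the Reply record and tryRead method are invented for this example) accepts a read only if a majority of the replication group answered from the same view, and then returns the value with the highest timestamp within that quorum, in the style of ABD reads.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// A quorum is used only if a majority of the replication group replied from the
// same view; mixing replies from different views could miss a completed write.
class ConsistentQuorumRead {
    record Reply(String node, long viewId, long timestamp, String value) {}

    static Optional<String> tryRead(List<Reply> replies, int replicationGroupSize) {
        int majority = replicationGroupSize / 2 + 1;
        // Group replies by the view their sender was in when it answered.
        Map<Long, List<Reply>> byView =
                replies.stream().collect(Collectors.groupingBy(Reply::viewId));
        return byView.values().stream()
                .filter(sameView -> sameView.size() >= majority)   // consistent quorum found
                .findFirst()
                .map(quorum -> quorum.stream()
                        .max(Comparator.comparingLong(Reply::timestamp))
                        .get()
                        .value());
        // If no single view holds a majority of the replies, the caller retries
        // rather than combining replies from different views.
    }
}
```

In the same sketch, a write would complete only after a write quorum drawn from a single view acknowledges the new timestamp; if no one view covers a majority, the operation would be retried once the reconfiguration settles.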