
CC5212-1 PROCESAMIENTO MASIVO DE DATOS (Massive Data Processing)
OTOÑO 2014 (Autumn 2014)
Aidan Hogan
[email protected]
Lecture III: 2014/03/23

Lab 1.1: Message
• New deadline: Tuesday 10am
• −1 (out of 10) for every day late after that

TYPES OF DISTRIBUTED SYSTEMS …

Client–Server Model
• Client makes request to server
• Server acts and responds
(For example: Email, WWW, Printing, etc.)

Client–Server: Three-Tier Server
[Figure: a server with three tiers (Data, Logic, Presentation). The client sends "HTTP GET: Total salary of all employees"; the data tier answers "SQL: Query salary of all employees"; the logic tier adds all the salaries; the presentation tier creates the HTML page.]

Peer-to-Peer: Unstructured
[Figure: a peer asks the network "Pixie’s new album?"]
(For example: Kazaa, Gnutella)

Peer-to-Peer: Structured (DHT)
• Circular DHT:
– Only aware of neighbours
– O(n) lookups
• Implement shortcuts
– Skips ahead
– Enables binary-search-like behaviour
– O(log(n)) lookups
[Figure: ring of eight nodes labelled 000, 001, 010, 011, 100, 101, 110, 111; a peer asks "Pixie’s new album?" and the lookup ends at node 111]

Desirable Criteria for Distributed Systems
• Transparency:
– Appears as one machine
• Flexibility:
– Supports more machines, more applications
• Reliability:
– System doesn’t fail when a machine does
• Performance:
– Quick runtimes, quick processing
• Scalability:
– Handles more machines/data efficiently

LIMITATIONS OF DISTRIBUTED SYSTEMS: EIGHT FALLACIES

Eight Fallacies
• By L. Peter Deutsch (1994)
– James Gosling (1997)
“Essentially everyone, when they first build a distributed application, makes the following eight assumptions. All prove to be false in the long run and all cause big trouble and painful learning experiences.” (L. Peter Deutsch)
• Each fallacy is a false statement!

Based on our experience, what might these fallacies of distributed computing be?

1. The network is reliable
Machines fail, connections fail, firewall eats messages
• flexible routing
• retry messages
• acknowledgements!

2. Latency is zero
There are significant communication delays
• avoid “races”
• local order ≠ remote order
• acknowledgements
• minimise remote calls
– batch data!
• avoid waiting
– multiple threads
[Figure: machines M1 and M2; messages "M2: Copy X from M1" and "M1: Store X"]

3. Bandwidth is infinite
Limited in the amount of data that can be transferred
• avoid resending data
• direct connections
• caching!!
[Figure: machines M1 and M2; the message "M1: Copy X (10GB)" appears twice]

4. The network is secure
Network is vulnerable to hackers, eavesdropping, viruses, etc.
• send sensitive data directly
• isolate hacked nodes
– hack one node ≠ hack all nodes
• authenticate messages
• secure connections
[Figure: message "M1: Send Medical History"]

5. Topology doesn’t change
How machines are physically connected may change (“churn”)!
• avoid fixed routing
– next-hop routing?
• abstract physical addresses
• flexible content structure
[Figure: machines M1 to M5; message "Message M5 thru M2, M3, M4"]

6. There is one administrator
Different machines have different policies!
• Beware of firewalls!
• Don’t assume most recent version
– Backwards compatibility

7. Transport cost is zero
It costs time/money to transport data: not just bandwidth
(Again)
• minimise redundant data transfer
– avoid shuffling data
– caching
• direct connection
• compression?

8. The network is homogeneous
Devices and connections are not uniform
• interoperability!
– Java vs. .NET?
• route for speed
– not hops
• load-balancing

Eight Fallacies (to avoid)
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn’t change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous

The severity of the fallacies varies in different scenarios!
Which fallacies apply/do not apply for:
• Gigabit ethernet LAN?
• BitTorrent
• The Web
• Laboratorio II

LIMITATIONS OF DISTRIBUTED COMPUTING: CAP THEOREM

But first … ACID
Have you heard of ACID guarantees in a database class?
For traditional (non-distributed) databases …
1. Atomicity:
– Transactions all or nothing: fail cleanly
2. Consistency:
– Doesn’t break constraints/rules
3. Isolation:
– Parallel transactions act as if sequential
4. Durability:
– System remembers changes

What is CAP?
Three guarantees a distributed system could make
1. Consistency:
– All nodes have a consistent view of the system
2. Availability:
– Every read/write is acted upon
3. Partition-tolerance:
– The system works even if messages are lost

A Distributed System (Replication)
[Figure: replicated machines]

Consistency
[Figure: two replicas, each labelled "There’s 891 users in ‘M’"]

Availability
[Figure: query "How many users start with ‘M’"; response "891"]

Partition-Tolerance
[Figure: query "How many users start with ‘M’"; response "891"]

The CAP Question
Can a distributed system guarantee consistency (all nodes have the same up-to-date view), availability (every read/write is acted upon) and partition-tolerance (the system works even if messages are lost) at the same time?

What do you think?
Can a distributed system guarantee consistency and availability and partition-tolerance at the same time, or not?

The CAP Answer

The CAP “Proof”
[Figure: query "How many users start with ‘M’"; replica labels "There’s 891 users in ‘M’" (twice); values shown: 891, 892]

The CAP “Proof” (in boring words)
• Consider machines m1 and m2 on either side of a partition:
– If an update is allowed on m2 (Availability), then m1 cannot see the change (loses Consistency)
– To make sure that m1 and m2 have the same, up-to-date view (Consistency), neither m1 nor m2 can accept any requests/updates (loses Availability)
– Thus, only when m1 and m2 can communicate (loses Partition-tolerance) can Availability and Consistency be guaranteed

The CAP Theorem
A distributed system cannot guarantee consistency (all nodes have the same up-to-date view), availability (every read/write is acted upon) and partition-tolerance (the system works even if messages are lost) at the same time.
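The m1/m2 argument from the “proof” can be played out in a few lines of code. The following is a minimal sketch, not part of the original slides: the class names (`Replica`, `TwoReplicaSystem`, `Unavailable`), the `mode` switch and the `partitioned` flag are all hypothetical, and the model assumes that only writes are refused in CP mode (reads stay consistent because no write gets through).

```python
class Unavailable(Exception):
    """Raised when the system refuses a request to stay consistent."""


class Replica:
    def __init__(self, name, count):
        self.name = name
        self.count = count  # e.g. number of users starting with 'M'


class TwoReplicaSystem:
    """Hypothetical two-machine system; mode is "AP" or "CP"."""

    def __init__(self, mode, count=891):
        if mode not in ("AP", "CP"):
            raise ValueError("mode must be 'AP' or 'CP'")
        self.mode = mode
        self.m1 = Replica("m1", count)
        self.m2 = Replica("m2", count)
        self.partitioned = False  # True: m1 and m2 cannot exchange messages

    def add_user_at_m2(self):
        """An update arriving at m2."""
        if self.partitioned and self.mode == "CP":
            # Keep consistency: refuse the update (loses availability).
            raise Unavailable("m2 cannot reach m1; update refused")
        self.m2.count += 1
        if not self.partitioned:
            # Network is fine: replicate the change to m1.
            self.m1.count = self.m2.count
        # Otherwise (AP under partition) m1 is now stale (loses consistency).

    def read_from_m1(self):
        """A read arriving at m1."""
        return self.m1.count
```

With the partition in place, the AP variant accepts the update at m2 but m1 still answers with the old count, while the CP variant keeps both counts equal by rejecting the update. Only without a partition does the update succeed and become visible on both machines.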
(“Proof” as shown on previous slide)

The CAP Triangle
[Figure: triangle with corners C, A and P; caption "Choose Two"]

CAP Systems
CA: Guarantees to give a correct response but only while network works fine (Centralised / Traditional)
CP: Guarantees responses are correct even if there are network failures, but response may fail (Weak availability)
AP: Always provides a “best-effort” response even in presence of network failures (Eventual consistency)
[Figure: diagram of C, A and P with the centre marked "(No intersection)"]

CA System
[Figure: query "How many users start with ‘M’"; replica labels "There’s 892 users in ‘M’" and "There’s 891 users in ‘M’"; values shown: 891, 892]

CP System
[Figure: query "How many users start with ‘M’"; replica labels "There’s 891 users in ‘M’" (twice); value shown: 891]

AP System
[Figure: query "How many users start with ‘M’"; replica labels "There’s 891 users in ‘M’" (twice); values shown: 891, 892]

BASE (AP)
• Basically Available
– Pretty much always “up”
• Soft State
– Replicated, cached data
• Eventual Consistency
– Stale data tolerated, for a while
• Amazon, eBay, Google, DNS …

The CAP Theorem
• C,A in CAP ≠ C,A in ACID
• Simplified model
– Partitions are rare
– Systems may be a mix of CA/CP/AP
– C/A/P often continuous in reality!
• But concept useful/frequently discussed:
– How to handle Partitions?
• Availability? or
• Consistency?

LABS PREP: AIDAN LEARNS SPANISH

Help me learn Spanish!
What are the top 500 most common words in Spanish?
Word Count

Help me learn Spanish!
How should we design the distributed system? (for now it will be in-memory)
• How can we distribute the word count?
• How can we call the machines / send the data?
• How can we merge the word counts?
• How to implement in the lab?

RECAP

Distributed Systems have limitations
• Eight fallacies and what they mean
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn’t change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous

Distributed Systems have limitations
CAP Theorem
A distributed system cannot guarantee consistency (all nodes have the same up-to-date view and will give a correct answer), availability (every request is acted upon) and partition-tolerance (the system works even if messages are lost) at the same time.

CAP Systems
CA: Guarantees to give a correct response but only while network works fine (Centralised / Traditional)
CP: Guarantees responses are correct even if there are network failures, but response may fail (Weak availability)
AP: Always provides a “best-effort” response even in presence of network failures (Eventual consistency)
[Figure: diagram of C, A and P with the centre marked "(No intersection)"]

Design of a Distributed Algorithm
• How to distribute/split data for processing
• Embarrassingly parallel execution
• How to merge data (naively for now)
• How to help me learn Spanish

Questions?
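The split / count / merge design for the word count can be sketched in memory as follows. This is an illustrative sketch under assumptions, not the lab solution: threads stand in for the machines, lines are dealt out round-robin, and the function names (`split_lines`, `count_words`, `merge_counts`, `top_words`) are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_lines(lines, n_workers):
    """Distribute: deal the lines round-robin into n_workers chunks."""
    return [lines[i::n_workers] for i in range(n_workers)]


def count_words(chunk):
    """Work done independently on each 'machine' (embarrassingly parallel)."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_counts(partials):
    """Naive merge: add up every partial count on one machine."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


def top_words(lines, k, n_workers=4):
    """Return the k most common words as (word, count) pairs."""
    chunks = split_lines(lines, n_workers)
    # Threads are a stand-in for remote machines in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_words, chunks))
    return merge_counts(partials).most_common(k)
```

Calling `top_words(lines, 500)` on the lines of a Spanish corpus would give the 500 most common words. The merge step here runs on a single machine, which is the naive part: it becomes the bottleneck as the number of distinct words grows.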
