Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Maelstrom Ricochet Conclusion Reliable Communication for Datacenters Mahesh Balakrishnan Cornell University Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Datacenters I Internet Services (90s) — Websites, Search, Online Stores I Since then: # of low-end volume servers 30 Installed Server Base 00-05: Millions 25 20 15 10 5 I Commodity — up by 100% I High/Mid — down by 40% 0 2000 2001 2002 2003 2004 2005 I Today: Datacenters are ubiquitous I How have they evolved? Data partially sourced from IDC press releases (www.idc.com) Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Networks of Datacenters RTT: 110 ms Why? Business Continuity, Client Locality, Distributed Datasets or Operations ... Any modern enterprise! 100 ms 200 ms 220 ms 100 ms 110 ms 210 ms N W E S Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Networks of Real-Time Datacenters I Finance, Aerospace, Military, Search and Rescue... I ... documents, chat, email, games, videos, photos, blogs, social networks I The Datacenter is the Computer! I Not hard real-time: real fast, highly responsive, time-critical Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Networks of Real-Time Datacenters I Finance, Aerospace, Military, Search and Rescue... I ... documents, chat, email, games, videos, photos, blogs, social networks I The Datacenter is the Computer! I Not hard real-time: real fast, highly responsive, time-critical Gartner Survey: I Real-Time Infrastructure (RTI): reaction time in secs/mins I 73%: RTI is important or very important I 85%: Have no RTI capability Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion The Real-Time Datacenter — Systems Challenges How do we recover from failures within seconds? ld, or e -W m al l-Ti e R ea R Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion The Real-Time Datacenter — Systems Challenges Software Stack How do we recover from failures within seconds? Crashes Overloads Bugs Exploits Disk Failure, Disasters Packet Loss Compilers, Databases, Distributed Systems, Machine Learning, Filesystems, Operating Systems, Networking ... ld, or e -W m al l-Ti e R ea R Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion The Real-Time Datacenter — Systems Challenges Software Stack How do we recover from failures within seconds? Crashes Overloads Bugs Exploits Tempest KyotoFS Disk Failure, Disasters Packet Loss Compilers, Databases, Distributed Systems, Machine Learning, Filesystems, Operating Systems, Networking ... SMFS Plato Ricochet Maelstrom ld, or e -W m al l-Ti e R ea R Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Reliable Communication Goal: Recover lost packets fast! I Existing protocols react to loss: too much, too late I We want proactive recovery: stable overhead, low latencies I Maelstrom: Reliability between datacenters [NSDI 2008] I Ricochet: Reliability within datacenters [NSDI 2007] Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Reliable Communication between Datacenters TCP fails in three ways: 1. Throughput Collapse 100ms RTT, 0.1% Loss, 40 Gbps → Tput < 10 Mbps! 2. Massive Buffers required for High-Rate Traffic 3. Recovery Delays for Time-Critical Traffic Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Reliable Communication between Datacenters TCP fails in three ways: 1. Throughput Collapse 100ms RTT, 0.1% Loss, 40 Gbps → Tput < 10 Mbps! 2. Massive Buffers required for High-Rate Traffic 3. Recovery Delays for Time-Critical Traffic Current Solutions: I Rewrite Apps: One Flow → Multiple Split Flows I Resize Buffers I Spend (infinite) money! Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion TeraGrid: Supercomputer Network Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion TeraGrid: Supercomputer Network I End-to-End UDP Probes: Zero Congestion, Non-Zero Loss! Possible Reasons: I I I I I I I I transient congestion degraded fiber malfunctioning HW misconfigured HW switching contention low receiver power end-host overflow ... 30 25 % of Measurements I 24 20 15 14 10 5 0 0.01 0.03 0.05 0.07 0.1 0.3 0.5 % of Lost Packets Mahesh Balakrishnan Reliable Communication for Datacenters 0.7 1 Maelstrom Ricochet Conclusion TeraGrid: Supercomputer Network I End-to-End UDP Probes: Zero Congestion, Non-Zero Loss! Possible Reasons: I I I I I I I I 30 25 % of Measurements I transient congestion degraded fiber malfunctioning HW misconfigured HW switching contention low receiver power end-host overflow ... 24 20 15 14 10 5 0 0.01 0.03 0.05 0.07 0.1 0.3 0.5 0.7 1 % of Lost Packets Electronics: Cluttered Pathways Mahesh Balakrishnan Optics: Lossy Fiber Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Problem Statement Run unmodified TCP/IP over lossy high-speed long-distance networks Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion The Maelstrom Network Appliance Router Sending End-hosts Commodity TCP Router Packet Loss Mahesh Balakrishnan Receiving End-hosts Commodity TCP Reliable Communication for Datacenters Maelstrom Ricochet Conclusion The Maelstrom Network Appliance Sending End-hosts Commodity TCP Maelstrom Send-Side Appliance Maelstrom Receive-Side Appliance Router Router Packet Loss Receiving End-hosts Commodity TCP Transparent: No modification to end-host or network Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion The Maelstrom Network Appliance Maelstrom Send-Side Appliance Maelstrom Receive-Side Appliance FEC Sending End-hosts Commodity TCP Encode Decode Router Router Packet Loss Receiving End-hosts Commodity TCP Transparent: No modification to end-host or network FEC = Forward Error Correction Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion What is FEC? A B C D E X X X 3 repair packets from every 5 data packets C D E A B Receiver can recover from any 3 lost packets Rate : (r , c) — c repair packets for every r data packets. I Pro: Recovery Latency independent of RTT I Constant Data Overhead: I Packet-level FEC at End-hosts: Inexpensive, No extra HW Mahesh Balakrishnan c r +c Reliable Communication for Datacenters Maelstrom Ricochet Conclusion What is FEC? A B C D E X X X 3 repair packets from every 5 data packets C D E A B Receiver can recover from any 3 lost packets Rate : (r , c) — c repair packets for every r data packets. I Pro: Recovery Latency independent of RTT I Constant Data Overhead: I Packet-level FEC at End-hosts: Inexpensive, No extra HW I Con: Recovery Latency dependent on channel data rate Mahesh Balakrishnan c r +c Reliable Communication for Datacenters Maelstrom Ricochet Conclusion What is FEC? A B C D E X X X 3 repair packets from every 5 data packets C D E A B Receiver can recover from any 3 lost packets Rate : (r , c) — c repair packets for every r data packets. I Pro: Recovery Latency independent of RTT I Constant Data Overhead: I Packet-level FEC at End-hosts: Inexpensive, No extra HW I Con: Recovery Latency dependent on channel data rate I FEC in the Network: I c r +c Where and What? Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion The Maelstrom Network Appliance Maelstrom Send-Side Appliance Maelstrom Receive-Side Appliance FEC Sending End-hosts Commodity TCP Encode Decode Router Router Packet Loss Receiving End-hosts Commodity TCP Transparent: No modification to end-host or network FEC = Forward Error Correction Where: at the appliance, What: aggregated data Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Maelstrom Mechanism Send-Side Appliance: 28 27 26 25 28 29 ‘Recipe List’: 25,26,27,28,29 27 XOR LOSS 25 26 Create repair packet = XOR + ‘recipe’ of data packet IDs 29 X Snoop IP packets I Appliance LAN MTU Lambda Jumbo MTU I Recovered Packet 25 Mahesh Balakrishnan 26 28 Appliance 29 27 Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Maelstrom Mechanism Send-Side Appliance: I Mahesh Balakrishnan 25 28 29 ‘Recipe List’: 25,26,27,28,29 XOR Lost packet recovered using XOR and other data packets At receiver end-host: out of order, no loss 26 LOSS 25 I 27 27 Receive-Side Appliance: 28 26 Create repair packet = XOR + ‘recipe’ of data packet IDs 29 X Snoop IP packets I Appliance LAN MTU Lambda Jumbo MTU I Recovered Packet 25 26 28 Appliance 29 27 Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Layered Interleaving for Bursty Loss Recovery Latency ∝ Actual Burst Size, not Max Burst Size 201 101 21 11 3 2 1 Data Stream XORs: X3 X2 I XORs at different interleaves I Recovery latency degrades gracefully with loss burstiness: X1 catches random singleton losses X2 catches loss bursts of 10 or less X3 catches bursts of 100 or less Mahesh Balakrishnan Reliable Communication for Datacenters X1 2 Maelstrom Ricochet Conclusion Layered Interleaving for Bursty Loss Recovery Latency ∝ Actual Burst Size, not Max Burst Size 201 101 21 11 3 2 1 Data Stream XORs: X3 X2 X1 I XORs at different interleaves I Recovery latency degrades gracefully with loss burstiness: X1 catches random singleton losses X2 catches loss bursts of 10 or less X3 catches bursts of 100 or less Recovery probability 1 Reed−Solomon Maelstrom 0.8 0.6 0.4 0.2 0 0 0.2 0.4 0.6 Loss probability 0.8 Comparison of Recovery Probability: r=7, c=2 Mahesh Balakrishnan Reliable Communication for Datacenters 1 2 Maelstrom Ricochet Conclusion Maelstrom Modes I TCP Traffic: Two Flow Control Modes End-Host Appliance Appliance End-Host A) End-to-End Flow Control End-Host Appliance Appliance End-Host B) Split Flow Control I Split Mode avoids client buffer resizing (PeP) Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Implementation Details I In Kernel — Linux 2.6.20 Module I Commodity Box: 3 Ghz, 1 Gbps NIC (≈ 800$) I Max speed: 1 Gbps, Memory Footprint: 10 MB I 50-60% CPU → NIC is the bottleneck (for c = 3) I How do we efficiently store/access/clean a gigabit of data every second? I Scaling to Multi-Gigabit: Partition IP space across proxies Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Evaluation: FEC Mode and Loss Claim: Maelstrom effectively hides loss from TCP/IP Throughput (Mbps) TCP/IP without loss ↓ 1000 900 800 700 600 500 400 300 200 100 0 0 TCP (No loss) TCP (0.1% loss) TCP (1% loss) ↑ 50 100 150 200 One Way Link Latency (ms) 250 TCP/IP with loss Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Evaluation: FEC Mode and Loss Claim: Maelstrom effectively hides loss from TCP/IP Throughput (Mbps) TCP/IP without loss ↓ 1000 900 800 700 600 500 400 300 200 100 0 0 TCP (No loss) TCP (0.1% loss) TCP (1% loss) Maelstrom (No loss) Maelstrom (0.1%) Maelstrom (1%) ↑ 50 100 150 200 One Way Link Latency (ms) 250 TCP/IP with loss Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Evaluation: FEC Mode and Loss Claim: Maelstrom effectively hides loss from TCP/IP Data + FEC =˜ 1 Gbps Throughput (Mbps) TCP/IP without loss ↓ →{ 1000 900 800 700 600 500 400 300 200 100 0 0 TCP (No loss) TCP (0.1% loss) TCP (1% loss) Maelstrom (No loss) Maelstrom (0.1%) Maelstrom (1%) ↑ 50 100 150 200 One Way Link Latency (ms) 250 TCP/IP with loss Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Evaluation: FEC Mode and Loss Claim: Maelstrom effectively hides loss from TCP/IP Data + FEC =˜ 1 Gbps Throughput (Mbps) TCP/IP without loss ↓ →{ 1000 900 800 700 600 500 400 300 200 100 0 TCP (No loss) TCP (0.1% loss) TCP (1% loss) Maelstrom (No loss) Maelstrom (0.1%) Maelstrom (1%) ← 0 ↑ 50 100 150 200 One Way Link Latency (ms) Tput ∝ 250 TCP/IP with loss Mahesh Balakrishnan Reliable Communication for Datacenters 1 RTT Maelstrom Ricochet Conclusion Evaluation: Split Mode and Buffering Claim: Maelstrom eliminates the need for large end-host buffers Throughput as a function of latency 600 Tput independent of RTT! Throughput (Mbps) 500 400 300 200 100 loss-rate=0.001 0 20 30 40 50 60 70 80 One-way link latency (ms) Mahesh Balakrishnan 90 100 Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Evaluation: Delivery Latency Claim: Maelstrom eliminates TCP/IP’s loss-related jitter 600 TCP/IP: 0.1% Loss Delivery Latency (ms) Delivery Latency (ms) 600 500 400 300 200 100 0 Maelstrom: 0.1% Loss 500 400 300 200 100 0 0 1000 2000 3000 4000 5000 6000 Packet # 0 1000 2000 3000 4000 5000 6000 Packet # Sources of Jitter: I Receive-side buffering due to sequencing I Send-side buffering due to congestion control Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Evaluation: Layered Interleaving Claim: Recovery Latency depends on Actual Burst Length I 0 50 100 150 200 Recovery Latency (ms) 100 80 60 40 20 0 40 % Recovered 100 80 60 40 20 0 20 % Recovered % Recovered Burst Length = 1 0 50 100 150 200 Recovery Latency (ms) 100 80 60 40 20 0 0 50 100 150 200 Recovery Latency (ms) Longer Burst Lengths → Longer Recovery Latency Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Next Step: SMFS - The Smoke and Mirrors Filesystem I Classic Mirroring Trade-off: I I Fast — return to user after sending to mirror Safe — return to user after ACK from mirror I SMFS — return to user after sending enough FEC I Maelstrom: Lossy Network → Lossless Network → Disk! I Result: Fast, Safe Mirroring independent of link length! I General Principle: Gray-box Exposure of Protocol State Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Software Stack The Big Picture Crashes Overloads Bugs Exploits Tempest KyotoFS Disk Failure, Disasters Packet Loss Compilers, Databases, Distributed Systems, Machine Learning, Filesystems, Operating Systems, Networking ... SMFS Plato Ricochet Maelstrom ld, or e -W m al l-Ti e R ea R Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation From Long-Haul to Multicast Between Datacenters High RTT Data acks Sender I Receiver Feedback Loop Infeasible: I Inter-Datacenter Long-Haul: RTT too high Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation From Long-Haul to Multicast Within Datacenter Between Datacenters Many Receivers ack s High RTT Dat a Data Data acks a Da t Sender s ack Multicast Group I Receiver Feedback Loop Infeasible: I I Inter-Datacenter Long-Haul: RTT too high Intra-Cluster Multicast: Too many receivers Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation How is Multicast Used? service replication/partitioning, publish-subscribe, data caching... Financial Pub-Sub Example: I Each equity is mapped to a multicast group I Each node is interested in a different set of equities ... Mahesh Balakrishnan Tracking Portfolio Tracking S&P 500 Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation How is Multicast Used? service replication/partitioning, publish-subscribe, data caching... Financial Pub-Sub Example: I Each equity is mapped to a multicast group I Each node is interested in a different set of equities ... Tracking Portfolio Tracking S&P 500 Each node in many groups =⇒ Low per-group data rate Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation How is Multicast Used? service replication/partitioning, publish-subscribe, data caching... Financial Pub-Sub Example: I Each equity is mapped to a multicast group I Each node is interested in a different set of equities ... Tracking Portfolio Tracking S&P 500 Each node in many groups =⇒ Low per-group data rate High per-node data rate =⇒ Overload Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Where does loss occur in a Datacenter? Packet Loss occurs at end-hosts: independent and bursty Overloaded Node burst length (packets) 100 receiver r1: loss bursts in A and B receiver r1: data bursts in A and B 80 60 40 20 0 60 80 100 Less Loaded Node burst length (packets) 100 120 140 ← ← 160 180 receiver r2: loss bursts in A receiver r2: data bursts in A 80 60 40 20 0 60 80 100 Mahesh Balakrishnan 120 time in seconds 140 160 Reliable Communication for Datacenters 180 Data Rate Loss Bursts Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Problem Statement I I Recover lost packets rapidly! Scalability: Multicast Data Lost Packet Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Problem Statement I I Recover lost packets rapidly! Scalability: I Multicast Data Number of Receivers Lost Packet Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Problem Statement I I Recover lost packets rapidly! Scalability: I I Number of Receivers Number of Senders Mahesh Balakrishnan Multicast Data Lost Packet Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Problem Statement I I Recover lost packets rapidly! Scalability: I I I Number of Receivers Number of Senders Number of Groups Mahesh Balakrishnan Multicast Data Lost Packet Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Design Space for Reliable Multicast How does latency scale? Loss data Sender Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Design Space for Reliable Multicast How does latency scale? I 1. acks: implosion Loss data Sender ack s Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Design Space for Reliable Multicast How does latency scale? s nak Loss I 1. acks: implosion I 2. naks data Sender Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Design Space for Reliable Multicast How does latency scale? s xor Loss data Sender Mahesh Balakrishnan I 1. acks: implosion I 2. naks I 3. Sender-based FEC 1 recovery latency ∝ datarate data rate: one sender, one group Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Design Space for Reliable Multicast How does latency scale? s xor Loss data I 1. acks: implosion I 2. naks I 3. Sender-based FEC 1 recovery latency ∝ datarate data rate: one sender, one group I FEC in the network: Where and What? Sender Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Design Space for Reliable Multicast How does latency scale? Loss xors data I 1. acks: implosion I 2. naks I 3. Sender-based FEC 1 recovery latency ∝ datarate data rate: one sender, one group I FEC in the network: Where and What? Sender Receiver-based FEC: at receivers, from incoming data Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Receiver-Based Forward Error Correction I Receiver generates an XOR of r incoming multicast packets and exchanges with other receivers I Each XOR sent to c other random receivers I Rate: (r , c) Mahesh Balakrishnan Multicast Data S E N D E R S Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Receiver-Based Forward Error Correction I Receiver generates an XOR of r incoming multicast packets and exchanges with other receivers I Each XOR sent to c other random receivers I Rate: (r , c) I 1 latency ∝ P datarate s data rate: across all senders, in a single group Mahesh Balakrishnan Multicast Data S E N D E R S Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Lateral Error Correction: Principle Receiver R1 Receiver R2 Group B A1 A1 A2 A3 A2 A3 B1 B1 B2 Loss B2 A4 A4 B3 B3 B4 A5 A6 B5 Repair Pack et I: (A1,A2,A3,A 4,A5) Repair Pack et (B1,B2,B3,B II: 4,B5) B4 A5 INCOMING DATA PACKETS INCOMING DATA PACKETS Group A I Single-Group RFEC A6 B5 Recovery Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Lateral Error Correction: Principle Receiver R1 Receiver R2 Group B A1 A1 A2 A3 A2 A3 B1 B1 B2 A4 B3 Loss Repair Pack et I: (A1,A2,A3,B 1,B2) Recovery B4 B5 A4 B3 B4 A5 A5 A6 B2 Repair Pack et (A4,B3,B4,A II: 5,A6) INCOMING DATA PACKETS INCOMING DATA PACKETS Group A I I Single-Group RFEC Lateral Error Correction I I Create XORs from multiple groups → faster recovery! What about complex overlap? A6 B5 Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Nodes and Disjoint Regions I 2 3 Receiver n1 belongs to groups A, B, and C 1 Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Nodes and Disjoint Regions I Receiver n1 belongs to groups A, B, and C I Divides groups into disjoint regions 2 3 1 Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Nodes and Disjoint Regions 4 I Receiver n1 belongs to groups A, B, and C I Divides groups into disjoint regions I Is unaware of groups it does not belong to (D) 2 3 I 1 Works with any conventional Group Membership Service Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Regional Selection A A 1 Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Regional Selection I Select targets for XORs from regions, not groups A A 1 Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Regional Selection cAab 1.0 c ab A A a I Select targets for XORs from regions, not groups I From each region, select proportional fraction of |x| · cA cA : cAx = |A| 1 ac A Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Regional Selection cAab 1.0 c ab A A a I Select targets for XORs from regions, not groups I From each region, select proportional fraction of |x| · cA cA : cAx = |A| 1 ac A latency ∝ P P 1 s g datarate data rate: across all senders, in intersections of groups Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Scalability in Groups Claim: Ricochet scales to hundreds of groups. 50000 Recovery % 98 96 94 92 90 Average Recovery Latency 40000 Microseconds LEC Recovery Percentage 100 30000 20000 10000 2 4 8 16 32 64 128 256 512 1024 Groups per Node (Log Scale) 0 2 4 8 16 32 64 128 256 512 1024 Groups per Node (Log Scale) Comparision: At 128 groups, NAK/SFEC latency is 8 seconds. Ricochet is 400 times faster! Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Distribution of Recovery Latency Claim: Ricochet is reliable and time-critical 96.8% LEC + 3.2% NAK 92% LEC + 8% NAK 30 20 10 0 50 100 150 200 Milliseconds 250 (a) 10% Loss Rate 50 Recovery Percentage 40 0 84% LEC + 16% NAK 50 Recovery Percentage Recovery Percentage 50 40 30 20 10 0 0 50 100 150 200 Milliseconds 250 (b) 15% Loss Rate 40 LEC ↓ 30 20 NAK ↓ 10 0 0 50 100 150 200 Milliseconds (c) 20% Loss Rate Most lost packets recovered < 50ms by LEC. Remainder via reactive NAKs. Bursty Loss: 100 packet burst → 90% recovered at 50 ms avg Mahesh Balakrishnan Reliable Communication for Datacenters 250 Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Next Step: Dr. Multicast I IP Multicast has a bad reputation! I I Unscalable filtering at routers/switches/NICs Insecure Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation Next Step: Dr. Multicast I IP Multicast has a bad reputation! I I I Unscalable filtering at routers/switches/NICs Insecure Insight: IP Multicast is a shared, controlled resource I I I I Transparent interception of socket system calls Logical address → Set of network (uni/multi)cast addresses Enforcement of IP Multicast policies Gossip-based tracking of membership/mappings Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Software Stack The Big Picture Crashes Overloads Bugs Exploits Tempest DSN 2008 KyotoFS Disk Failure, Disasters SMFS HotOS XI Submitted Plato SRDS 2006 Packet Loss Compilers, Databases, Distributed Systems, Machine Learning, Filesystems, Operating Systems, Networking ... Ricochet Maelstrom NSDI 2007 NSDI 2008 ld, or e -W m Mistral al l-Ti e a MobiHoc R 2006 Re Mahesh Balakrishnan Sequoia PODC 2007 Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Software Stack The Big Picture Crashes Overloads Bugs Exploits Tempest DSN 2008 What about new abstractions? KyotoFS Disk Failure, Disasters SMFS HotOS XI Submitted Plato SRDS 2006 Packet Loss Compilers, Databases, Distributed Systems, Machine Learning, Filesystems, Operating Systems, Networking ... Ricochet Maelstrom NSDI 2007 NSDI 2008 ld, or e -W m Mistral al l-Ti e a MobiHoc R 2006 Re Mahesh Balakrishnan Sequoia PODC 2007 Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Service Future Work Invocation (ID …) Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Partition I (ID 2… Partition for Scalability ) (ID1 …) 0… (ID ) Instance Instance Instance Future Work Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Group (ID 2… ) I Partition for Scalability I Replicate for Availability / Fault-Tolerance (ID1 …) 0… (ID ) Group Partition Group Future Work Replicate Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Future Work Group The Datacenter is the Computer Group Group Partition Rethink Old Abstractions (ID Processes, Threads, 2… Address Space, Protection, ) (ID1 …) Locks, IPC/RPC, ) 0… Sockets, Files... (ID Invent New Abstractions! Replicate Mahesh Balakrishnan I Partition for Scalability I Replicate for Availability / Fault-Tolerance I The DatacenterOS Concerns: Performance, Parallelism, Privacy, Power... Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Conclusion I The Real-Time Datacenter I I Recover from failures within seconds Reliable Communication: FEC in the Network I I I Recover lost packets in milliseconds Maelstrom: between Datacenters Ricochet: within Datacenters Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Conclusion I The Real-Time Datacenter I I Recover from failures within seconds Reliable Communication: FEC in the Network I I I Recover lost packets in milliseconds Maelstrom: between Datacenters Ricochet: within Datacenters Thank You! Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Extra Slide: FEC and Bursty Loss A I Existing solution: interleaving I Interleave i and rate (r , c) tolerates (c ∗ i) burst... I ...with i times the latency Mahesh Balakrishnan B C D E F G H A C E G I X X X B D F H J X X X I J Figure: Interleave of 2 — Even and Odd packets encoded separately Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Extra Slide: FEC and Bursty Loss A I Existing solution: interleaving I Interleave i and rate (r , c) tolerates (c ∗ i) burst... I ...with i times the latency B C D E F G H A C E G I X X X B D F H J X X X I J Figure: Interleave of 2 — Even and Odd packets encoded separately Wanted: Graceful degradation of recovery latency with actual burst size for constant overhead Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Extra Slide: Maelstrom Evaluation Maelstrom goodput is near theoretical maximum 1000 theoretical goodput M-FEC goodput Throughput (Mbits/sec) 900 800 700 600 500 400 300 200 100 0 0 1 Mahesh Balakrishnan 2 3 FEC Layers (c) 4 5 6 Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Recovery Latency (Milliseconds) Extra Slide: Layered Interleaving 100 Reed Solomon Layered Interleaving 80 60 40 20 0 0 200 400 600 800 Recovered Packet # Mahesh Balakrishnan 1000 Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Evaluation: Layered Interleaving Claim: Recovery Latency depends on Actual Burst Length I 0 50 100 150 200 Recovery Latency (ms) 100 80 60 40 20 0 40 % Recovered 100 80 60 40 20 0 20 % Recovered % Recovered Burst Length = 1 0 50 100 150 200 Recovery Latency (ms) 100 80 60 40 20 0 0 50 100 150 200 Recovery Latency (ms) Longer Burst Lengths → Longer Recovery Latency Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion TeraGrid: Supercomputer Network I End-to-End UDP Probes: Zero Congestion, Non-Zero Loss! Possible Reasons: I I I I I I I I transient congestion degraded fiber malfunctioning HW misconfigured HW switching contention low receiver power end-host overflow ... 30 25 % of Measurements I 24 20 15 14 10 5 0 0.01 0.03 0.05 0.07 0.1 0.3 0.5 % of Lost Packets Mahesh Balakrishnan Reliable Communication for Datacenters 0.7 1 Maelstrom Ricochet Conclusion TeraGrid: Supercomputer Network I End-to-End UDP Probes: Zero Congestion, Non-Zero Loss! Possible Reasons: I I I I I I I I transient congestion degraded fiber malfunctioning HW misconfigured HW switching contention low receiver power end-host overflow ... 30 25 24 All Measurements Without Indiana U. Site % of Measurements I 20 15 14 14 10 5 3 0 0.01 0.03 0.05 0.07 0.1 0.3 0.5 % of Lost Packets Mahesh Balakrishnan Reliable Communication for Datacenters 0.7 1 Maelstrom Ricochet Conclusion TeraGrid: Supercomputer Network I End-to-End UDP Probes: Zero Congestion, Non-Zero Loss! Possible Reasons: I I I I I I I I transient congestion degraded fiber malfunctioning HW misconfigured HW switching contention low receiver power end-host overflow ... 30 25 24 All Measurements Without Indiana U. Site % of Measurements I 20 15 14 14 10 5 3 0 0.01 0.03 0.05 0.07 0.1 0.3 0.5 % of Lost Packets Problem Statement: Run unmodified TCP/IP over lossy high-speed long-distance networks Mahesh Balakrishnan Reliable Communication for Datacenters 0.7 1 Maelstrom Ricochet Conclusion Evaluation: Split Mode and Buffering Claim: Maelstrom eliminates the need for large end-host buffers 600 throughput 0.1% loss Throughput (Mbits/sec) 500 400 300 200 100 0 client-buf M-ALL M-FEC M-BUF RTT 100ms Mahesh Balakrishnan tcp Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Evaluation: Split Mode and Buffering Claim: Maelstrom eliminates the need for large end-host buffers 600 throughput 0.1% loss Throughput (Mbits/sec) 500 ← 400 300 Maelstrom FEC 200 100 0 client-buf M-ALL M-FEC M-BUF RTT 100ms Mahesh Balakrishnan tcp Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Evaluation: Split Mode and Buffering Claim: Maelstrom eliminates the need for large end-host buffers 600 throughput 0.1% loss Throughput (Mbits/sec) 500 ← 400 300 Maelstrom FEC 200 100 0 ↑ client-buf M-ALL M-FEC M-BUF RTT 100ms tcp Maelstrom FEC + Hand-Tuned Buffers Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Evaluation: Split Mode and Buffering Claim: Maelstrom eliminates the need for large end-host buffers Maelstrom FEC + Split Mode ↓ 600 throughput 0.1% loss Throughput (Mbits/sec) 500 ← 400 300 Maelstrom FEC 200 100 0 ↑ client-buf M-ALL M-FEC M-BUF RTT 100ms tcp Maelstrom FEC + Hand-Tuned Buffers Mahesh Balakrishnan Reliable Communication for Datacenters Maelstrom Ricochet Conclusion Evaluation: FEC mode and loss Throughput (Mbps) Claim: Maelstrom works at high loss rates 500 450 400 350 300 250 200 150 100 50 0 Maelstrom TCP 0.1 1 Packet Loss Rate % 10 Link RTT = 100ms Mahesh Balakrishnan Reliable Communication for Datacenters