Download pdf

Document related concepts
no text concepts found
Transcript
Maelstrom Ricochet Conclusion
Reliable Communication for Datacenters
Mahesh Balakrishnan
Cornell University
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Datacenters
I
Internet Services (90s) — Websites, Search, Online Stores
I
Since then:
# of low-end volume servers
30
Installed Server Base 00-05:
Millions
25
20
15
10
5
I
Commodity — up by 100%
I
High/Mid — down by 40%
0
2000 2001 2002 2003 2004 2005
I
Today: Datacenters are ubiquitous
I
How have they evolved?
Data partially sourced from IDC press releases (www.idc.com)
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Networks of Datacenters
RTT: 110 ms
Why?
Business Continuity,
Client Locality,
Distributed Datasets
or Operations ...
Any modern
enterprise!
100 ms
200 ms
220 ms
100 ms
110 ms
210 ms
N
W
E
S
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Networks of Real-Time Datacenters
I
Finance, Aerospace, Military, Search and Rescue...
I
... documents, chat, email, games, videos, photos, blogs,
social networks
I
The Datacenter is the Computer!
I
Not hard real-time: real fast, highly responsive, time-critical
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Networks of Real-Time Datacenters
I
Finance, Aerospace, Military, Search and Rescue...
I
... documents, chat, email, games, videos, photos, blogs,
social networks
I
The Datacenter is the Computer!
I
Not hard real-time: real fast, highly responsive, time-critical
Gartner Survey:
I
Real-Time Infrastructure (RTI): reaction time in secs/mins
I
73%: RTI is important or very important
I
85%: Have no RTI capability
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
The Real-Time Datacenter — Systems Challenges
How do we recover from failures within seconds?
ld,
or e
-W m
al l-Ti
e
R ea
R
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
The Real-Time Datacenter — Systems Challenges
Software Stack
How do we recover from failures within seconds?
Crashes
Overloads
Bugs
Exploits
Disk
Failure,
Disasters
Packet
Loss
Compilers, Databases,
Distributed Systems,
Machine Learning,
Filesystems, Operating
Systems, Networking ...
ld,
or e
-W m
al l-Ti
e
R ea
R
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
The Real-Time Datacenter — Systems Challenges
Software Stack
How do we recover from failures within seconds?
Crashes
Overloads
Bugs
Exploits
Tempest
KyotoFS
Disk
Failure,
Disasters
Packet
Loss
Compilers, Databases,
Distributed Systems,
Machine Learning,
Filesystems, Operating
Systems, Networking ...
SMFS
Plato
Ricochet
Maelstrom
ld,
or e
-W m
al l-Ti
e
R ea
R
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Reliable Communication
Goal: Recover lost packets fast!
I
Existing protocols react to loss: too much, too late
I
We want proactive recovery: stable overhead, low latencies
I
Maelstrom: Reliability between datacenters
[NSDI 2008]
I
Ricochet: Reliability within datacenters
[NSDI 2007]
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Reliable Communication between Datacenters
TCP fails in three ways:
1. Throughput Collapse
100ms RTT, 0.1% Loss, 40 Gbps → Tput < 10 Mbps!
2. Massive Buffers required for High-Rate Traffic
3. Recovery Delays for Time-Critical Traffic
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Reliable Communication between Datacenters
TCP fails in three ways:
1. Throughput Collapse
100ms RTT, 0.1% Loss, 40 Gbps → Tput < 10 Mbps!
2. Massive Buffers required for High-Rate Traffic
3. Recovery Delays for Time-Critical Traffic
Current Solutions:
I
Rewrite Apps: One Flow → Multiple Split Flows
I
Resize Buffers
I
Spend (infinite) money!
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
TeraGrid: Supercomputer Network
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
TeraGrid: Supercomputer Network
I
End-to-End UDP Probes: Zero
Congestion, Non-Zero Loss!
Possible Reasons:
I
I
I
I
I
I
I
I
transient congestion
degraded fiber
malfunctioning HW
misconfigured HW
switching contention
low receiver power
end-host overflow
...
30
25
% of Measurements
I
24
20
15
14
10
5
0
0.01 0.03 0.05 0.07 0.1
0.3
0.5
% of Lost Packets
Mahesh Balakrishnan
Reliable Communication for Datacenters
0.7
1
Maelstrom Ricochet Conclusion
TeraGrid: Supercomputer Network
I
End-to-End UDP Probes: Zero
Congestion, Non-Zero Loss!
Possible Reasons:
I
I
I
I
I
I
I
I
30
25
% of Measurements
I
transient congestion
degraded fiber
malfunctioning HW
misconfigured HW
switching contention
low receiver power
end-host overflow
...
24
20
15
14
10
5
0
0.01 0.03 0.05 0.07 0.1
0.3
0.5
0.7
1
% of Lost Packets
Electronics: Cluttered Pathways
Mahesh Balakrishnan
Optics: Lossy Fiber
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Problem Statement
Run unmodified TCP/IP over lossy high-speed
long-distance networks
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
The Maelstrom Network Appliance
Router
Sending End-hosts
Commodity TCP
Router
Packet Loss
Mahesh Balakrishnan
Receiving End-hosts
Commodity TCP
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
The Maelstrom Network Appliance
Sending End-hosts
Commodity TCP
Maelstrom
Send-Side
Appliance
Maelstrom
Receive-Side
Appliance
Router
Router
Packet Loss
Receiving End-hosts
Commodity TCP
Transparent: No modification to end-host or network
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
The Maelstrom Network Appliance
Maelstrom
Send-Side
Appliance
Maelstrom
Receive-Side
Appliance
FEC
Sending End-hosts
Commodity TCP
Encode
Decode
Router
Router
Packet Loss
Receiving End-hosts
Commodity TCP
Transparent: No modification to end-host or network
FEC = Forward Error Correction
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
What is FEC?
A B C D E X X X
3 repair packets from
every 5 data packets
C D E A B
Receiver can recover
from any 3 lost packets
Rate : (r , c) — c repair packets for every r data packets.
I
Pro: Recovery Latency independent of RTT
I
Constant Data Overhead:
I
Packet-level FEC at End-hosts: Inexpensive, No extra HW
Mahesh Balakrishnan
c
r +c
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
What is FEC?
A B C D E X X X
3 repair packets from
every 5 data packets
C D E A B
Receiver can recover
from any 3 lost packets
Rate : (r , c) — c repair packets for every r data packets.
I
Pro: Recovery Latency independent of RTT
I
Constant Data Overhead:
I
Packet-level FEC at End-hosts: Inexpensive, No extra HW
I
Con: Recovery Latency dependent on channel data rate
Mahesh Balakrishnan
c
r +c
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
What is FEC?
A B C D E X X X
3 repair packets from
every 5 data packets
C D E A B
Receiver can recover
from any 3 lost packets
Rate : (r , c) — c repair packets for every r data packets.
I
Pro: Recovery Latency independent of RTT
I
Constant Data Overhead:
I
Packet-level FEC at End-hosts: Inexpensive, No extra HW
I
Con: Recovery Latency dependent on channel data rate
I
FEC in the Network:
I
c
r +c
Where and What?
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
The Maelstrom Network Appliance
Maelstrom
Send-Side
Appliance
Maelstrom
Receive-Side
Appliance
FEC
Sending End-hosts
Commodity TCP
Encode
Decode
Router
Router
Packet Loss
Receiving End-hosts
Commodity TCP
Transparent: No modification to end-host or network
FEC = Forward Error Correction
Where: at the appliance, What: aggregated data
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Maelstrom Mechanism
Send-Side Appliance:
28
27
26
25
28
29
‘Recipe List’:
25,26,27,28,29
27
XOR
LOSS
25
26
Create repair packet =
XOR + ‘recipe’ of data
packet IDs
29
X
Snoop IP packets
I
Appliance
LAN MTU
Lambda Jumbo MTU
I
Recovered
Packet
25
Mahesh Balakrishnan
26
28
Appliance
29
27
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Maelstrom Mechanism
Send-Side Appliance:
I
Mahesh Balakrishnan
25
28
29
‘Recipe List’:
25,26,27,28,29
XOR
Lost packet recovered
using XOR and other
data packets
At receiver end-host: out
of order, no loss
26
LOSS
25
I
27
27
Receive-Side Appliance:
28
26
Create repair packet =
XOR + ‘recipe’ of data
packet IDs
29
X
Snoop IP packets
I
Appliance
LAN MTU
Lambda Jumbo MTU
I
Recovered
Packet
25
26
28
Appliance
29
27
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Layered Interleaving for Bursty Loss
Recovery Latency ∝ Actual Burst Size, not Max Burst Size
201
101
21
11
3
2
1
Data Stream
XORs:
X3
X2
I
XORs at different interleaves
I
Recovery latency degrades gracefully
with loss burstiness:
X1 catches random singleton losses
X2 catches loss bursts of 10 or less
X3 catches bursts of 100 or less
Mahesh Balakrishnan
Reliable Communication for Datacenters
X1
2
Maelstrom Ricochet Conclusion
Layered Interleaving for Bursty Loss
Recovery Latency ∝ Actual Burst Size, not Max Burst Size
201
101
21
11
3
2
1
Data Stream
XORs:
X3
X2
X1
I
XORs at different interleaves
I
Recovery latency degrades gracefully
with loss burstiness:
X1 catches random singleton losses
X2 catches loss bursts of 10 or less
X3 catches bursts of 100 or less
Recovery probability
1
Reed−Solomon
Maelstrom
0.8
0.6
0.4
0.2
0
0
0.2
0.4
0.6
Loss probability
0.8
Comparison of Recovery
Probability: r=7, c=2
Mahesh Balakrishnan
Reliable Communication for Datacenters
1
2
Maelstrom Ricochet Conclusion
Maelstrom Modes
I
TCP Traffic: Two Flow Control Modes
End-Host
Appliance
Appliance
End-Host
A) End-to-End Flow Control
End-Host
Appliance
Appliance
End-Host
B) Split Flow Control
I
Split Mode avoids client buffer resizing (PeP)
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Implementation Details
I
In Kernel — Linux 2.6.20 Module
I
Commodity Box: 3 Ghz, 1 Gbps NIC (≈ 800$)
I
Max speed: 1 Gbps, Memory Footprint: 10 MB
I
50-60% CPU → NIC is the bottleneck (for c = 3)
I
How do we efficiently store/access/clean a gigabit of data
every second?
I
Scaling to Multi-Gigabit: Partition IP space across proxies
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Evaluation: FEC Mode and Loss
Claim: Maelstrom effectively hides loss from TCP/IP
Throughput (Mbps)
TCP/IP without loss
↓
1000
900
800
700
600
500
400
300
200
100
0
0
TCP (No loss)
TCP (0.1% loss)
TCP (1% loss)
↑
50
100
150
200
One Way Link Latency (ms)
250
TCP/IP with loss
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Evaluation: FEC Mode and Loss
Claim: Maelstrom effectively hides loss from TCP/IP
Throughput (Mbps)
TCP/IP without loss
↓
1000
900
800
700
600
500
400
300
200
100
0
0
TCP (No loss)
TCP (0.1% loss)
TCP (1% loss)
Maelstrom (No loss)
Maelstrom (0.1%)
Maelstrom (1%)
↑
50
100
150
200
One Way Link Latency (ms)
250
TCP/IP with loss
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Evaluation: FEC Mode and Loss
Claim: Maelstrom effectively hides loss from TCP/IP
Data
+ FEC
=˜ 1 Gbps
Throughput (Mbps)
TCP/IP without loss
↓
→{
1000
900
800
700
600
500
400
300
200
100
0
0
TCP (No loss)
TCP (0.1% loss)
TCP (1% loss)
Maelstrom (No loss)
Maelstrom (0.1%)
Maelstrom (1%)
↑
50
100
150
200
One Way Link Latency (ms)
250
TCP/IP with loss
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Evaluation: FEC Mode and Loss
Claim: Maelstrom effectively hides loss from TCP/IP
Data
+ FEC
=˜ 1 Gbps
Throughput (Mbps)
TCP/IP without loss
↓
→{
1000
900
800
700
600
500
400
300
200
100
0
TCP (No loss)
TCP (0.1% loss)
TCP (1% loss)
Maelstrom (No loss)
Maelstrom (0.1%)
Maelstrom (1%)
←
0
↑
50
100
150
200
One Way Link Latency (ms)
Tput ∝
250
TCP/IP with loss
Mahesh Balakrishnan
Reliable Communication for Datacenters
1
RTT
Maelstrom Ricochet Conclusion
Evaluation: Split Mode and Buffering
Claim: Maelstrom eliminates the need for large end-host buffers
Throughput as a function of latency
600
Tput
independent
of RTT!
Throughput (Mbps)
500
400
300
200
100
loss-rate=0.001
0
20
30
40
50
60
70
80
One-way link latency (ms)
Mahesh Balakrishnan
90
100
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Evaluation: Delivery Latency
Claim: Maelstrom eliminates TCP/IP’s loss-related jitter
600
TCP/IP: 0.1% Loss
Delivery Latency (ms)
Delivery Latency (ms)
600
500
400
300
200
100
0
Maelstrom: 0.1% Loss
500
400
300
200
100
0
0
1000 2000 3000 4000 5000 6000
Packet #
0
1000 2000 3000 4000 5000 6000
Packet #
Sources of Jitter:
I
Receive-side buffering due to sequencing
I
Send-side buffering due to congestion control
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Evaluation: Layered Interleaving
Claim: Recovery Latency depends on Actual Burst Length
I
0
50 100 150 200
Recovery Latency (ms)
100
80
60
40
20
0
40
% Recovered
100
80
60
40
20
0
20
% Recovered
% Recovered
Burst Length = 1
0
50 100 150 200
Recovery Latency (ms)
100
80
60
40
20
0
0
50 100 150 200
Recovery Latency (ms)
Longer Burst Lengths → Longer Recovery Latency
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Next Step: SMFS - The Smoke and Mirrors Filesystem
I
Classic Mirroring Trade-off:
I
I
Fast — return to user after sending to mirror
Safe — return to user after ACK from mirror
I
SMFS — return to user after sending enough FEC
I
Maelstrom: Lossy Network → Lossless Network → Disk!
I
Result: Fast, Safe Mirroring independent of link length!
I
General Principle: Gray-box Exposure of Protocol State
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Software Stack
The Big Picture
Crashes
Overloads
Bugs
Exploits
Tempest
KyotoFS
Disk
Failure,
Disasters
Packet
Loss
Compilers, Databases,
Distributed Systems,
Machine Learning,
Filesystems, Operating
Systems, Networking ...
SMFS
Plato
Ricochet
Maelstrom
ld,
or e
-W m
al l-Ti
e
R ea
R
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
From Long-Haul to Multicast
Between
Datacenters
High RTT
Data
acks
Sender
I
Receiver
Feedback Loop Infeasible:
I
Inter-Datacenter Long-Haul: RTT too high
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
From Long-Haul to Multicast
Within
Datacenter
Between
Datacenters
Many Receivers
ack
s
High RTT
Dat
a
Data
Data
acks
a
Da t
Sender
s
ack
Multicast
Group
I
Receiver
Feedback Loop Infeasible:
I
I
Inter-Datacenter Long-Haul: RTT too high
Intra-Cluster Multicast: Too many receivers
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
How is Multicast Used?
service replication/partitioning, publish-subscribe, data caching...
Financial Pub-Sub Example:
I
Each equity is mapped to
a multicast group
I
Each node is interested
in a different set of
equities ...
Mahesh Balakrishnan
Tracking
Portfolio
Tracking
S&P 500
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
How is Multicast Used?
service replication/partitioning, publish-subscribe, data caching...
Financial Pub-Sub Example:
I
Each equity is mapped to
a multicast group
I
Each node is interested
in a different set of
equities ...
Tracking
Portfolio
Tracking
S&P 500
Each node in many groups
=⇒ Low per-group data rate
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
How is Multicast Used?
service replication/partitioning, publish-subscribe, data caching...
Financial Pub-Sub Example:
I
Each equity is mapped to
a multicast group
I
Each node is interested
in a different set of
equities ...
Tracking
Portfolio
Tracking
S&P 500
Each node in many groups
=⇒ Low per-group data rate
High per-node data rate
=⇒ Overload
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Where does loss occur in a Datacenter?
Packet Loss occurs at end-hosts: independent and bursty
Overloaded
Node
burst length (packets)
100
receiver r1: loss bursts in A and B
receiver r1: data bursts in A and B
80
60
40
20
0
60
80
100
Less
Loaded
Node
burst length (packets)
100
120
140
←
←
160
180
receiver r2: loss bursts in A
receiver r2: data bursts in A
80
60
40
20
0
60
80
100
Mahesh Balakrishnan
120
time in seconds
140
160
Reliable Communication for Datacenters
180
Data
Rate
Loss
Bursts
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Problem Statement
I
I
Recover lost packets
rapidly!
Scalability:
Multicast
Data
Lost
Packet
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Problem Statement
I
I
Recover lost packets
rapidly!
Scalability:
I
Multicast
Data
Number of Receivers
Lost
Packet
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Problem Statement
I
I
Recover lost packets
rapidly!
Scalability:
I
I
Number of Receivers
Number of Senders
Mahesh Balakrishnan
Multicast
Data
Lost
Packet
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Problem Statement
I
I
Recover lost packets
rapidly!
Scalability:
I
I
I
Number of Receivers
Number of Senders
Number of Groups
Mahesh Balakrishnan
Multicast
Data
Lost
Packet
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Design Space for Reliable Multicast
How does latency scale?
Loss
data
Sender
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Design Space for Reliable Multicast
How does latency scale?
I
1. acks: implosion
Loss
data
Sender
ack
s
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Design Space for Reliable Multicast
How does latency scale?
s
nak
Loss
I
1. acks: implosion
I
2. naks
data
Sender
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Design Space for Reliable Multicast
How does latency scale?
s
xor
Loss
data
Sender
Mahesh Balakrishnan
I
1. acks: implosion
I
2. naks
I
3. Sender-based FEC
1
recovery latency ∝ datarate
data rate:
one sender, one group
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Design Space for Reliable Multicast
How does latency scale?
s
xor
Loss
data
I
1. acks: implosion
I
2. naks
I
3. Sender-based FEC
1
recovery latency ∝ datarate
data rate:
one sender, one group
I
FEC in the network:
Where and What?
Sender
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Design Space for Reliable Multicast
How does latency scale?
Loss
xors
data
I
1. acks: implosion
I
2. naks
I
3. Sender-based FEC
1
recovery latency ∝ datarate
data rate:
one sender, one group
I
FEC in the network:
Where and What?
Sender
Receiver-based FEC: at receivers, from incoming data
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Receiver-Based Forward Error Correction
I
Receiver generates an XOR of r
incoming multicast packets and
exchanges with other receivers
I
Each XOR sent to c other
random receivers
I
Rate: (r , c)
Mahesh Balakrishnan
Multicast
Data
S
E
N
D
E
R
S
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Receiver-Based Forward Error Correction
I
Receiver generates an XOR of r
incoming multicast packets and
exchanges with other receivers
I
Each XOR sent to c other
random receivers
I
Rate: (r , c)
I
1
latency ∝ P datarate
s
data rate: across all senders, in
a single group
Mahesh Balakrishnan
Multicast
Data
S
E
N
D
E
R
S
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Lateral Error Correction: Principle
Receiver
R1
Receiver
R2
Group
B
A1
A1
A2
A3
A2
A3
B1
B1
B2
Loss
B2
A4
A4
B3
B3
B4
A5
A6
B5
Repair Pack
et I:
(A1,A2,A3,A
4,A5)
Repair Pack
et
(B1,B2,B3,B II:
4,B5)
B4
A5
INCOMING DATA PACKETS
INCOMING DATA PACKETS
Group
A
I
Single-Group RFEC
A6
B5
Recovery
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Lateral Error Correction: Principle
Receiver
R1
Receiver
R2
Group
B
A1
A1
A2
A3
A2
A3
B1
B1
B2
A4
B3
Loss
Repair Pack
et I:
(A1,A2,A3,B
1,B2)
Recovery
B4
B5
A4
B3
B4
A5
A5
A6
B2
Repair Pack
et
(A4,B3,B4,A II:
5,A6)
INCOMING DATA PACKETS
INCOMING DATA PACKETS
Group
A
I
I
Single-Group RFEC
Lateral Error Correction
I
I
Create XORs from multiple
groups → faster recovery!
What about complex
overlap?
A6
B5
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Nodes and Disjoint Regions
I
2
3
Receiver n1 belongs to
groups A, B, and C
1
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Nodes and Disjoint Regions
I
Receiver n1 belongs to
groups A, B, and C
I
Divides groups into
disjoint regions
2
3
1
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Nodes and Disjoint Regions
4
I
Receiver n1 belongs to
groups A, B, and C
I
Divides groups into
disjoint regions
I
Is unaware of groups it
does not belong to (D)
2
3
I
1
Works with any conventional Group Membership Service
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Regional Selection
A
A
1
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Regional Selection
I
Select targets for XORs
from regions, not groups
A
A
1
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Regional Selection
cAab
1.0
c
ab
A
A a
I
Select targets for XORs
from regions, not groups
I
From each region, select
proportional fraction of
|x|
· cA
cA : cAx = |A|
1
ac
A
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Regional Selection
cAab
1.0
c
ab
A
A a
I
Select targets for XORs
from regions, not groups
I
From each region, select
proportional fraction of
|x|
· cA
cA : cAx = |A|
1
ac
A
latency ∝
P P 1
s
g datarate
data rate: across all senders, in intersections of groups
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Scalability in Groups
Claim: Ricochet scales to hundreds of groups.
50000
Recovery %
98
96
94
92
90
Average Recovery Latency
40000
Microseconds
LEC Recovery Percentage
100
30000
20000
10000
2
4
8 16 32 64 128 256 512 1024
Groups per Node (Log Scale)
0
2
4
8 16 32 64 128 256 512 1024
Groups per Node (Log Scale)
Comparision: At 128 groups, NAK/SFEC latency is 8 seconds.
Ricochet is 400 times faster!
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Distribution of Recovery Latency
Claim: Ricochet is reliable and time-critical
96.8% LEC + 3.2% NAK
92% LEC + 8% NAK
30
20
10
0
50
100 150 200
Milliseconds
250
(a) 10% Loss Rate
50
Recovery Percentage
40
0
84% LEC + 16% NAK
50
Recovery Percentage
Recovery Percentage
50
40
30
20
10
0
0
50
100 150 200
Milliseconds
250
(b) 15% Loss Rate
40
LEC
↓
30
20
NAK
↓
10
0
0
50
100 150 200
Milliseconds
(c) 20% Loss Rate
Most lost packets recovered < 50ms by LEC.
Remainder via reactive NAKs.
Bursty Loss: 100 packet burst → 90% recovered at 50 ms avg
Mahesh Balakrishnan
Reliable Communication for Datacenters
250
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Next Step: Dr. Multicast
I
IP Multicast has a bad reputation!
I
I
Unscalable filtering at routers/switches/NICs
Insecure
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Motivation Design and Implementation Evaluation
Next Step: Dr. Multicast
I
IP Multicast has a bad reputation!
I
I
I
Unscalable filtering at routers/switches/NICs
Insecure
Insight: IP Multicast is a shared, controlled resource
I
I
I
I
Transparent interception of socket system calls
Logical address → Set of network (uni/multi)cast addresses
Enforcement of IP Multicast policies
Gossip-based tracking of membership/mappings
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Software Stack
The Big Picture
Crashes
Overloads
Bugs
Exploits
Tempest
DSN 2008
KyotoFS
Disk
Failure,
Disasters
SMFS
HotOS XI
Submitted
Plato
SRDS 2006
Packet
Loss
Compilers, Databases,
Distributed Systems,
Machine Learning,
Filesystems, Operating
Systems, Networking ...
Ricochet
Maelstrom
NSDI 2007
NSDI 2008
ld,
or e
-W m
Mistral
al l-Ti
e
a
MobiHoc
R 2006
Re
Mahesh Balakrishnan
Sequoia
PODC 2007
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Software Stack
The Big Picture
Crashes
Overloads
Bugs
Exploits
Tempest
DSN 2008
What about new abstractions?
KyotoFS
Disk
Failure,
Disasters
SMFS
HotOS XI
Submitted
Plato
SRDS 2006
Packet
Loss
Compilers, Databases,
Distributed Systems,
Machine Learning,
Filesystems, Operating
Systems, Networking ...
Ricochet
Maelstrom
NSDI 2007
NSDI 2008
ld,
or e
-W m
Mistral
al l-Ti
e
a
MobiHoc
R 2006
Re
Mahesh Balakrishnan
Sequoia
PODC 2007
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Service
Future Work
Invocation
(ID …)
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Partition
I
(ID
2…
Partition for Scalability
)
(ID1 …)
0…
(ID
)
Instance
Instance
Instance
Future Work
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Group
(ID
2…
)
I
Partition for Scalability
I
Replicate for Availability /
Fault-Tolerance
(ID1 …)
0…
(ID
)
Group
Partition
Group
Future Work
Replicate
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Future Work
Group
The Datacenter is the Computer
Group
Group
Partition
Rethink Old Abstractions
(ID
Processes, Threads,
2…
Address Space, Protection, )
(ID1 …)
Locks, IPC/RPC,
)
0…
Sockets, Files...
(ID
Invent New Abstractions!
Replicate
Mahesh Balakrishnan
I
Partition for Scalability
I
Replicate for Availability /
Fault-Tolerance
I
The DatacenterOS
Concerns: Performance,
Parallelism, Privacy,
Power...
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Conclusion
I
The Real-Time Datacenter
I
I
Recover from failures within seconds
Reliable Communication: FEC in the Network
I
I
I
Recover lost packets in milliseconds
Maelstrom: between Datacenters
Ricochet: within Datacenters
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Conclusion
I
The Real-Time Datacenter
I
I
Recover from failures within seconds
Reliable Communication: FEC in the Network
I
I
I
Recover lost packets in milliseconds
Maelstrom: between Datacenters
Ricochet: within Datacenters
Thank You!
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Extra Slide: FEC and Bursty Loss
A
I
Existing solution:
interleaving
I
Interleave i and rate
(r , c) tolerates (c ∗ i)
burst...
I
...with i times the latency
Mahesh Balakrishnan
B C D E
F
G H
A C
E
G
I
X X X
B D
F
H
J
X X X
I
J
Figure: Interleave of 2 — Even
and Odd packets encoded
separately
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Extra Slide: FEC and Bursty Loss
A
I
Existing solution:
interleaving
I
Interleave i and rate
(r , c) tolerates (c ∗ i)
burst...
I
...with i times the latency
B C D E
F
G H
A C
E
G
I
X X X
B D
F
H
J
X X X
I
J
Figure: Interleave of 2 — Even
and Odd packets encoded
separately
Wanted: Graceful degradation of recovery latency with actual
burst size for constant overhead
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Extra Slide: Maelstrom Evaluation
Maelstrom goodput is near theoretical maximum
1000
theoretical goodput
M-FEC goodput
Throughput (Mbits/sec)
900
800
700
600
500
400
300
200
100
0
0
1
Mahesh Balakrishnan
2
3
FEC Layers (c)
4
5
6
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Recovery Latency (Milliseconds)
Extra Slide: Layered Interleaving
100
Reed Solomon
Layered Interleaving
80
60
40
20
0
0
200
400
600
800
Recovered Packet #
Mahesh Balakrishnan
1000
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Evaluation: Layered Interleaving
Claim: Recovery Latency depends on Actual Burst Length
I
0
50 100 150 200
Recovery Latency (ms)
100
80
60
40
20
0
40
% Recovered
100
80
60
40
20
0
20
% Recovered
% Recovered
Burst Length = 1
0
50 100 150 200
Recovery Latency (ms)
100
80
60
40
20
0
0
50 100 150 200
Recovery Latency (ms)
Longer Burst Lengths → Longer Recovery Latency
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
TeraGrid: Supercomputer Network
I
End-to-End UDP Probes: Zero
Congestion, Non-Zero Loss!
Possible Reasons:
I
I
I
I
I
I
I
I
transient congestion
degraded fiber
malfunctioning HW
misconfigured HW
switching contention
low receiver power
end-host overflow
...
30
25
% of Measurements
I
24
20
15
14
10
5
0
0.01 0.03 0.05 0.07 0.1
0.3
0.5
% of Lost Packets
Mahesh Balakrishnan
Reliable Communication for Datacenters
0.7
1
Maelstrom Ricochet Conclusion
TeraGrid: Supercomputer Network
I
End-to-End UDP Probes: Zero
Congestion, Non-Zero Loss!
Possible Reasons:
I
I
I
I
I
I
I
I
transient congestion
degraded fiber
malfunctioning HW
misconfigured HW
switching contention
low receiver power
end-host overflow
...
30
25 24
All Measurements
Without Indiana U. Site
% of Measurements
I
20
15
14
14
10
5
3
0
0.01 0.03 0.05 0.07 0.1
0.3
0.5
% of Lost Packets
Mahesh Balakrishnan
Reliable Communication for Datacenters
0.7
1
Maelstrom Ricochet Conclusion
TeraGrid: Supercomputer Network
I
End-to-End UDP Probes: Zero
Congestion, Non-Zero Loss!
Possible Reasons:
I
I
I
I
I
I
I
I
transient congestion
degraded fiber
malfunctioning HW
misconfigured HW
switching contention
low receiver power
end-host overflow
...
30
25 24
All Measurements
Without Indiana U. Site
% of Measurements
I
20
15
14
14
10
5
3
0
0.01 0.03 0.05 0.07 0.1
0.3
0.5
% of Lost Packets
Problem Statement: Run unmodified TCP/IP over lossy
high-speed long-distance networks
Mahesh Balakrishnan
Reliable Communication for Datacenters
0.7
1
Maelstrom Ricochet Conclusion
Evaluation: Split Mode and Buffering
Claim: Maelstrom eliminates the need for large end-host buffers
600
throughput 0.1% loss
Throughput (Mbits/sec)
500
400
300
200
100
0
client-buf M-ALL M-FEC M-BUF
RTT 100ms
Mahesh Balakrishnan
tcp
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Evaluation: Split Mode and Buffering
Claim: Maelstrom eliminates the need for large end-host buffers
600
throughput 0.1% loss
Throughput (Mbits/sec)
500
←
400
300
Maelstrom FEC
200
100
0
client-buf M-ALL M-FEC M-BUF
RTT 100ms
Mahesh Balakrishnan
tcp
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Evaluation: Split Mode and Buffering
Claim: Maelstrom eliminates the need for large end-host buffers
600
throughput 0.1% loss
Throughput (Mbits/sec)
500
←
400
300
Maelstrom FEC
200
100
0
↑
client-buf M-ALL M-FEC M-BUF
RTT 100ms
tcp
Maelstrom FEC + Hand-Tuned Buffers
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Evaluation: Split Mode and Buffering
Claim: Maelstrom eliminates the need for large end-host buffers
Maelstrom FEC + Split Mode
↓
600
throughput 0.1% loss
Throughput (Mbits/sec)
500
←
400
300
Maelstrom FEC
200
100
0
↑
client-buf M-ALL M-FEC M-BUF
RTT 100ms
tcp
Maelstrom FEC + Hand-Tuned Buffers
Mahesh Balakrishnan
Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Evaluation: FEC mode and loss
Throughput (Mbps)
Claim: Maelstrom works at high loss rates
500
450
400
350
300
250
200
150
100
50
0
Maelstrom
TCP
0.1
1
Packet Loss Rate %
10
Link RTT = 100ms
Mahesh Balakrishnan
Reliable Communication for Datacenters
Related documents