CONGA: Distributed Congestion-Aware Load Balancing
for Datacenters
Mohammad Alizadeh, Tom Edsall, Sarang Dharmapurikar, Ramanan Vaidyanathan, Kevin Chu,
Andy Fingerhut, Vinh The Lam (Google), Francis Matus, Rong Pan, Navindra Yadav,
George Varghese (Microsoft)
Cisco Systems
SIGCOMM 2014
Pengcheng Zhou
July 12, 2016
Outline
1. Background
2. Prior Work
3. The Design and Implementation of CONGA
4. Evaluation
5. Conclusion
6. Personal Thoughts
Background
Large bisection bandwidth in datacenters
• Bisection bandwidth: in computer networking, if the nodes of a network are
bisected into two partitions, the bisection bandwidth is the bandwidth
available between the two partitions. Generally, this refers to the
worst-case bisection, provided the two partitions contain the same number of
nodes.
• Datacenter networks must provide large bisection bandwidth to support an ever
increasing array of applications, ranging from financial services to big-data
analytics.
Large bisection bandwidth in datacenters
• Seminal papers such as VL2 [18] and Portland [1] showed how to achieve this
with Clos topologies, Equal Cost MultiPath (ECMP) load balancing, and the
decoupling of endpoint addresses from their location.
• These design principles are followed by next generation overlay technologies
that accomplish the same goals using standard encapsulations such as VXLAN
[35] and NVGRE [45].
Equal Cost MultiPath (ECMP)
ECMP can balance load poorly because:
• ECMP randomly hashes flows to paths, so hash collisions can cause significant
imbalance when there are a few large flows.
• More importantly, ECMP makes a purely local decision to split traffic among
equal-cost paths, with no knowledge of potential downstream congestion on
each path.
Thus ECMP fares poorly under the asymmetry caused by link failures, which occur
frequently and are disruptive in datacenters.
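
To make the hashing pitfall concrete, here is a toy Python sketch of ECMP path
selection (the CRC hash and the 5-tuple representation are illustrative
assumptions; real switches hash header fields in hardware):

import zlib

def ecmp_pick_path(flow_5tuple, equal_cost_paths):
    # Hash the flow's 5-tuple and pick a path by taking the hash modulo
    # the number of equal-cost paths. Every packet of a flow maps to the
    # same path, so two colliding elephant flows share one uplink while
    # others sit idle, and the choice never looks at downstream congestion.
    h = zlib.crc32(repr(flow_5tuple).encode())
    return equal_cost_paths[h % len(equal_cost_paths)]

# Two large flows may collide on the same uplink purely by chance.
paths = ["uplink0", "uplink1", "uplink2", "uplink3"]
flow_a = ("10.0.0.1", "10.0.1.1", 6, 34567, 80)  # src, dst, proto, sport, dport
flow_b = ("10.0.0.2", "10.0.1.2", 6, 45678, 80)
print(ecmp_pick_path(flow_a, paths), ecmp_pick_path(flow_b, paths))
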
Prior Work
Prior work on addressing ECMP’s shortcomings
Prior work divides into three groups:
• Centralized scheduling (e.g., Hedera [2])
• Local switch mechanisms (e.g., Flare [27])
• Host-based transport protocols (e.g., MPTCP [41])
Drawbacks of prior work
• Centralized schemes are too slow for the traffic volatility in datacenters.
• Local congestion-aware mechanisms are suboptimal and can perform even
worse than ECMP under asymmetry.
• Host-based methods such as MPTCP are challenging to deploy because
network operators often do not control the end-host stack.
This raises three questions:
• Can load balancing be done in the network without adding to the complexity of
the transport layer?
• Can such a network-based approach compute globally optimal allocations, and
yet be implementable in a realizable and distributed fashion to allow rapid
reaction in microseconds?
• Can such a mechanism be deployed today using standard encapsulation formats?
The Design and Implementation of CONGA
Design space for load balancing
Why Distributed Load Balancing?
• First, datacenter traffic is very bursty and unpredictable.
CONGA reacts to congestion at RTT timescales (~100 μs), making it more adept at
handling high volatility than a centralized scheduler.
• Second, datacenters use very regular topologies. Experiments and analysis show that
distributed decisions are close to optimal in such regular topologies.
A centralized approach is appropriate for WANs where traffic is stable and predictable and
the topology is arbitrary.
Why In-Network Load Balancing?
• While MPTCP is effective for load balancing, its use of multiple sub-flows actually
increases congestion at the edge of the network and degrades performance in Incast
scenarios.
• Datacenter fabric load balancing is too specific and architecturally unappealing to be
implemented in the transport stack which already needs to balance multiple important
requirements (e.g., high throughput, low latency, and burst tolerance).
Why Global Congestion Awareness?
Handling asymmetry essentially requires non-local knowledge about downstream congestion at the
switches.
Why Leaf-to-Leaf Feedback?
The overlay network provides the ideal conduit for CONGA’s leaf-to-leaf
congestion feedback mechanism. CONGA leverages two key properties of the overlay:
• The source leaf knows the ultimate destination leaf for each
packet, in contrast to standard IP forwarding where the
switches only know the next-hops.
• The encapsulated packet has an overlay header which can be
used to carry congestion metrics between the leaf switches.
Why Flowlet Switching for Datacenters?
Flowlets are bursts of packets from a flow that are separated by large enough gaps.
A plot in the paper shows that balancing flowlets gives significantly more
fine-grained control than balancing flows.
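
A minimal sketch of gap-based flowlet detection (the 500 µs gap threshold is an
assumed value; CONGA implements this with a hardware flowlet table and a
configurable inactivity timeout):

import time

FLOWLET_GAP = 500e-6  # assumed inter-packet gap (seconds) that starts a new flowlet

class FlowletDetector:
    def __init__(self):
        self.last_seen = {}    # flow -> timestamp of its last packet
        self.flowlet_id = {}   # flow -> current flowlet number

    def flowlet_for(self, flow, now=None):
        # A packet whose gap from the previous packet of the same flow
        # exceeds FLOWLET_GAP starts a new flowlet, which may safely be
        # sent on a different path without reordering within the flow.
        now = time.monotonic() if now is None else now
        if now - self.last_seen.get(flow, float("-inf")) > FLOWLET_GAP:
            self.flowlet_id[flow] = self.flowlet_id.get(flow, -1) + 1
        self.last_seen[flow] = now
        return self.flowlet_id[flow]
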
CONGA system diagram
Packet format:
• LBTag (4 bits): Partially identifies the packet’s path. It
is set by the source leaf to the (switch-local) port
number of the uplink the packet is sent on and is used by
the destination leaf to aggregate congestion metrics
before they are fed back to the source.
• CE (3 bits): This field is used by switches along the
packet’s path to convey the extent of congestion.
• FB_LBTag (4 bits) and FB_Metric (3 bits): These two
fields are used by destination leaves to piggyback
congestion information back to the source leaves.
FB_LBTag indicates the LBTag the feedback is for and
FB_Metric provides its associated congestion metric.
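
The four fields can be summarized in a small sketch (a paraphrase of the slide,
not the actual bit packing inside the VXLAN header; update_ce is a hypothetical
helper):

from dataclasses import dataclass

@dataclass
class CongaOverlayFields:
    lb_tag: int     # 4 bits: source-leaf uplink port, partially identifies the path
    ce: int         # 3 bits: extent of congestion seen along the path so far
    fb_lbtag: int   # 4 bits: which LBTag the piggybacked feedback is for
    fb_metric: int  # 3 bits: congestion metric fed back for that LBTag

def update_ce(fields: CongaOverlayFields, local_link_metric: int) -> None:
    # Each switch on the path raises CE to the worst congestion it has
    # seen, so the destination leaf receives the path's bottleneck metric.
    fields.ce = max(fields.ce, local_link_metric)
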
Discounting Rate Estimator (DRE)
CONGA measures congestion using the Discounting Rate Estimator (DRE), a simple module present
at each fabric link.
The DRE maintains a register X, which is incremented by the packet size in bytes
for every packet sent over the link, and is decremented periodically (every
T_dre) with a multiplicative factor between 0 and 1: X ← X · (1 − α).
If the traffic rate is R, then X ≈ R · τ, where τ = T_dre / α.
The congestion metric for the link is obtained by quantizing X / (C · τ) to
3 bits, where C is the link speed.
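
The update rules transcribe directly into Python (T_dre = 200 µs and α = 0.1
are illustrative values; the paper treats both as parameters):

class DRE:
    def __init__(self, link_speed_bps, t_dre=200e-6, alpha=0.1):
        self.x = 0.0                   # the register X, in bytes
        self.c = link_speed_bps / 8.0  # link speed C, in bytes/sec
        self.alpha = alpha
        self.tau = t_dre / alpha       # tau = T_dre / alpha

    def on_packet(self, size_bytes):
        self.x += size_bytes           # increment X per packet sent

    def on_timer(self):
        self.x *= (1.0 - self.alpha)   # periodic multiplicative decay

    def congestion_metric(self):
        # Quantize X / (C * tau), i.e., the estimated link utilization,
        # to 3 bits (values 0..7).
        utilization = min(self.x / (self.c * self.tau), 1.0)
        return round(utilization * 7)
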
Load Balancing Decision Logic
For a new flowlet, we pick the uplink port that minimizes the maximum of the local metric (from the
local DREs) and the remote metric (from the Congestion-To-Leaf Table).
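
As a sketch (local_dre maps each uplink port to its DRE, and
congestion_to_leaf[dst_leaf][port] is the fed-back remote metric; ties here go
to the first minimal port, whereas the paper breaks ties randomly):

def pick_uplink(local_dre, congestion_to_leaf, dst_leaf):
    # Choose the uplink whose bottleneck -- the max of the local DRE
    # metric and the remote metric fed back from the destination leaf --
    # is smallest.
    def bottleneck(port):
        remote = congestion_to_leaf[dst_leaf].get(port, 0)
        return max(local_dre[port].congestion_metric(), remote)
    return min(local_dre, key=bottleneck)
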
Evaluation
Topologies and traffic distributions
Experiments results for baseline topology
Experiments results for topology with link failure
Load balancing efficiency
• Figure 12 shows the CDF of the throughput
imbalance across the 4 uplinks at Leaf 0 in the
baseline topology (without link failure) for both
workloads at the 60% load level.
• The throughput imbalance is defined as the maximum throughput (among the
4 uplinks) minus the minimum, divided by the average: (MAX − MIN) / AVG.
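
For concreteness, the metric in code (the throughput numbers are made up):

def throughput_imbalance(uplink_throughputs):
    # (MAX - MIN) / AVG; 0 means a perfectly even split across uplinks.
    mx, mn = max(uplink_throughputs), min(uplink_throughputs)
    avg = sum(uplink_throughputs) / len(uplink_throughputs)
    return (mx - mn) / avg

# e.g., throughput_imbalance([9.8, 10.1, 10.0, 10.1]) ≈ 0.03
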
Incast scenario
• The plots confirm that MPTCP significantly
degrades performance in Incast scenarios.
• CONGA+TCP achieves 2–8× better throughput than MPTCP in similar settings.
HDFS Benchmark
• For the baseline topology (without failures),
ECMP and CONGA have nearly identical
performance. MPTCP has some outlier trials with
much higher job completion times.
• For the asymmetric topology with link failure, the
job completion times for ECMP are nearly twice
as large as without the link failure. But CONGA
is robust; comparing Figures 14 (a) and (b) shows
that the link failure has almost no impact on the
job completion times with CONGA.
Results of Large-Scale Simulations
Conclusion
• Datacenter load balancing is best done in the network instead of the transport layer, and
requires global congestion-awareness to handle asymmetry.
• CONGA provides better performance than MPTCP without important drawbacks such as
complexity and rigidity at the transport layer and poor Incast performance.
• CONGA seamlessly handles asymmetries in the topology or network traffic.
• CONGA leverages an existing datacenter overlay to implement a leaf-to-leaf feedback
loop and can be deployed without any modifications to the TCP stack.
Personal Thoughts
1. Since the feedback is piggybacked on data packets, could a low feedback rate
itself cause congestion?
No. When making the load-balancing decision, CONGA considers both the local DRE
metrics and the congestion metrics in the Congestion-To-Leaf table.
2. Packets with the same LBTag may reach the destination over different paths,
and the feedback carries the congestion metric of the last packet received.
Could it then fail to reflect the true congestion?
No. In TCP transmission, the path does not change unless a link becomes
congested or fails.
3. According to the analysis of “Impact of Workload on Load Balancing” in
Section 6 of the paper, CONGA essentially works by breaking up large flows. But
flowlet-granularity scheduling yields flowlets of uneven size: some large
flowlets remain, and some small flows may be split unnecessarily. Is scheduling
at flowlet granularity a reasonable design?
Presto says no: traffic should be split into 64 KB flowcells.
He K., Rozner E., Agarwal K., et al. Presto: Edge-based Load Balancing for Fast
Datacenter Networks. ACM SIGCOMM Computer Communication Review, 2015,
45(5): 465-478.