CONGA: Distributed Congestion-Aware Load Balancing for Datacenters
Mohammad Alizadeh, Tom Edsall, Sarang Dharmapurikar, Ramanan Vaidyanathan, Kevin Chu, Andy Fingerhut, Vinh The Lam (Google), Francis Matus, Rong Pan, Navindra Yadav, George Varghese (Microsoft)
Cisco Systems, SIGCOMM 2014
Presented by Pengcheng Zhou, July 12, 2016

Outline
1. Background
2. Prior Work
3. The Design and Implementation of CONGA
4. Evaluation
5. Conclusion
6. Personal Thoughts

Background

Large bisection bandwidth in datacenters
• Bisection bandwidth: in computer networking, if the nodes of a network are bisected into two partitions, the bisection bandwidth is the bandwidth available between the two partitions. Generally, this refers to the worst-case bisection, so long as the two partitions have the same number of nodes.
• Datacenter networks must provide large bisection bandwidth to support an ever-increasing array of applications, ranging from financial services to big-data analytics.
• Seminal papers such as VL2 [18] and PortLand [1] showed how to achieve this with Clos topologies, Equal-Cost MultiPath (ECMP) load balancing, and the decoupling of endpoint addresses from their locations.
• These design principles are followed by next-generation overlay technologies that accomplish the same goals using standard encapsulations such as VXLAN [35] and NVGRE [45].

Equal-Cost MultiPath (ECMP)
ECMP can balance load poorly, because:
• ECMP randomly hashes flows to paths; hash collisions can cause significant imbalance if there are a few large flows.
• More importantly, ECMP uses a purely local decision to split traffic among equal-cost paths, without knowledge of potential downstream congestion on each path.
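The hash-collision problem can be seen in a toy Python sketch (illustrative only: CRC32 stands in for a real switch's hash function, and the flow 5-tuples are made up):

```python
import zlib

NUM_PATHS = 4

def ecmp_path(five_tuple, num_paths=NUM_PATHS):
    # ECMP hashes the flow's 5-tuple once; every packet of the flow
    # then follows the same path, regardless of downstream congestion.
    return zlib.crc32(repr(five_tuple).encode()) % num_paths

# Eight long-lived "elephant" flows between two leaves.
flows = [("10.0.0.%d" % i, "10.0.1.1", "tcp", 40000 + i, 80)
         for i in range(8)]

load = [0] * NUM_PATHS
for ft in flows:
    load[ecmp_path(ft)] += 1

# With 8 flows and 4 paths, collisions are guaranteed (pigeonhole):
# some uplink carries at least 2 elephants while another may be idle.
print("flows per path:", load)
```

Because the placement is oblivious to load, a path that happens to receive several elephants stays overloaded for the lifetime of those flows.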
Thus, ECMP fares poorly with the asymmetry caused by link failures, which occur frequently and are disruptive in datacenters.

Prior Work

Prior work on addressing ECMP's shortcomings falls into three groups:
• Centralized scheduling (e.g., Hedera [2])
• Local switch mechanisms (e.g., Flare [27])
• Host-based transport protocols (e.g., MPTCP [41])

Drawbacks of prior work
• Centralized schemes are too slow for the traffic volatility in datacenters.
• Local congestion-aware mechanisms are suboptimal and can perform even worse than ECMP under asymmetry.
• Host-based methods such as MPTCP are challenging to deploy because network operators often do not control the end-host stack.

So the questions are:
• Can load balancing be done in the network without adding to the complexity of the transport layer?
• Can such a network-based approach compute globally optimal allocations, and yet be implementable in a realizable and distributed fashion that allows rapid reaction in microseconds?
• Can such a mechanism be deployed today using standard encapsulation formats?

The Design and Implementation of CONGA

Design space for load balancing

Why Distributed Load Balancing?
• First, datacenter traffic is very bursty and unpredictable. CONGA reacts to congestion at RTT timescales (~100 μs), making it more adept at handling high volatility than a centralized scheduler.
• Second, datacenters use very regular topologies. Experiments and analysis show that distributed decisions are close to optimal in such regular topologies. A centralized approach is appropriate for WANs, where traffic is stable and predictable and the topology is arbitrary.

Why In-Network Load Balancing?
• While MPTCP is effective for load balancing, its use of multiple sub-flows actually increases congestion at the edge of the network and degrades performance in Incast scenarios.
• Datacenter fabric load balancing is too specific and architecturally unappealing to be implemented in the transport stack, which already needs to balance multiple important requirements (e.g., high throughput, low latency, and burst tolerance).

Why Global Congestion Awareness?
Handling asymmetry essentially requires non-local knowledge about downstream congestion at the switches.

Why Leaf-to-Leaf Feedback?
The overlay network provides the ideal conduit for CONGA's leaf-to-leaf congestion feedback mechanism. CONGA leverages two key properties of the overlay:
• The source leaf knows the ultimate destination leaf for each packet, in contrast to standard IP forwarding, where the switches only know the next hops.
• The encapsulated packet has an overlay header that can be used to carry congestion metrics between the leaf switches.

Why Flowlet Switching for Datacenters?
Flowlets are bursts of packets from a flow that are separated by large enough gaps. The plot shows that balancing flowlets gives significantly more fine-grained control than balancing flows.

CONGA system diagram

Packet format:
• LBTag (4 bits): Partially identifies the packet's path. It is set by the source leaf to the (switch-local) port number of the uplink the packet is sent on, and is used by the destination leaf to aggregate congestion metrics before they are fed back to the source.
• CE (3 bits): Used by switches along the packet's path to convey the extent of congestion.
• FB_LBTag (4 bits) and FB_Metric (3 bits): Used by destination leaves to piggyback congestion information back to the source leaves. FB_LBTag indicates the LBTag the feedback is for, and FB_Metric provides its associated congestion metric.

Discounting Rate Estimator (DRE)
CONGA measures congestion using the Discounting Rate Estimator (DRE), a simple module present at each fabric link.
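As a concrete illustration, the DRE can be sketched in a few lines of Python. The parameter values here (T_dre = 100 μs, α = 0.1) are assumptions for the example; the real DRE is a hardware module in the switch, not software:

```python
# Software sketch of a Discounting Rate Estimator (DRE).
class DRE:
    def __init__(self, link_speed_bps, t_dre=100e-6, alpha=0.1):
        self.X = 0.0                               # byte counter register
        self.alpha = alpha                         # multiplicative decay factor
        tau = t_dre / alpha                        # X ~= R * tau at traffic rate R
        self.cap_bytes = link_speed_bps / 8 * tau  # C * tau, in bytes

    def on_packet(self, size_bytes):
        self.X += size_bytes                       # increment for each packet sent

    def on_timer(self):
        self.X *= (1 - self.alpha)                 # periodic decay, every T_dre

    def congestion_metric(self):
        # Quantize X / (C * tau) to 3 bits (0..7).
        return min(7, int(8 * self.X / self.cap_bytes))

# Example: a 10 Gbps link carrying roughly 5 Gbps of traffic.
dre = DRE(link_speed_bps=10e9)
bytes_per_period = 5e9 / 8 * 100e-6   # bytes sent in one T_dre at 5 Gbps
for _ in range(1000):                  # run to steady state
    dre.on_packet(bytes_per_period)
    dre.on_timer()
print("metric:", dre.congestion_metric())   # ~50% utilization -> metric 3-4
```

The decaying counter tracks the recent sending rate without per-flow state, which is what makes it cheap enough to place on every fabric link.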
DRE maintains a register, X, which is incremented by the packet size in bytes for each packet sent over the link, and is decremented periodically (every T_dre) with a multiplicative factor α between 0 and 1: X ← X × (1 − α). If the traffic rate is R, then X ≈ R × τ, where τ = T_dre/α. The congestion metric for the link is obtained by quantizing X/(C × τ) to 3 bits (C is the link speed).

Load Balancing Decision Logic
For a new flowlet, the source leaf picks the uplink port that minimizes the maximum of the local metric (from the local DREs) and the remote metric (from the Congestion-To-Leaf table).

Evaluation

Topologies and traffic distributions

Experimental results for the baseline topology

Experimental results for the topology with a link failure

Load balancing efficiency
• Figure 12 shows the CDF of the throughput imbalance across the 4 uplinks at Leaf 0 in the baseline topology (without link failure), for both workloads at the 60% load level.
• The throughput imbalance is defined as the maximum throughput (among the 4 uplinks) minus the minimum, divided by the average: (MAX − MIN)/AVG.

Incast scenario
• The plots confirm that MPTCP significantly degrades performance in Incast scenarios.
• CONGA+TCP achieves 2–8× better throughput than MPTCP in similar settings.

HDFS benchmark
• For the baseline topology (without failures), ECMP and CONGA have nearly identical performance. MPTCP has some outlier trials with much higher job completion times.
• For the asymmetric topology with a link failure, the job completion times for ECMP are nearly twice as large as without the link failure. But CONGA is robust: comparing Figures 14(a) and (b) shows that the link failure has almost no impact on job completion times with CONGA.

Results of Large-Scale Simulations

Conclusion
• Datacenter load balancing is best done in the network instead of the transport layer, and requires global congestion awareness to handle asymmetry.
• CONGA provides better performance than MPTCP, without MPTCP's important drawbacks: complexity and rigidity at the transport layer, and poor Incast performance.
• CONGA seamlessly handles asymmetries in the topology or network traffic.
• CONGA leverages an existing datacenter overlay to implement a leaf-to-leaf feedback loop and can be deployed without any modifications to the TCP stack.

Personal Thoughts
1. Since the feedback is piggybacked, could a low feedback rate lead to congestion?
No. When making a load-balancing decision, CONGA considers both the local DRE metrics and the congestion metrics in the Congestion-To-Leaf table.
2. Packets with the same LBTag may reach the destination over different paths, yet the feedback carries the congestion metric of the last packet received. Could the feedback then fail to reflect the true congestion state?
No. In TCP transmission, a flow does not change paths unless a link becomes congested or fails.
3. According to the analysis of "Impact of Workload on Load Balancing" in Section 6 of the paper, CONGA essentially works by breaking up large flows. But flowlet-granularity scheduling produces flowlets of uneven size: some flowlets are still large, and some small flows may be split unnecessarily. Is flowlet granularity the right choice?
Presto says no: flows should be split into 64 KB flowcells.
He K, Rozner E, Agarwal K, et al. Presto: Edge-based Load Balancing for Fast Datacenter Networks. ACM SIGCOMM Computer Communication Review, 2015, 45(5): 465-478.