Fastpass: A Centralized "Zero-Queue" Datacenter Network
Jonathan Perry, Amy Ousterhout, Hari Balakrishnan, Devavrat Shah (M.I.T. Computer Science & Artificial Intelligence Lab), Hans Fugal (Facebook)

1. Current Situation
• Current datacenter architectures
  1. distribute packet transmission decisions among the endpoints ("congestion control")
  2. distribute path selection among the network's switches ("routing")
• Advantage: fault tolerance and scalability
• Disadvantage: loss of control over packet delays and the paths taken

2. Fastpass
• Fastpass: a datacenter network architecture in which each packet's timing is controlled by a logically centralized arbiter, which also determines the packet's path.
• Objectives
  1. no queues and low latency
  2. high throughput
  3. fairness and quick convergence

3. Fastpass Components
1. Timeslot allocation algorithm: runs at the arbiter to determine when each endpoint's packets should be sent.
2. Path assignment algorithm: runs at the arbiter to assign a path to each packet.
3. Fastpass Control Protocol (FCP): a reliable control protocol between endpoints and the arbiter, together with a replication strategy that lets the central arbiter handle network and arbiter failures.

4. Fastpass Architecture
• In Fastpass, a logically centralized arbiter controls all network transfers.
• The arbiter processes each request, performing two functions (fig. 2):
  1. timeslot allocation
  2. path selection

5. Timeslot Allocation
• Goal: choose a matching of endpoints in each timeslot.
• The network is organized in tiers:
  1. top-of-rack (ToR) switches: connect servers in a rack
  2. aggregation switches: connect racks into clusters
  3. core routers: connect clusters
• The rearrangeably non-blocking (RNB) property allows the arbiter to perform timeslot allocation separately from path selection.

6. Timeslot Allocation: Allocation Algorithm
• How fast must allocation be?
  1. in heavily loaded networks
  2. in lightly loaded networks
• Algorithm: the arbiter greedily allocates a source-destination pair if allocating the pair does not violate bandwidth constraints.
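The greedy per-timeslot allocation just described can be sketched in a few lines. This is a minimal illustration, not the arbiter's actual implementation (which is pipelined across cores and uses optimized data structures); the function name and the set-based bookkeeping are assumptions.

```python
def allocate_timeslot(demands, src_free, dst_free):
    """Greedy maximal matching for one timeslot (illustrative sketch).

    demands:  list of (src, dst) endpoint pairs with pending data.
    src_free: set of sources that may still send this timeslot.
    dst_free: set of destinations that may still receive this timeslot.

    A pair is allocated only if both endpoints are still free, i.e.
    allocating it does not violate the per-endpoint bandwidth constraint.
    The free sets are consumed in place as capacity is used up.
    """
    allocated = []
    for src, dst in demands:
        if src in src_free and dst in dst_free:
            src_free.discard(src)
            dst_free.discard(dst)
            allocated.append((src, dst))
    return allocated

# Example: demand (0, 2) loses because source 0 is already busy,
# and (2, 1) loses because destination 1 is already busy.
demands = [(0, 1), (0, 2), (2, 1), (3, 4)]
print(allocate_timeslot(demands, {0, 1, 2, 3, 4}, {0, 1, 2, 3, 4}))
# → [(0, 1), (3, 4)]
```

After one pass over the demands, no unallocated demand can be added without violating the constraints, which is exactly the maximal-matching property claimed on the next slide.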
When the arbiter finishes processing all demands, it has a maximal matching: a matching in which none of the unallocated demands can be allocated while maintaining the bandwidth constraints.

7. Timeslot Allocation: A Pipelined Allocator
• a pipelined allocator (fig. 3)
• max-min fairness (fig. 4)
• the reordering problem
• reducing the overhead of processing demands

8. Path Selection
• Goals
  1. no link is assigned multiple packets in a single timeslot
  2. no queueing within the network
• Algorithm
  1. edge-coloring (fig. 5)
  2. fast edge-coloring

9. Handling Faults
1. failures of the Fastpass arbiter
2. failures of in-network components (nodes, switches, and links)
3. packet losses on the communication path between endpoints and the arbiter

10. Implementation
• Arbiter: uses Intel DPDK (Data Plane Development Kit), a framework that allows direct access to NIC queues from user space.
• Endpoints: need a Linux kernel module that queues packets while requests are sent to the arbiter.
• FCP: implemented as a Linux transport protocol.
• A Fastpass qdisc (queue discipline)
• obviates the need for TCP congestion control rather than modifying TCP
• greedily sends packets in a timeslot

11. Implementation: Multicore Arbiter
1. comm-cores: communicate with endpoints; each handles a subset of endpoints
2. alloc-cores: perform timeslot allocation, working concurrently using pipeline parallelism
3. pathsel-cores: assign paths
• Flow: comm-cores receive endpoint demands and pass them to alloc-cores. Once a timeslot is completely allocated, it is promptly passed to a pathsel-core. The assigned paths are handed back to comm-cores, which notify each endpoint of its allocations.

12. Evaluation: Goals
1. eliminate in-network queueing
2. achieve high throughput
3. support various inter-flow or inter-application resource allocation objectives in a real-world datacenter network
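Before the evaluation, it may help to make slide 8's edge-coloring concrete. In a bipartite graph with maximum degree d, edges can always be properly colored with d colors (König's theorem); each color then corresponds to a distinct path through the core, so no link carries two packets in one timeslot. The sketch below is the textbook alternating-path algorithm, not the paper's optimized "fast edge-coloring"; all names are illustrative.

```python
from collections import defaultdict

def edge_color(edges):
    """Properly color bipartite edges with max-degree colors.

    edges: list of (u, v) pairs; u's and v's come from disjoint sides
    (e.g. ToR uplinks and downlinks). Returns one color per edge such
    that no two edges sharing an endpoint get the same color.
    """
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    d = max(deg.values())          # d colors suffice (Konig's theorem)
    used = defaultdict(dict)       # node -> {color: edge index}
    color = [None] * len(edges)

    def free(node):                # smallest color not yet used at node
        return next(c for c in range(d) if c not in used[node])

    for i, (u, v) in enumerate(edges):
        a, b = free(u), free(v)
        if a != b:
            # color a may be taken at v: walk the a/b alternating path
            # from v, then swap colors along it to free color a at v
            path, node, c = [], v, a
            while c in used[node]:
                j = used[node][c]
                path.append(j)
                x, y = edges[j]
                node = y if x == node else x
                c = b if c == a else a
            for j in path:                         # drop old colors
                x, y = edges[j]
                used[x].pop(color[j], None)
                used[y].pop(color[j], None)
                color[j] = b if color[j] == a else a
            for j in path:                         # record swapped colors
                x, y = edges[j]
                used[x][color[j]] = j
                used[y][color[j]] = j
        color[i] = a
        used[u][a] = i
        used[v][a] = i
    return color

# Two senders, two receivers, full demand: two colors (paths) suffice.
demands = [("s0", "d0"), ("s0", "d1"), ("s1", "d0"), ("s1", "d1")]
print(edge_color(demands))
```

The alternating-path flip is what makes d colors enough; a purely greedy assignment can get stuck needing a (d+1)-th color.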
13. Evaluation: Experimental Environment
[topology diagram: servers → ToR switch → cluster switch, 10 Gbits/s links]
1. a single rack of 32 servers
2. each server connects to a ToR switch with a main 10 Gbits/s Ethernet network interface card (NIC) and a 1 Gbit/s NIC meant for out-of-band communication
3. each server has 2 Intel CPUs with 8 cores and 148 GB RAM
4. one server is set aside for running the arbiter

14. Evaluation: Experiment A (throughput and queueing)
Under a bulk transfer workload involving multiple machines, Fastpass reduces the median switch queue length from 4351 KB to 18 KB, with a 1.6% throughput penalty.

15. Evaluation: Experiment B (latency)
Interactivity: under the same workload, Fastpass's median ping time is 0.23 ms vs. the baseline's 3.56 ms, i.e. 15.5× lower with Fastpass.

16. Evaluation: Experiment C (fairness and convergence)
Fairness: Fastpass reduces the standard deviation of per-sender throughput over 1 s intervals by over 5200× for 5 connections.

17. Evaluation: Experiment D (request queueing) and Experiment E (communication overhead)
D: each comm-core supports 130 Gbits/s of network traffic with 1 µs of NIC queueing (fig. 10).
E: arbiter traffic imposes a 0.3% throughput overhead (fig. 11).

18. Evaluation: Experiment F (timeslot allocation cores) and Experiment G (path selection cores)
F: 8 alloc-cores support 2.2 Terabits/s of network traffic.
G: 10 pathsel-cores support > 5 Terabits/s of network traffic.

19. Evaluation: Production Experiments at Facebook
• Experiment H: production results

20. Discussion
• A core arbiter would have to handle a large volume of traffic, so allocating at MTU-size granularity would not be computationally feasible. Solution: use specialized hardware.
• Packets must follow the paths allocated by the arbiter, but routers typically support IP source routing only in software, rendering it too slow for practical use. Solution: ECMP spoofing.

21. Limitations
• The granularity of timeslots: if an endpoint has less than a full timeslot's worth of data to send, network bandwidth is left unused.
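To make the granularity limitation concrete: with MTU-sized timeslots on the testbed's 10 Gbits/s links, one timeslot lasts about 1.2 µs, and a small packet still consumes the whole slot. The arithmetic below is a back-of-the-envelope sketch; the 1500-byte MTU and the 100-byte example packet are assumed values, only the 10 Gbits/s line rate comes from the experimental setup.

```python
# Back-of-the-envelope for timeslot granularity (illustrative sketch).
LINE_RATE_BPS = 10e9   # 10 Gbits/s NIC, as in the testbed
MTU_BYTES = 1500       # assumed full-size Ethernet frame per timeslot

# Time to transmit one MTU = one timeslot.
timeslot_us = MTU_BYTES * 8 / LINE_RATE_BPS * 1e6
print(f"timeslot = {timeslot_us:.2f} us")   # → timeslot = 1.20 us

# A hypothetical 100-byte RPC still occupies a full timeslot,
# leaving most of the slot's bandwidth unused.
wasted_fraction = 1 - 100 / MTU_BYTES
print(f"wasted = {wasted_fraction:.0%}")    # → wasted = 93%
```

This is why the mitigation on the next slide (splitting some timeslots into smaller fragments) recovers bandwidth for small packets.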
• A possible mitigation: the arbiter divides some timeslots into smaller fragments and allocates these fragments to the smaller packets.

22. Conclusion
• Fastpass achieves
  1. high throughput with nearly zero queues
  2. consistent (fair) throughput allocation and fast convergence
  3. scalability
  4. fine-grained timing
  5. reduced retransmissions
• Practicality
  1. reduces the tail of the packet-delay distribution
  2. avoids false congestion
  3. eliminates incast
  4. shares the network better across heterogeneous workloads with different performance objectives