Fastpass: A Centralized "Zero-Queue" Datacenter Network
Jonathan Perry, Amy Ousterhout, Hari Balakrishnan, Devavrat Shah (M.I.T. Computer Science & Artificial Intelligence Lab), Hans Fugal (Facebook)

1. Current Situation
• Current datacenter architectures
  1. distribute packet transmission decisions among the endpoints ("congestion control")
  2. distribute path selection among the network's switches ("routing")
• Advantage: fault tolerance and scalability
• Disadvantage: loss of control over packet delays and the paths taken

2. Fastpass
• Fastpass: a datacenter network architecture in which each packet's timing is controlled by a logically centralized arbiter, which also determines the packet's path.
• Objectives
  1. no queues and low latency
  2. high throughput
  3. fairness and quick convergence

3. Fastpass Components
1. Timeslot allocation algorithm: runs at the arbiter to determine when each endpoint's packets should be sent.
2. Path assignment algorithm: runs at the arbiter to assign a path to each packet.
3. Fastpass Control Protocol (FCP): a reliable control protocol between endpoints and the arbiter, together with a replication strategy that lets the central arbiter handle network and arbiter failures.

4. Fastpass Architecture
• In Fastpass, a logically centralized arbiter controls all network transfers.
• The arbiter processes each request, performing two functions (fig. 2):
  1. timeslot allocation
  2. path selection

5. Timeslot Allocation
• Goal: choose a matching of endpoints in each timeslot.
• The network is organized in tiers:
  1. top-of-rack (ToR) switches: connect servers in a rack
  2. aggregation switches: connect racks into clusters
  3. core routers: connect clusters
• The rearrangeably non-blocking (RNB) property allows the arbiter to perform timeslot allocation separately from path selection.

6. Timeslot Allocation: Allocation Algorithm
• How fast must allocation be?
  1. in heavily loaded networks
  2. in lightly loaded networks
• Algorithm: the arbiter greedily allocates a source-destination pair if allocating the pair does not violate bandwidth constraints.
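The greedy per-timeslot allocation just described can be sketched in a few lines. This is a minimal illustration, not the arbiter's actual implementation (which is pipelined across cores and uses optimized data structures); the function name and the set-based bookkeeping are assumptions.

```python
def allocate_timeslot(demands, src_free, dst_free):
    """Greedy maximal matching for one timeslot (illustrative sketch).

    demands:  list of (src, dst) endpoint pairs with pending data.
    src_free: set of sources that may still send this timeslot.
    dst_free: set of destinations that may still receive this timeslot.

    A pair is allocated only if both endpoints are still free, i.e.
    allocating it does not violate the per-endpoint bandwidth constraint.
    The free sets are consumed in place as capacity is used up.
    """
    allocated = []
    for src, dst in demands:
        if src in src_free and dst in dst_free:
            src_free.discard(src)
            dst_free.discard(dst)
            allocated.append((src, dst))
    return allocated

# Example: demand (0, 2) loses because source 0 is already busy,
# and (2, 1) loses because destination 1 is already busy.
demands = [(0, 1), (0, 2), (2, 1), (3, 4)]
print(allocate_timeslot(demands, {0, 1, 2, 3, 4}, {0, 1, 2, 3, 4}))
# → [(0, 1), (3, 4)]
```

After one pass over the demands, no unallocated demand can be added without violating the constraints, which is exactly the maximal-matching property claimed on the next slide.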
When the arbiter finishes processing all demands, it has a maximal matching: a matching in which none of the unallocated demands can be allocated while maintaining the bandwidth constraints.

7. Timeslot Allocation: A Pipelined Allocator
• a pipelined allocator (fig. 3)
• max-min fairness (fig. 4)
• the reordering problem
• reducing the overhead of processing demands

8. Path Selection
• Goals
  1. no link is assigned multiple packets in a single timeslot
  2. no queueing within the network
• Algorithm
  1. edge-coloring (fig. 5)
  2. fast edge-coloring

9. Handling Faults
1. failures of the Fastpass arbiter
2. failures of in-network components (nodes, switches, and links)
3. packet losses on the communication path between endpoints and the arbiter

10. Implementation
• Arbiter: uses Intel DPDK (Data Plane Development Kit), a framework that allows direct access to NIC queues from user space.
• Endpoints: need a Linux kernel module that queues packets while requests are sent to the arbiter.
• FCP: implemented as a Linux transport protocol.
• A Fastpass qdisc (queue discipline)
• obviates the need for TCP congestion control rather than modifying TCP
• greedily sends packets in a timeslot

11. Implementation: Multicore Arbiter
1. comm-cores: communicate with endpoints; each handles a subset of endpoints
2. alloc-cores: perform timeslot allocation, working concurrently using pipeline parallelism
3. pathsel-cores: assign paths
• Flow: comm-cores receive endpoint demands and pass them to alloc-cores. Once a timeslot is completely allocated, it is promptly passed to a pathsel-core. The assigned paths are handed back to comm-cores, which notify each endpoint of its allocations.

12. Evaluation: Goals
1. eliminate in-network queueing
2. achieve high throughput
3. support various inter-flow or inter-application resource allocation objectives in a real-world datacenter network
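Before the evaluation, it may help to make slide 8's edge-coloring concrete. In a bipartite graph with maximum degree d, edges can always be properly colored with d colors (König's theorem); each color then corresponds to a distinct path through the core, so no link carries two packets in one timeslot. The sketch below is the textbook alternating-path algorithm, not the paper's optimized "fast edge-coloring"; all names are illustrative.

```python
from collections import defaultdict

def edge_color(edges):
    """Properly color bipartite edges with max-degree colors.

    edges: list of (u, v) pairs; u's and v's come from disjoint sides
    (e.g. ToR uplinks and downlinks). Returns one color per edge such
    that no two edges sharing an endpoint get the same color.
    """
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    d = max(deg.values())          # d colors suffice (Konig's theorem)
    used = defaultdict(dict)       # node -> {color: edge index}
    color = [None] * len(edges)

    def free(node):                # smallest color not yet used at node
        return next(c for c in range(d) if c not in used[node])

    for i, (u, v) in enumerate(edges):
        a, b = free(u), free(v)
        if a != b:
            # color a may be taken at v: walk the a/b alternating path
            # from v, then swap colors along it to free color a at v
            path, node, c = [], v, a
            while c in used[node]:
                j = used[node][c]
                path.append(j)
                x, y = edges[j]
                node = y if x == node else x
                c = b if c == a else a
            for j in path:                         # drop old colors
                x, y = edges[j]
                used[x].pop(color[j], None)
                used[y].pop(color[j], None)
                color[j] = b if color[j] == a else a
            for j in path:                         # record swapped colors
                x, y = edges[j]
                used[x][color[j]] = j
                used[y][color[j]] = j
        color[i] = a
        used[u][a] = i
        used[v][a] = i
    return color

# Two senders, two receivers, full demand: two colors (paths) suffice.
demands = [("s0", "d0"), ("s0", "d1"), ("s1", "d0"), ("s1", "d1")]
print(edge_color(demands))
```

The alternating-path flip is what makes d colors enough; a purely greedy assignment can get stuck needing a (d+1)-th color.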
13. Evaluation: Experimental Environment
[topology diagram: servers → ToR switch → cluster switch, 10 Gbits/s links]
1. a single rack of 32 servers
2. each server connects to a ToR switch with a main 10 Gbits/s Ethernet network interface card (NIC) and a 1 Gbit/s NIC meant for out-of-band communication
3. each server has 2 Intel CPUs with 8 cores and 148 GB RAM
4. one server is set aside for running the arbiter

14. Evaluation: Experiment A (throughput and queueing)
Under a bulk transfer workload involving multiple machines, Fastpass reduces the median switch queue length from 4351 KB to 18 KB, with a 1.6% throughput penalty.

15. Evaluation: Experiment B (latency)
Interactivity: under the same workload, Fastpass's median ping time is 0.23 ms vs. the baseline's 3.56 ms, i.e. 15.5× lower with Fastpass.

16. Evaluation: Experiment C (fairness and convergence)
Fairness: Fastpass reduces the standard deviation of per-sender throughput over 1 s intervals by over 5200× for 5 connections.

17. Evaluation: Experiment D (request queueing) and Experiment E (communication overhead)
D: each comm-core supports 130 Gbits/s of network traffic with 1 µs of NIC queueing (fig. 10).
E: arbiter traffic imposes a 0.3% throughput overhead (fig. 11).

18. Evaluation: Experiment F (timeslot allocation cores) and Experiment G (path selection cores)
F: 8 alloc-cores support 2.2 Terabits/s of network traffic.
G: 10 pathsel-cores support > 5 Terabits/s of network traffic.

19. Evaluation: Production Experiments at Facebook
• Experiment H: production results

20. Discussion
• A core arbiter would have to handle a large volume of traffic, so allocating at MTU-size granularity would not be computationally feasible. Solution: use specialized hardware.
• Packets must follow the paths allocated by the arbiter, but routers typically support IP source routing only in software, rendering it too slow for practical use. Solution: ECMP spoofing.

21. Limitations
• The granularity of timeslots: if an endpoint has less than a full timeslot's worth of data to send, network bandwidth is left unused.
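To make the granularity limitation concrete: with MTU-sized timeslots on the testbed's 10 Gbits/s links, one timeslot lasts about 1.2 µs, and a small packet still consumes the whole slot. The arithmetic below is a back-of-the-envelope sketch; the 1500-byte MTU and the 100-byte example packet are assumed values, only the 10 Gbits/s line rate comes from the experimental setup.

```python
# Back-of-the-envelope for timeslot granularity (illustrative sketch).
LINE_RATE_BPS = 10e9   # 10 Gbits/s NIC, as in the testbed
MTU_BYTES = 1500       # assumed full-size Ethernet frame per timeslot

# Time to transmit one MTU = one timeslot.
timeslot_us = MTU_BYTES * 8 / LINE_RATE_BPS * 1e6
print(f"timeslot = {timeslot_us:.2f} us")   # → timeslot = 1.20 us

# A hypothetical 100-byte RPC still occupies a full timeslot,
# leaving most of the slot's bandwidth unused.
wasted_fraction = 1 - 100 / MTU_BYTES
print(f"wasted = {wasted_fraction:.0%}")    # → wasted = 93%
```

This is why the mitigation on the next slide (splitting some timeslots into smaller fragments) recovers bandwidth for small packets.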
• A possible mitigation: the arbiter divides some timeslots into smaller fragments and allocates these fragments to the smaller packets.

22. Conclusion
• Fastpass achieves
  1. high throughput with nearly zero queues
  2. consistent (fair) throughput allocation and fast convergence
  3. scalability
  4. fine-grained timing
  5. reduced retransmissions
• Practicality
  1. reduces the tail of the packet-delay distribution
  2. avoids false congestion
  3. eliminates incast
  4. shares the network better across heterogeneous workloads with different performance objectives