Performance Engineering: E2EpiPEs and FastTCP
Internet2 member meeting - Indianapolis
World Telecom 2003 - Geneva
October 15, 2003
[email protected]

Agenda

High TCP performance over wide area networks:
- TCP at Gbps speed
- MTU bias
- RTT bias
- TCP fairness
- How to use 100% of the link capacity with TCP Reno
- Impact of network buffers
- New Internet2 Land Speed Record

Single TCP stream performance under periodic losses

[Figure: Bandwidth utilization (%) vs. packet loss frequency (%) for a WAN path (RTT = 120 ms) and a LAN path (RTT = 0.04 ms), with 1 Gbps of available bandwidth. At a loss rate of 0.01%, the LAN still reaches 99% utilization while the WAN reaches only 1.2%.]

- TCP throughput is much more sensitive to packet loss in WANs than in LANs.
- TCP's congestion control algorithm (AIMD) is not suited to gigabit networks: its feedback mechanism is poor and limited, and the effect of packet loss is disastrous.
- TCP is inefficient in networks with a high bandwidth-delay product.
- The future performance of computational grids looks bad if we continue to rely on the widely deployed TCP Reno.

Responsiveness (I)

The responsiveness r measures how quickly we go back to using the network link at full capacity after experiencing a loss, assuming that the congestion window size equals the bandwidth-delay product when the packet is lost:

    r = C * RTT^2 / (2 * MSS)

where C is the capacity of the link.

[Figure: TCP responsiveness - recovery time (s) as a function of RTT (ms) for C = 622 Mbit/s, 2.5 Gbit/s and 10 Gbit/s.]

Responsiveness (II)

Case                               | C        | RTT (ms)       | MSS (bytes)        | Responsiveness
Typical LAN today                  | 1 Gb/s   | 2 (worst case) | 1460               | 96 ms
WAN Geneva <-> Chicago             | 1 Gb/s   | 120            | 1460               | 10 min
WAN Geneva <-> Sunnyvale           | 1 Gb/s   | 180            | 1460               | 23 min
WAN Geneva <-> Tokyo               | 1 Gb/s   | 300            | 1460               | 1 h 04 min
WAN Geneva <-> Sunnyvale           | 2.5 Gb/s | 180            | 1460               | 58 min
Future WAN CERN <-> Starlight      | 10 Gb/s  | 120            | 1460               | 1 h 32 min
Future WAN link CERN <-> Starlight | 10 Gb/s  | 120            | 8960 (jumbo frame) | 15 min

The Linux 2.4.x kernel implements delayed acknowledgments, which multiply the responsiveness by two; the values above therefore have to be doubled.
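To make the formula concrete, here is a minimal Python sketch (added for this write-up, not part of the original slides) that evaluates r = C * RTT^2 / (2 * MSS) for the cases in the table:

```python
# Responsiveness r = C * RTT^2 / (2 * MSS): the time AIMD needs to
# re-fill the pipe after a single loss, assuming cwnd equaled the
# bandwidth-delay product at the moment of the loss.

def responsiveness(capacity_bps, rtt_s, mss_bytes):
    """Recovery time in seconds."""
    return capacity_bps * rtt_s ** 2 / (2 * mss_bytes * 8)

cases = [
    ("Typical LAN today",             1e9,   0.002, 1460),
    ("WAN Geneva <-> Chicago",        1e9,   0.120, 1460),
    ("WAN Geneva <-> Sunnyvale",      1e9,   0.180, 1460),
    ("WAN Geneva <-> Tokyo",          1e9,   0.300, 1460),
    ("WAN Geneva <-> Sunnyvale",      2.5e9, 0.180, 1460),
    ("Future WAN CERN <-> Starlight", 10e9,  0.120, 1460),
    ("Same future WAN, jumbo frames", 10e9,  0.120, 8960),
]

for name, c, rtt, mss in cases:
    r = responsiveness(c, rtt, mss)
    # Linux 2.4.x delayed ACKs roughly double the recovery time.
    print(f"{name:31s} r = {r:7.1f} s   (with delayed ACKs: ~{2 * r:.0f} s)")
```

The quadratic dependence on RTT is what hurts: doubling the RTT quadruples the recovery time, while a larger MSS only helps linearly.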
Single TCP stream

TCP connection between Geneva and Chicago: C = 1 Gbit/s; MSS = 1460 bytes; RTT = 120 ms.
- Time to increase the throughput from 100 Mbps to 900 Mbps: 35 minutes.
- Loss occurs when the bandwidth reaches the pipe size.
- 75% bandwidth utilization (assuming no buffering).
- While cwnd < BDP: throughput < bandwidth, the RTT stays constant, and throughput = cwnd / RTT.

Measurements with Different MTUs

TCP connection between Geneva and Chicago: C = 1 Gbit/s; RTT = 120 ms.
- In both cases: 75% link utilization.
- A large MTU accelerates the growth of the window.
- The time to recover from a packet loss decreases with a large MTU.
- A larger MTU reduces the per-frame overhead (saves CPU cycles, reduces the number of packets).

MTU and Fairness

[Diagram: two hosts at CERN (GVA), each attached at 1 GE through a GbE switch, send across a 2.5 Gbps POS link between routers to two hosts at Starlight (Chi); the 1 GE access links form the bottleneck.]

- Two TCP streams share a 1 Gbps bottleneck; RTT = 117 ms.
- MTU = 1500 bytes: average throughput over a period of 4000 s = 50 Mb/s.
- MTU = 9000 bytes: average throughput over a period of 4000 s = 698 Mb/s.
- A factor of 14!
- Connections with a large MTU increase their rate quickly and grab most of the available bandwidth.

RTT and Fairness

[Diagram: hosts at CERN (GVA), each attached at 1 GE through a GbE switch, reach Starlight (Chi) over a 2.5 Gb/s POS link and Sunnyvale over a further 10 Gb/s POS link (10GE); the 1 GE access links form the bottleneck.]

- Two TCP streams share a 1 Gbps bottleneck; MTU = 9000 bytes.
- CERN <-> Sunnyvale, RTT = 181 ms: average throughput over a period of 7000 s = 202 Mb/s.
- CERN <-> Starlight, RTT = 117 ms: average throughput over a period of 7000 s = 514 Mb/s.
- The connection with the smaller RTT increases its rate quickly and grabs most of the available bandwidth (a toy model of this effect appears at the end of this transcript).

[Figure: throughput (Mbps) over time (0-7000 s) of the two streams with RTT = 181 ms and RTT = 117 ms sharing the 1 Gbps bottleneck, together with the average over the life of each connection.]

How to use 100% of the bandwidth?

Single TCP stream GVA - CHI: MSS = 8960 bytes; throughput = 980 Mbps.
- When cwnd > BDP, throughput = bandwidth and the RTT increases.
- This requires an extremely large buffer at the bottleneck: network buffers have an important impact on performance.
- Do buffers have to be dimensioned to scale with the bandwidth-delay product?
- Why not use the end-to-end delay as a congestion indication?

Single stream TCP performance

Date   | From Geneva to | Size of transfer | Duration (s) | RTT (ms) | MTU (bytes) | IP version | Throughput | Record / award
Feb 27 | Sunnyvale      | 1.1 TByte        | 3700         | 180      | 9000        | IPv4       | 2.38 Gbps  | Internet2 LSR, CENIC award, Guinness World Record
May 27 | Tokyo          | 65.1 GByte       | 600          | 277      | 1500        | IPv4       | 931 Mbps   |
May 2  | Chicago        | 385 GByte        | 3600         | 120      | 1500        | IPv6       | 919 Mbps   |
May 2  | Chicago        | 412 GByte        | 3600         | 120      | 9000        | IPv6       | 983 Mbps   | Internet2 LSR

NEW submission (Oct 11): 5.65 Gbps from Geneva to Los Angeles across LHCnet, Starlight, Abilene and CENIC.

Early 10 Gb/s 10,000 km TCP Testing

- Monitoring of the Abilene traffic in LA.
- Single TCP stream at 5.65 Gbps: transferring a full CD in less than 1 s.
- Uncongested network; no packet loss during the transfer.
- Probably qualifies as a new Internet2 LSR.

Conclusion

- The future performance of computational grids looks bad if we continue to rely on the widely deployed TCP Reno.
- How should fairness be defined? It must take both the MTU and the RTT into account.
- Larger packet sizes (jumbograms: payloads larger than 64 KB)? Is the standard MTU the largest bottleneck? New Intel 10GE cards support an MTU of 16 KB, but J. Cain (Cisco) notes: "It's very difficult to build switches to switch large packets such as jumbograms."

Our vision of the network: "The network, once viewed as an obstacle for virtual collaborations and distributed computing in grids, can now start to be viewed as a catalyst instead. Grid nodes distributed around the world will simply become depots for dropping off information for computation or storage, and the network will become the fundamental fabric for tomorrow's computational grids and virtual supercomputers."
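As a rough cross-check of the RTT bias measured above, here is a small, self-contained simulation (a toy fluid model written for this write-up, not code from the talk) of two idealized AIMD flows with RTTs of 117 ms and 181 ms sharing a 1 Gbps bottleneck:

```python
# Toy fluid model of two AIMD (Reno-like) flows sharing one bottleneck.
# Assumptions (mine, not from the slides): losses are synchronized and
# occur exactly when the summed rate exceeds the link capacity; no
# queueing and no slow start are modeled.

MSS_BITS = 9000 * 8   # jumbo frames, as in the experiment
CAPACITY = 1e9        # 1 Gbps bottleneck
DT = 0.01             # simulation time step (s)
DURATION = 7000       # s, matching the measurement period

flows = [
    {"name": "CERN <-> Starlight", "rtt": 0.117, "cwnd": MSS_BITS, "bits": 0.0},
    {"name": "CERN <-> Sunnyvale", "rtt": 0.181, "cwnd": MSS_BITS, "bits": 0.0},
]

t = 0.0
while t < DURATION:
    rates = [f["cwnd"] / f["rtt"] for f in flows]  # bits per second
    if sum(rates) > CAPACITY:
        for f in flows:
            f["cwnd"] /= 2                         # multiplicative decrease
    else:
        for f, rate in zip(flows, rates):
            f["bits"] += rate * DT                 # bits delivered this step
            f["cwnd"] += MSS_BITS * DT / f["rtt"]  # additive increase: +1 MSS per RTT
    t += DT

for f in flows:
    print(f"{f['name']} (RTT {f['rtt'] * 1000:.0f} ms): "
          f"avg {f['bits'] / DURATION / 1e6:.0f} Mb/s")
```

Because the additive-increase rate scales as 1/RTT^2, the model gives the shorter-RTT flow roughly 70% of the usable bandwidth, in the same neighbourhood as the measured 514 Mb/s versus 202 Mb/s split.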