Lecture 9 (10/22/2009) - Computer Science and Engineering
CSE 124 Networked Services, Fall 2009
B. S. Manoj, Ph.D.
http://cseweb.ucsd.edu/classes/fa09/cse124

Some of these slides are adapted from various sources/individuals, including but not limited to the slides from the textbooks by Kurose and Ross, digital libraries such as the IEEE/ACM digital libraries, and slides from Prof. Vahdat. Use of these slides other than for pedagogical purposes in CSE 124 may require explicit permission from the respective sources.

Announcements
• Programming Assignment 1
– Submission window: 23rd to 26th October
• Week-3 Homework
– Due on 26th October
• First Paper Discussion
– Discussion on 29th October
– Write-up due on 28th October
• Midterm: November 5

TCP Round Trip Time and Timeout
EstimatedRTT = (1 - α) * EstimatedRTT + α * SampleRTT
• Exponential weighted moving average
• Influence of a past sample decreases exponentially fast
• Typical value: α = 0.125
[Figure: SampleRTT and EstimatedRTT (milliseconds) over time, gaia.cs.umass.edu to fantasia.eurecom.fr]

TCP Round Trip Time and Timeout: setting the timeout
• EstimatedRTT plus a "safety margin"
– Large variation in EstimatedRTT -> larger safety margin
• First, estimate how much SampleRTT deviates from EstimatedRTT:
DevRTT = (1 - β) * DevRTT + β * |SampleRTT - EstimatedRTT|   (typically, β = 0.25)
• Then set the timeout interval:
TimeoutInterval = EstimatedRTT + 4 * DevRTT
• TimeoutInterval is exponentially increased with every retransmission

Fast Retransmit
• Time-out period is often relatively long: long delay before resending a lost packet
• Detect lost segments via duplicate ACKs
– Sender often sends many segments back-to-back
– If a segment is lost, there will likely be many duplicate ACKs for that segment
• If the sender receives 3 duplicate ACKs for the same data, it assumes that the segment after the ACKed data was lost
– Fast retransmit: resend the segment before the timer expires
[Figure: Host A sends segments x1-x5; one segment is lost, so Host B repeatedly ACKs x1; the triple duplicate ACK triggers retransmission well before the timeout]

TCP congestion control
• TCP sender should transmit as fast as possible, but without congesting the network
– Q: how to find a rate just below the congestion level?
• Decentralized: each TCP sender sets its own rate, based on implicit feedback:
– ACK: segment received (a good thing!); network not congested, so increase sending rate
– Lost segment: assume loss is due to a congested network, so decrease sending rate

TCP congestion control: bandwidth probing
• "Probing for bandwidth": increase transmission rate on receipt of ACKs until loss eventually occurs, then decrease transmission rate
• Continue to increase on ACK, decrease on loss (since available bandwidth changes, depending on other connections in the network)
[Figure: sending rate over time; the rate rises while ACKs are received and drops at each loss, producing TCP's "sawtooth" behavior]
• Q: how fast to increase/decrease? Details to follow

TCP Congestion Control: details
• Sender limits its rate by limiting the number of unACKed bytes "in the pipeline":
LastByteSent - LastByteAcked <= cwnd
– cwnd differs from rwnd (how, why?)
– Sender limited by min(cwnd, rwnd)
• Roughly, rate = cwnd / RTT bytes/sec
• cwnd is dynamic, a function of perceived network congestion
[Figure: cwnd bytes in flight over one RTT, with ACKs returning]
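The RTT-estimation and timeout rules above translate directly into a few lines of per-connection state. Below is a minimal sketch in Python, assuming a hypothetical RTTEstimator class (the name, structure, and the 1-second initial timeout are illustrative choices, not taken from any real TCP stack); it applies the two EWMA formulas with the typical α = 0.125 and β = 0.25 and doubles the timeout on every retransmission, as the slides describe.

```python
class RTTEstimator:
    """Sketch of TCP's adaptive timeout: EWMAs of RTT and its deviation."""

    def __init__(self, alpha=0.125, beta=0.25):
        self.alpha = alpha            # typical value from the slides
        self.beta = beta              # typical value from the slides
        self.estimated_rtt = None     # no estimate until the first sample
        self.dev_rtt = 0.0
        self.backoff = 1              # doubled on every retransmission

    def on_sample(self, sample_rtt):
        """Fold a new SampleRTT (in seconds) into the running estimates."""
        if self.estimated_rtt is None:
            self.estimated_rtt = sample_rtt   # simplified initialization
        else:
            # DevRTT = (1 - beta)*DevRTT + beta*|SampleRTT - EstimatedRTT|
            self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                            + self.beta * abs(sample_rtt - self.estimated_rtt))
            # EstimatedRTT = (1 - alpha)*EstimatedRTT + alpha*SampleRTT
            self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                                  + self.alpha * sample_rtt)
        self.backoff = 1              # a fresh sample ends the backoff

    def on_retransmit(self):
        """TimeoutInterval is exponentially increased on each retransmission."""
        self.backoff *= 2

    def timeout_interval(self):
        """EstimatedRTT plus a safety margin of four deviations."""
        if self.estimated_rtt is None:
            return 1.0                # assumed initial timeout, before any sample
        return (self.estimated_rtt + 4 * self.dev_rtt) * self.backoff
```

A noisier path (larger |SampleRTT - EstimatedRTT|) inflates DevRTT and therefore the safety margin, which is exactly the "large variation -> larger safety margin" behavior described above.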
TCP Congestion Control: more details
• Segment loss event: reduce cwnd
– Timeout: no response from the receiver; cut cwnd to 1 MSS
– 3 duplicate ACKs: at least some segments are getting through (recall fast retransmit); cut cwnd in half, less aggressively than on timeout
• ACK received: increase cwnd
– Slow-start phase: increase exponentially fast (despite the name), at connection start or following a timeout
– Congestion avoidance: increase linearly

TCP Slow Start
• When a connection begins, cwnd = 1 MSS
– Example: MSS = 500 bytes & RTT = 200 msec -> initial rate = 20 kbps
• Available bandwidth may be >> MSS/RTT
– Desirable to quickly ramp up to a respectable rate
• Increase rate exponentially until the first loss event or until a threshold is reached
– Double cwnd every RTT
– Done by incrementing cwnd by 1 MSS for every ACK received
[Figure: Host A / Host B timeline; one, then two, then four segments sent per RTT]

TCP slow (exponential) start
[Figure: exponential growth of cwnd during slow start]

Transitioning into/out of slow start
• ssthresh: cwnd threshold maintained by TCP
• On a loss event: set ssthresh to cwnd/2
– Remember (half of) the TCP rate when congestion last occurred
• When cwnd >= ssthresh: transition from slow start to the congestion-avoidance phase
[Figure: FSM fragment showing the slow-start state and its transitions; the complete FSM appears on the "TCP congestion control FSM: details" slide below]

TCP: congestion avoidance
• When cwnd > ssthresh, grow cwnd linearly
– Increase cwnd by 1 MSS per RTT
– Approach possible congestion more slowly than in slow start
– Implementation: cwnd = cwnd + MSS * (MSS/cwnd) for each ACK received
• AIMD: Additive Increase, Multiplicative Decrease
– ACKs: increase cwnd by 1 MSS per RTT (additive increase)
– Loss: cut cwnd in half on non-timeout-detected loss (multiplicative decrease)

TCP congestion control FSM: details
Slow start:
– Initialization: cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0
– New ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
– Duplicate ACK: dupACKcount++
– Timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
– dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; move to fast recovery
– cwnd > ssthresh: move to congestion avoidance
Congestion avoidance:
– New ACK: cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
– Duplicate ACK: dupACKcount++
– Timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; move to slow start
– dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; move to fast recovery
Fast recovery:
– Duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
– New ACK: cwnd = ssthresh; dupACKcount = 0; move to congestion avoidance
– Timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; move to slow start

Popular "flavors" of TCP
[Figure: cwnd (window size in segments) versus transmission round for TCP Tahoe and TCP Reno, with ssthresh marked]
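The FSM above maps almost line-for-line onto three event handlers. The sketch below, in Python, is a simplified Reno-style sender for illustration only (the RenoSender name is made up; sequence numbers, actual retransmission, and the receive window are all omitted), with cwnd and ssthresh kept in bytes as in the slides.

```python
MSS = 1460  # assumed maximum segment size, in bytes

class RenoSender:
    """Sketch of the slow-start / congestion-avoidance / fast-recovery FSM."""

    def __init__(self):
        self.cwnd = 1 * MSS            # slow start begins with cwnd = 1 MSS
        self.ssthresh = 64 * 1024      # initial ssthresh = 64 KB
        self.dup_acks = 0
        self.in_fast_recovery = False

    def on_new_ack(self):
        if self.in_fast_recovery:            # new ACK ends fast recovery
            self.cwnd = self.ssthresh
            self.in_fast_recovery = False
        elif self.cwnd < self.ssthresh:      # slow start: +1 MSS per ACK,
            self.cwnd += MSS                 # i.e. cwnd doubles every RTT
        else:                                # congestion avoidance:
            self.cwnd += MSS * MSS // self.cwnd   # ~+1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.in_fast_recovery:            # window inflation per dup ACK
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:               # triple duplicate ACK:
            self.ssthresh = self.cwnd // 2   # remember half the current rate
            self.cwnd = self.ssthresh + 3 * MSS
            self.in_fast_recovery = True     # fast-retransmit the missing
                                             # segment here, then recover

    def on_timeout(self):                    # timeout: back to slow start
        self.ssthresh = self.cwnd // 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.in_fast_recovery = False        # retransmit missing segment here
```

Tahoe, by contrast, skips the fast-recovery state entirely and falls back to cwnd = 1 MSS on a triple duplicate ACK as well, which is the difference visible in the "flavors" plot above.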
Summary: TCP Congestion Control
• When cwnd < ssthresh, the sender is in the slow-start phase; the window grows exponentially
• When cwnd >= ssthresh, the sender is in the congestion-avoidance phase; the window grows linearly
• When a triple duplicate ACK occurs, ssthresh is set to cwnd/2 and cwnd is set to ~ssthresh
• When a timeout occurs, ssthresh is set to cwnd/2 and cwnd is set to 1 MSS

Simplified TCP throughput
• Average throughput of TCP as a function of window size and RTT (ignoring slow start)?
• Let W be the window size when loss occurs
– When the window is W, throughput is W/RTT
– Just after a loss, the window drops to W/2, throughput to W/2RTT
– Average throughput: 0.75 W/RTT

TCP throughput as a function of loss rate
• Assume 1 packet is lost per cycle (the window ramping from W/2 back up to W, with W measured in segments)
• Roughly (3/8) W^2 packets are sent per cycle, so the loss rate L is obtained as
L = 1 / ((3/8) W^2)
• Since this gives W = sqrt(8 / (3L)), we get
Throughput = 0.75 * W * MSS / RTT = 1.22 * MSS / (RTT * sqrt(L))

TCP Futures: TCP over "long, fat pipes"
• Example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• Requires window size W = 83,333 in-flight segments
• Throughput in terms of loss rate: Throughput = 1.22 * MSS / (RTT * sqrt(L))
• Required value of the packet loss rate: L = 2 x 10^-10
• Existing TCP may not scale well in future networks
• Need new versions of TCP for high-speed networks
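As a sanity check on the "long, fat pipes" numbers, the snippet below plugs the example values into the two formulas above; it is plain arithmetic with our own variable names, nothing more.

```python
from math import sqrt

MSS = 1500 * 8   # segment size in bits
RTT = 0.100      # round-trip time in seconds
target = 10e9    # desired throughput: 10 Gbps

# Window needed so that W * MSS / RTT reaches the target rate:
W = target * RTT / MSS
print(f"required window: {W:,.0f} in-flight segments")   # ~83,333

# Loss rate needed so that 1.22 * MSS / (RTT * sqrt(L)) reaches the target:
L = (1.22 * MSS / (RTT * target)) ** 2
print(f"required loss rate: {L:.1e}")                    # ~2.1e-10
```

A loss rate of roughly 2 x 10^-10 means at most one loss per ~5 billion segments, far below what real networks deliver, which is why the slide concludes that existing TCP may not scale to such links.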
TCP Fairness
• Fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]

Why is TCP fair?
Two competing sessions:
• Additive increase gives a slope of 1 as throughput increases
• Multiplicative decrease decreases throughput proportionally
[Figure: Connection 1 throughput versus Connection 2 throughput; alternating congestion avoidance (additive increase) and loss (window cut by a factor of 2) moves the pair toward the equal-bandwidth-share line at R]

Fairness (more)
Fairness and UDP:
• Multimedia apps often do not use TCP
– They do not want their rate throttled by congestion control
• Instead they use UDP: pump audio/video at a constant rate, tolerate packet loss
Fairness and parallel TCP connections:
• Nothing prevents an app from opening parallel connections between 2 hosts
• Web browsers do this
• Example: a link of rate R supporting 9 connections
– A new app asking for 1 TCP connection gets rate R/10
– A new app asking for 11 TCP connections gets R/2!

Bandwidth sharing with TCP
[Figure: two TCP flows sharing a link; TCP and UDP flows sharing a link]

Networks vs Processors
• Network speeds
– 100 Mbps to 1 Gbps to 10 Gbps
• Network protocol stack throughput
– Is good for only 100 Mbps
– With fine-tuning, OK for 1 Gbps
– What about 10 Gbps?
• Example: payload size 1460 B, 2-3 GHz processor
– Receive throughput achieved: 750 Mbps
– Transmit throughput achieved: 1 Gbps
• Need radical solutions to support 10 Gbps and beyond

Where is the overhead?
• TCP was suspected of being too complex
– In 1989, Clark, Jacobson, and others proved otherwise
• The complexity (overhead) lies in the computing environment where TCP operates:
– Interrupts
– OS scheduling
– Buffering
– Data movement
• Simple solutions that improve performance:
– Interrupt moderation: the NIC waits for multiple packets and notifies the processor once, amortizing the high cost of interrupts
– Checksum offload: checksum calculation in the processor is costly; offload it to the NIC (in hardware)
– Large segment offload: segmenting large chunks of data into smaller segments is expensive; offload segmentation and TCP/IP header preparation to the NIC; useful for sender-side TCP
• These simple solutions can support up to ~1 Gbps PHYs

Challenges in detail
• OS issues
– Interrupts: interrupt moderation, polling, hybrid interrupts
• Memory
– Latency: memory is slower than the processor
– Poor cache locality: new data enters from the NIC or the application, so cache misses and CPU stalls are common
• Buffering and copying
– Usually two copies are required: an application-to-TCP copy and a TCP-to-NIC copy
– Receive side: the copies can be reduced to one if posted buffers are provided by the application, but mostly two copies are required
– Transmit side: zero copy on transmit (DMA from application to NIC) can help; implemented on selected systems

TCP/IP Acceleration Methods
• Three main strategies
– TCP Offload Engine (TOE)
– TCP onloading
– Stack and NIC enhancements
• TCP Offload Engine
– Offloads TCP/IP processing to devices attached to the server's I/O system
– Uses separate processing and memory resources
– Pros
• Improves throughput and utilization performance
• Useful for bulk data transfer such as IP storage
• Good for few connections with high-bandwidth links
– Cons
• May not scale well to a large number of connections
• Needs special processors (expensive)
• Needs large memory on the NIC (expensive)
• Store-and-forward in the TOE is suitable only for large transfers
• Latency between the I/O subsystem and main memory is high
• Expensive TOEs or NICs are required
[Figure: processor and cache memory, with the TCP Offload Engine on the NIC device]

TCP onloading
• Dedicate TCP/IP processing to one or more general-purpose cores
– High performance
– Cheap
– Main-memory-to-CPU latency is small
• Extensible
– Programming tools and implementations exist
– Good for long-term performance
• Scalable
– Good for a large number of flows
[Figure: cores 0-2 run applications; core 3 is dedicated to TCP/IP processing (onloading); all share cache memory and the NIC device]

Stack and NIC enhancements
• Asynchronous I/O
– Asynchronous callbacks on data arrival
– Pre-posting of buffers by the application to avoid copying
• Header splitting
– Splitting headers and data
– Better data pre-fetching
– The NIC can place the header separately
• Receive-side scaling
– Use multiple cores to achieve connection-level parallelism
– Have multiple queues in the NIC
– Map each queue to a different processor

Summary
Reading assignment:
• TCP from Chapter 3 in Kurose and Ross
• TCP from Chapter 5 in Peterson and Davie
Homework:
• Problems P37 and P43 (pages 306-308) from Kurose and Ross
• Deadline: 30th October 2009