TCP/IP and Other Transports for High Bandwidth Applications
Back to Basics
Richard Hughes-Jones, The University of Manchester
www.hep.man.ac.uk/~rich/ then “Talks” then look for “Brasov”
Summer School, Brasov, Romania, July 2005, R. Hughes-Jones Manchester

Structure of the Talks
The aim is to give you a picture of how researchers are using high performance networks to support their work.
Back to Basics
  Simple Introduction to Networking
TCP/IP on High Bandwidth Long Distance Networks
  But TCP/IP works!
  The effect of packet loss
  Advanced TCP Stacks
  Fairness
Real Applications on Real Networks
  Disk-2-disk applications on real networks
  Memory-2-memory tests
  Transatlantic disk-2-disk at Gigabit speeds
  Remote Computing Farms
  The effect of distance
  Radio Astronomy e-VLBI
Thanks to the following for allowing me to use their slides: Sylvain Ravot CERN, Les Cottrell SLAC, Brian Tierney LBL, Robin Tasker DL

Simple Introduction to Networking

What is a Protocol Stack?
The ISO OSI (Open Systems Interconnection) Seven Layer Model defines a framework allowing development of real network protocols.
A layer:
  performs unique and specific tasks
  only has knowledge of those layers immediately above and below
  uses services of the layer below, and provides services to the layer above
  defines services that are implementation independent: it is a definition of how things work conceptually
  communicates with its peer in the remote system

Encapsulation: The Layering Principle
Each protocol layer N adds a Header to the data unit from layer N+1. The Header contains control information.
  Layer 7: Application. User processes. Data unit: App data
  Layer 6: Presentation. Data interpretation, code transformation. Data unit: PH | App data
  Layer 5: Session. Connection, negotiation control. Data unit: SH | PH | App data
  Layer 4: Transport. End-2-end data transfer & integrity; packet sequencing, flow control. Data unit (Segment): TH | SH | PH | App data
  Layer 3: Network. Addressing, routing; packet sequencing, flow control. Data unit (Packet): NH | TH | SH | PH | App data
  Layer 2: Data Link. Packet assembly/disassembly; transmission control, error checking. Data unit (Frame): DH | NH | TH | SH | PH | App data | FCS
  Layer 1: Physical. Electrical, optical, mechanical. Bits on the “wire”

What do the Layers do?
Transport Layer: acts as a go-between for the user and network
  Provides end-to-end data movement & control
  Gives the level of reliability/integrity needed by the application
  Can ensure a reliable service (which the network layer cannot), e.g. assigns sequence numbers to identify “lost” packets
Network Layer: deals with logical addressing & the transmission of packets, and provides the mechanism for routing.
Data Link Layer: provides the synchronization and error checking for the data transmitted over a single physical link (may ensure correct delivery of frames)
  Going down: fits packets from the network layer above into frames.
  Going up: groups bits from the physical layer into frames.
Physical Layer: concerned with the transmission of individual bits.

How do the “IP” Protocols fit together?
Application (Presentation, Session):
  File Transfer Protocol (FTP) RFC 959
  TELNET RFC 854
  Simple Mail Transfer Protocol (SMTP) RFC 821
  DNS, traceroute, ping, POP3/IMAP, HTTP, ssh
  NFS RFC 1014, 1057 and 1094
  SNMP RFC 1157
  TFTP RFC 783
Transport:
  User Datagram Protocol (UDP) RFC 768
  Transmission Control Protocol (TCP) RFC 793
Network:
  Internet Control Message Protocol (ICMP) RFC 792
  Routing: OSPF, BGP
  Address Resolution Protocols: ARP RFC 826, RARP RFC 903
  Internet Protocol (IP) RFC 791
Data Link:
  Network Interface Cards: Ethernet, Token Ring, ISDN, FDDI, SMDS, ATM, SDH/SONET, xDSL
Physical (Transmission Mode):
  TP Copper, Fibre Optic, Satellite, Microwave, DWDM, CWDM etc.

Some of the “IP” Protocols
Transmission Control Protocol. TCP provides application programs access to the network using a reliable, connection-oriented transport layer service.
User Datagram Protocol. UDP provides an unreliable, connectionless delivery service using the IP protocol to transport messages between machines. It adds the ability to distinguish among multiple destinations on a single host computer.
Internet Protocol. IP receives datagrams from the upper-layer software and transmits them to the destination host using a best effort, connectionless delivery service.
Internet Control Message Protocol. ICMP allows internet routers to transmit error messages and test messages.
Internet Group Management Protocol. IGMP is used with multicast to send UDP datagrams to multiple hosts.
Address Resolution Protocol. ARP translates from the 32 bit IP address to the 48 bit LAN address.
Reverse Address Resolution Protocol. RARP translates from the 48 bit LAN address to the 32 bit IP address.

The Physical Layer 1: Ethernet
(diagram only)

The Link Layer 2: Ethernet Frame
Frame layout: Frame header | IP Datagram | FCS, followed by a 12 byte Inter Frame Gap
Preamble: 56 bits of alternating 0s and 1s. The preamble provides all the nodes on the network a signal against which to synchronize.
Start Frame Delimiter: marks the start of a frame. It is 8 bits long with the pattern 10101011.
Media Access Control (MAC) Address: every Ethernet network card has, built into its hardware, a unique six-octet (48-bit) number, usually written in hexadecimal, that differentiates it from all other Ethernet cards in the universe. The DA and SA define the path across the link.
Length/Type field: two octets long.
  If the value <= 1500 (0x05dc hex) it indicates the length of the data
  If the value > 1500 it indicates the network-layer protocol: “Ethernet Types”
Data: the reason the frame exists. Its maximum size is the MTU (Maximum Transmission Unit).
Frame Check Sequence: protects the frame contents.

The Link Layer: Ethernet VLANs
VLANs are logical networks built over the same physical cable plant.
The VLAN header distinguishes which logical network an Ethernet frame belongs to.
A VLAN frame is identified by the value 0x8100 in the Type field location.
The next two octets are composed of the following three fields:
  User Priority field: 3 bits in length, used to define the priority of the Ethernet frame and so to define and deliver a class of service
  Canonical format indicator: 1 bit in length. Just **don’t** ask!!!
  VLAN Identifier field: 12 bits in length, contains the VLAN identifier (VID) of this frame
The original Length/Type field then follows the inserted VLAN tag.
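The Ethernet header and VLAN tag fields described above can be picked apart with a few lines of Python. This is only a minimal sketch: the function name `parse_ethernet_header` and the returned dictionary keys are illustrative choices, not anything from the talk, and the input is assumed to start at the destination address (preamble, start frame delimiter and FCS already stripped, as is typical for captured frames).

```python
import struct


def parse_ethernet_header(frame: bytes) -> dict:
    """Decode DA, SA, an optional VLAN tag and the Length/Type field."""
    if len(frame) < 14:
        raise ValueError("frame too short for an Ethernet header")
    dst, src, type_or_len = struct.unpack("!6s6sH", frame[:14])
    info = {"dst": dst.hex(":"), "src": src.hex(":"), "vlan": None}
    offset = 14

    # 0x8100 in the Type position means a VLAN tag has been inserted.
    if type_or_len == 0x8100:
        if len(frame) < 18:
            raise ValueError("frame too short for a VLAN tag")
        tag, type_or_len = struct.unpack("!HH", frame[14:18])
        info["vlan"] = {
            "priority": tag >> 13,        # 3 bit User Priority
            "cfi": (tag >> 12) & 0x1,     # 1 bit Canonical format indicator
            "vid": tag & 0x0FFF,          # 12 bit VLAN identifier
        }
        offset = 18

    # The original Length/Type field follows the tag (if any).
    if type_or_len <= 1500:
        info["length"] = type_or_len
    else:
        info["ethertype"] = type_or_len
    info["payload_offset"] = offset
    return info


# Made-up example frame: broadcast DA, VLAN 100 with priority 5, carrying IP (0x0800).
example = bytes.fromhex("ffffffffffff" "001122334455" "8100" "a064" "0800") + b"payload"
print(parse_ethernet_header(example))
```

The addresses and tag value in `example` are invented purely for the demonstration.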

The Network Layer 3: IP
IP Layer properties:
  Provides best effort delivery
  It is unreliable: packets may be lost, duplicated, or out of order
  Connectionless
  Provides logical addresses
  Provides routing
  Demultiplexes data on protocol number

The Internet datagram
Frame layout: Frame header | IP header | Transport | FCS
IP header fields, in 32-bit rows (bit positions 0 to 31); the fixed part is 20 bytes:
  Vers (4 bits) | Hlen (4 bits) | Type of serv. (8 bits) | Total length (16 bits)
  Identification (16 bits) | Flags (3 bits) | Fragment offset (13 bits)
  TTL (8 bits) | Protocol (8 bits) | Header Checksum (16 bits)
  Source IP address (32 bits)
  Destination IP address (32 bits)
  IP Options (if any) | Padding

IP Datagram Format (cont.)
Type of Service (TOS): now being used for QoS
Total length: length of the datagram in bytes, includes header and data
Time to live (TTL): specifies how long the datagram is allowed to remain in the internet
  Routers decrement it by 1
  When TTL = 0 the router discards the datagram
  Prevents infinite loops
Protocol: specifies the format of the data area
  Protocol numbers are administered by a central authority to guarantee agreement, e.g. ICMP=1, TCP=6, UDP=17 …
Source & destination IP address (32 bits each): contain the IP address of the sender and the intended recipient
Options (variable length): mainly used to record a route, or timestamps, or specify routing
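To make the header layout concrete, the fixed 20 byte part of an IP header can be unpacked as sketched below. The function names `parse_ipv4_header` and `forward_ttl` are hypothetical helpers invented for this illustration; options are not decoded and the checksum is not verified.

```python
import ipaddress
import struct

# Protocol numbers quoted on the slide.
PROTOCOL_NAMES = {1: "ICMP", 6: "TCP", 17: "UDP"}


def parse_ipv4_header(data: bytes) -> dict:
    """Unpack the fixed 20 byte IP header (options, if any, are ignored)."""
    if len(data) < 20:
        raise ValueError("need at least 20 bytes for an IP header")
    (vers_hlen, tos, total_length, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", data[:20])
    return {
        "version": vers_hlen >> 4,
        "header_len": (vers_hlen & 0x0F) * 4,   # Hlen counts 32-bit words
        "tos": tos,
        "total_length": total_length,           # header plus data, in bytes
        "identification": ident,
        "flags": flags_frag >> 13,
        "fragment_offset": flags_frag & 0x1FFF,
        "ttl": ttl,
        "protocol": PROTOCOL_NAMES.get(proto, str(proto)),
        "checksum": checksum,
        "src": str(ipaddress.IPv4Address(src)),
        "dst": str(ipaddress.IPv4Address(dst)),
    }


def forward_ttl(ttl: int):
    """Router behaviour from the slide: decrement by 1, discard at 0 (None)."""
    ttl -= 1
    return ttl if ttl > 0 else None


# Made-up header: 60 byte TCP datagram from 192.168.22.123, TTL 64, checksum left as 0.
header = struct.pack(
    "!BBHHHBBH4s4s", 0x45, 0, 60, 0x1C46, 0x4000, 64, 6, 0,
    ipaddress.IPv4Address("192.168.22.123").packed,
    ipaddress.IPv4Address("192.168.0.1").packed,
)
print(parse_ipv4_header(header))
```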

Internet Class-based addresses
An address looks like 192.168.22.123
Class A: large number of hosts, few networks
  0nnnnnnn hhhhhhhh hhhhhhhh hhhhhhhh
  7 network bits (0 and 127 reserved, so 126 networks), 24 host bits (> 16M hosts/net)
  Initial byte 1-126 (decimal)
Class B: medium number of hosts and networks
  10nnnnnn nnnnnnnn hhhhhhhh hhhhhhhh
  16,384 class B networks, 65,534 hosts/network
  Initial byte 128-191 (decimal)
Class C: large number of small networks
  110nnnnn nnnnnnnn nnnnnnnn hhhhhhhh
  2,097,152 networks, 254 hosts/network
  Initial byte 192-223 (decimal)
Class D: Multicast (see RFC 1112)
  1110nnnn nnnnnnnn nnnnnnnn hhhhhhhh
  Initial byte 224-239 (decimal)
Class E: Reserved
  Initial byte 240-255 (decimal)

The Transport Layer 4: UDP
UDP provides:
  Connectionless service over IP
    No setup or teardown
    One packet at a time
  Minimal overhead, high performance
  Best effort delivery
It is unreliable: packets may be lost, duplicated, or out of order
The application is responsible for:
  Data reliability
  Flow control
  Error handling

UDP Datagram format
Frame layout: Frame header | IP header | UDP header | Application data | FCS
UDP header fields, in 32-bit rows (bit positions 0 to 31); 8 bytes in total:
  Source port (16 bits) | Destination port (16 bits)
  UDP message len (16 bits) | Checksum (opt.) (16 bits)
Source/destination port: port numbers identify the sending & receiving processes
  Port number & IP address allow any application on the Internet to be uniquely identified
  Ports can be static or dynamic
    Static (< 1024): assigned centrally, known as well known ports
    Dynamic
Message length: in bytes, includes the UDP header and data (min 8, max 65,535)
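The 8 byte UDP header is simple enough to build and decode by hand. The sketch below is illustrative only: `build_udp_datagram` and `parse_udp_datagram` are made-up helper names, and the checksum is left at 0, which over IPv4 signals that no checksum was computed.

```python
import struct

UDP_HEADER_LEN = 8


def build_udp_datagram(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Prepend a UDP header; message length covers header plus data."""
    length = UDP_HEADER_LEN + len(payload)
    if length > 65535:
        raise ValueError("UDP message length is limited to 65,535 bytes")
    checksum = 0  # optional field: 0 means "not computed" in this sketch
    return struct.pack("!HHHH", src_port, dst_port, length, checksum) + payload


def parse_udp_datagram(datagram: bytes) -> dict:
    """Split a UDP datagram into its four header fields and the data."""
    if len(datagram) < UDP_HEADER_LEN:
        raise ValueError("shorter than the minimum UDP message length of 8")
    src_port, dst_port, length, checksum = struct.unpack("!HHHH", datagram[:8])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "length": length,
        "checksum": checksum,
        "dst_is_well_known": dst_port < 1024,   # static, centrally assigned
        "payload": datagram[UDP_HEADER_LEN:length],
    }


# Dynamic source port 40000 sending to well known port 53.
datagram = build_udp_datagram(40000, 53, b"hello")
print(parse_udp_datagram(datagram))
```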

The Transport Layer 4: TCP
TCP (RFC 793, RFC 1122) provides:
  Connection-oriented service over IP
    During setup the two ends agree on details
    Explicit teardown
    Multiple connections allowed
  Reliable end-to-end byte stream delivery over an unreliable network
    It takes care of lost packets, duplicated packets, and out of order packets
TCP provides:
  Data buffering
  Flow control
  Error detection & handling
  Limits on network congestion

The TCP Segment Format
Frame layout: Frame header | IP header | TCP header | Application data | FCS
TCP header fields, in 32-bit rows (bit positions 0 to 31); the fixed part is 20 bytes:
  Source port (16 bits) | Destination port (16 bits)
  Sequence number (32 bits)
  Acknowledgement number (32 bits)
  Hlen (4 bits) | Resv (6 bits) | Code (6 bits) | Window (16 bits)
  Checksum (16 bits) | Urgent ptr (16 bits)
  Options (if any) | Padding

TCP Segment Format (cont.)
Source/Dest port: TCP port numbers to identify the applications at both ends of the connection
Sequence number: first byte in the segment from the sender’s byte stream
Acknowledgement: identifies the number of the byte the sender of this segment expects to receive next
Code: used to determine segment purpose, e.g. SYN, ACK, FIN, URG
Window: advertises how much data this station is willing to accept. Can depend on buffer space remaining.
Options: used for window scaling, SACK, timestamps, maximum segment size etc.

TCP: providing reliability
Positive acknowledgement (ACK) of each received segment
  Sender keeps a record of each segment sent
  Sender awaits an ACK: “I am ready to receive byte 2048 and beyond”
  Sender starts a timer when it sends a segment, so it can re-transmit
Diagram (sender/receiver timeline):
  Sender sends Segment n: Sequence 1024, Length 1024
  One RTT later the sender receives the ACK of Segment n: Ack 2048
  Sender sends Segment n+1: Sequence 2048, Length 1024
  One RTT later the sender receives the ACK of Segment n+1: Ack 3072
Inefficient: the sender has to wait

Flow Control: Sender, Congestion Window
Uses the congestion window, cwnd, a sliding window to control the data flow
  Byte count giving the highest byte that can be sent without an ACK
  Transmit buffer size and advertised receive buffer size are important
  ACK gives the next sequence number to receive AND the available space in the receive buffer
  Timer kept for each packet
Diagram (TCP cwnd slides), regions of the sender’s byte stream:
  Data sent and ACKed
  Sent data buffered waiting for ACK
  Unsent data that may be transmitted immediately
  Data to be sent, waiting for the window to open (application writes here)
Received ACK advances the trailing edge
Receiver’s advertised window advances the leading edge
Sending host advances the marker as data is transmitted

Flow Control: Receiver, Lost Data
Diagram (window slides), regions of the receiver’s byte stream:
  Data given to application (application reads here)
  ACKed but not given to user (ends at last ACK given)
  Next byte expected (expected sequence no.), where the lost data should be
  Received but not ACKed
Receiver’s advertised window advances the leading edge
If new data is received with a sequence number ≠ next byte expected, a duplicate ACK is sent with the expected sequence number
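The receiver behaviour described above, ACKing the next byte expected and repeating that ACK when a segment arrives beyond a gap, can be sketched as a toy model. The `Receiver` class is hypothetical, and holding out-of-order segments until the gap is filled is an assumption of this sketch; the slides only state that a duplicate ACK is sent.

```python
class Receiver:
    """Toy cumulative-ACK receiver: every ACK carries the next byte expected."""

    def __init__(self, initial_seq: int):
        self.next_expected = initial_seq
        self.held = {}  # seq -> length of segments received beyond a gap

    def receive(self, seq: int, length: int) -> int:
        if seq == self.next_expected:
            self.next_expected += length
            # Segments held beyond the gap may now be contiguous.
            while self.next_expected in self.held:
                self.next_expected += self.held.pop(self.next_expected)
        elif seq > self.next_expected:
            # Data is missing: hold this segment; the ACK below is a duplicate.
            self.held[seq] = length
        # seq < next_expected is an old duplicate; it is simply re-ACKed.
        return self.next_expected


rx = Receiver(initial_seq=1024)
acks = [rx.receive(1024, 1024)]                                   # in order
acks += [rx.receive(seq, 1024) for seq in (3072, 4096, 5120)]     # 2048 was lost
acks.append(rx.receive(2048, 1024))                               # retransmission
print(acks)  # [2048, 2048, 2048, 2048, 6144]
```

The three repeated 2048 values are the duplicate ACKs; once the missing segment arrives, the ACK jumps past everything that was being held.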

How it works: TCP Slowstart
Probe the network to get a rough estimate of the optimal congestion window size
The larger the window size, the higher the throughput
  Throughput = Window size / Round-trip Time
Exponentially increase the congestion window size until a packet is lost
  cwnd is initially 1 MTU, then increased by 1 MTU for each ACK received
  Send 1st packet, get 1 ACK, increase cwnd to 2
  Send 2 packets, get 2 ACKs, increase cwnd to 4
  Time to reach cwnd size W = RTT*log2(W)
  Rate doubles each RTT
Figure: cwnd versus time, showing slow start (exponential increase), congestion avoidance (linear increase), packet loss, timeout, and retransmit (slow start again)

How it works: TCP Congestion Avoidance
Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
  cwnd increased by 1/cwnd segments for each ACK, i.e. about one segment per RTT: a linear increase in rate
TCP takes packet loss as an indication of congestion!
Multiplicative decrease: cut the congestion window size aggressively if a packet is lost
  Standard TCP multiplies cwnd by 0.5, i.e. halves it
The transition from slow start to congestion avoidance is determined by ssthresh
Figure: the same cwnd versus time plot as on the slow start slide

TCP Fast Retransmit & Recovery
Duplicate ACKs are due to lost segments or segments out of order.
Fast Retransmit: if the sender receives 3 duplicate ACKs (i.e. the receiver got 3 additional segments without getting the one expected):
  Send the missing segment
  Set ssthresh to 0.5*cwnd, so enter the congestion avoidance phase
  Set cwnd = (0.5*cwnd + 3), where the 3 accounts for the 3 dup ACKs
  Increase cwnd by 1 segment for each further duplicate ACK
  Keep sending new data if allowed by cwnd
  Set cwnd to half the original value on a new ACK
  No need to go into “slow start” again
At steady state, cwnd oscillates around the optimal window size
With a retransmission timeout, slow start is triggered again
Figure: the same cwnd versus time plot as on the slow start slide

TCP: Simple Tuning - Filling the Pipe
Remember, TCP has to hold a copy of data in flight
Optimal (TCP buffer) window size depends on:
  Bandwidth end to end, i.e. min(BWlinks), AKA the bottleneck bandwidth
  Round Trip Time (RTT)
The number of bytes in flight needed to fill the entire path is the Bandwidth*Delay Product: BDP = RTT*BW
  Setting the window to the BDP can increase the achieved bandwidth by orders of magnitude
Windows are also used for flow control
Diagram: sender/receiver timeline showing the RTT, the returning ACK, and segment time on wire = bits in segment / BW

Congestion control: ACK clocking
(diagram only)

More Information
Lectures, tutorials etc. on TCP/IP:
  www.nv.cc.va.us/home/joney/tcp_ip.htm
  www.cs.pdx.edu/~jrb/tcpip.lectures.html
  www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
  www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
  www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
  www.jbmelectronics.com/tcp.htm
Encyclopaedia: http://www.freesoft.org/CIE/index.htm
TCP/IP Resources: www.private.org.il/tcpip_rl.html
Understanding IP addresses: http://www.3com.com/solutions/en_US/ncs/501302.html
Configuring TCP (RFC 1122): ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
Assigned protocols, ports etc (RFC 1010): http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols

Any Questions?

Backup Slides

More Information: Some URLs
UKLight web site: http://www.uklight.ac.uk
MB-NG project web site: http://www.mb-ng.net/
DataTAG project web site: http://www.datatag.org/
UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
Motherboard and NIC Tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
  “Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards” FGCS Special issue 2004, http://www.hep.man.ac.uk/~rich/
TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
TCP stack comparisons: “Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks” Journal of Grid Computing 2004
PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
Dante PERT: http://www.geant2.net/server/show/nav.00d00h002

tcpdump / tcptrace
tcpdump: dump all TCP header information for a specified source/destination
  ftp://ftp.ee.lbl.gov/
tcptrace: format tcpdump output for analysis using xplot
  http://www.tcptrace.org/
NLANR TCP Testrig: nice wrapper for the tcpdump and tcptrace tools
  http://www.ncne.nlanr.net/TCP/testrig/
Sample use:
  tcpdump -s 100 -w /tmp/tcpdump.out host hostname
  tcptrace -Sl /tmp/tcpdump.out
  xplot /tmp/a2b_tsg.xpl

tcptrace and xplot
X axis is time
Y axis is sequence number
The slope of this curve gives the throughput over time.
The xplot tool makes it easy to zoom in

Zoomed In View
Green Line: ACK values received from the receiver
Yellow Line: tracks the receive window advertised by the receiver
Green Ticks: track the duplicate ACKs received
Yellow Ticks: track the window advertisements that were the same as the last advertisement
White Arrows: represent segments sent
Red Arrows (R): represent retransmitted segments

TCP Slow Start
(plot only)
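The window relations from the main talk (BDP = RTT*BW, time to reach cwnd size W = RTT*log2(W), exponential then linear growth, halving on loss) can be checked numerically with a rough model. Everything here is an assumption made for illustration: the function names, the 1 Gbit/s and 100 ms figures, the 1460 byte segment size, and the simplification that cwnd changes once per RTT. The loss response modelled is the halving of fast recovery, not a timeout.

```python
import math


def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth*Delay Product: bytes in flight needed to fill the path."""
    return bandwidth_bps * rtt_s / 8


def slow_start_rtts(target_segments: int) -> int:
    """Count the RTTs of doubling needed for cwnd to grow from 1 to the target."""
    cwnd, rtts = 1, 0
    while cwnd < target_segments:
        cwnd *= 2
        rtts += 1
    return rtts


def cwnd_trace(rtts: int, ssthresh: int, loss_at: set) -> list:
    """cwnd (in segments) at the start of each RTT."""
    cwnd, trace = 1, []
    for t in range(rtts):
        trace.append(cwnd)
        if t in loss_at:
            ssthresh = max(cwnd // 2, 2)    # multiplicative decrease
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # slow start: rate doubles each RTT
        else:
            cwnd += 1                       # congestion avoidance: linear
    return trace


# Assumed example path: 1 Gbit/s bottleneck, 100 ms RTT, 1460 byte segments.
window = bdp_bytes(1e9, 0.1)
segments = math.ceil(window / 1460)
print(f"BDP = {window:.0f} bytes = {segments} segments")
print(f"slow start needs about {slow_start_rtts(segments)} RTTs to reach it")
print(cwnd_trace(rtts=10, ssthresh=8, loss_at={6}))
```

With these assumed numbers the window needed is 12.5 Mbytes, far larger than a 64 kbyte default buffer, which is the point of the “Filling the Pipe” slide.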