TCP Offload through Connection Handoff
Hyong-Youb Kim and Scott Rixner

Performance Issues of TCP/IP Stack

Network Stack Performance
• Well-known issues: data copies and TCP checksum calculations.
• New observation: a large number of connections and long network latencies severely degrade processor performance, which in turn reduces system performance.
• Connection data structures (protocol control blocks, sockets, etc.) overwhelm the L2 cache and cause cache misses.

[Figure: Network stack profile. Cycles, instructions, and uops per packet, broken down by System Call, TCP, IP, Eth, and Driver (plus Total), for the Base, Zero-copy, +Checksum Offload, ZC+CO, +1024 Connections, and +20ms Latency configurations, and for SPECweb99.]

Web Server Throughput
[Figure: HTTP content throughput (Mb/s) versus number of connections (4 to 2048) for the CS, IBM, NASA, SPECWEB, and WC workloads.]
[Figure: HTTP content throughput (Mb/s) versus one-way latency (0 to 40 ms) for the same workloads.]

TCP Offload
[Figure: Host CPU and DDR DRAM connected through the chipset and PCI bus to a NIC that contains an offload processor and its own DRAM.]
• The offload processor runs TCP. It has fast memory for storing connections and can process packets more efficiently than the host CPU.
• The offload processor needs to communicate with the host CPU, so a software interface between them is required.

Connection Handoff Interface
Design considerations:
1. The NIC has finite compute power.
2. The NIC has finite memory.
3. We do not want to complicate the host stack's software architecture.
4. We do not want to modify the sockets API.

Connection handoff: the OS establishes a connection and then hands it off to the NIC.

Advantages of connection handoff:
1. The OS can control the amount of work given to the NIC.
2. Minimal impact on the stack architecture.
3. The socket interface is unchanged.
4. Zero-copy I/O can be achieved.

[Figure: Handoff architecture. On the host OS, the user application, file, socket, and socket buffer sit above TCP, IP, Ethernet, and the driver, with a bypass layer. On the NIC, the socket and socket buffer sit above TCP (transmit and receive), IP, and Ethernet, with a connection lookup. The handoff (offload) interface synchronizes sockets between host and NIC. Data structures shown include the protocol control block, the TCP control block, and the cached route.]

Offload Policies
The NIC is a non-trivial resource, so the host OS must manage it through policies.
Policy objectives: maximize packet rates, ensure fair allocation of NIC resources, etc.
[Figure: The policy takes connection information and NIC information as inputs and outputs a yes/no offload decision.]

Standard Interface for Offload Firmware
Writing firmware involves too many low-level hardware details, and the resulting firmware is not portable. A standard API for firmware can help.
Components:
1. Send/receive through the MAC
2. Read/write through DMA
3. Message exchange through the driver
4. CPU and memory abstraction
[Figure: NIC hardware (CPU, DMA, MAC, DRAM, SRAM), whose many low-level details firmware must otherwise handle directly.]

Experimental Results
Prototype system: Athlon XP CPU, 2 GB DRAM, FreeBSD 4.7, and an Alteon programmable Gigabit Ethernet NIC.
[Figure: Cycles, instructions, and L2 misses per packet, broken down by System Call, TCP, IP, Ethernet, Driver, and Bypass (plus Total), with and without handoff, for TCP Send with 256 total connections (256 offloaded), TCP Send with 512 total connections (256 offloaded), and SPECweb99.]
[Figure: Messages per packet for each message type (Transmit, Receive, ch_nic_handoff, ch_nic_restore, ch_nic_send, ch_nic_recvd, ch_nic_ctrl, ch_nic_forward, ch_os_recv, ch_os_ack, ch_os_ctrl, ch_os_resource, ch_os_restore), for the TCP Send and Web workloads with and without handoff.]
• Socket operations occur less frequently than transmits and receives, so PCI message traffic is reduced.
• SPECweb99 simulation with a faster NIC: 26% increase in HTTP throughput.
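The observation that connection data structures overwhelm the L2 cache can be made concrete with simple arithmetic. The C sketch below is a deliberately crude model and not from the poster: it assumes every connection's state is touched and stays resident, and it ignores associativity and other data competing for the cache. The function name and all sizes are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: ratio of total per-connection state (PCB, socket,
 * etc.) to L2 cache capacity. A result above 1.0 means the connection
 * working set cannot fit, so accesses to connection state start missing
 * in L2. Crude model: assumes all connections are active and resident. */
double conn_state_cache_ratio(size_t connections, size_t bytes_per_conn,
                              size_t l2_bytes)
{
    if (l2_bytes == 0)
        return 0.0;
    return (double)connections * (double)bytes_per_conn / (double)l2_bytes;
}
```

With a hypothetical 512 bytes of state per connection and a 256 KB L2, `conn_state_cache_ratio(1024, 512, 256 * 1024)` returns 2.0, so 1,024 connections would need twice the cache, while 16 connections would use about 3% of it.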
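The offload policy (connection and NIC information in, a yes/no decision out) might look like the following minimal sketch. It is not the authors' policy: the structure fields, the connection cap, and the 90% utilization threshold are hypothetical, chosen only to reflect the two design constraints of finite NIC compute power and finite NIC memory.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of NIC resources as reported to the host OS. */
struct nic_info {
    size_t   mem_free_bytes;   /* unused NIC memory for connection state */
    double   cpu_utilization;  /* offload processor load, 0.0 to 1.0 */
    unsigned offloaded_conns;  /* connections currently on the NIC */
    unsigned max_conns;        /* hypothetical cap chosen by the OS */
};

/* Hypothetical per-connection information given to the policy. */
struct conn_info {
    size_t state_bytes;        /* NIC memory needed to hold this connection */
};

/* Hypothetical threshold: stop offloading once the NIC is this busy. */
#define NIC_CPU_HIGH_WATER 0.90

/* Returns true ("Yes": hand the connection off) or false ("No": keep it
 * on the host stack). Checks an OS-imposed cap, finite NIC memory, and
 * finite NIC compute power, in that order. */
bool should_offload(const struct conn_info *c, const struct nic_info *n)
{
    if (n->offloaded_conns >= n->max_conns)
        return false;
    if (c->state_bytes > n->mem_free_bytes)
        return false;
    if (n->cpu_utilization >= NIC_CPU_HIGH_WATER)
        return false;
    return true;
}
```

A policy aimed at the objectives named on the poster, such as maximizing packet rates or fair allocation, would replace the fixed thresholds with its own criteria; keeping the decision in one host-side function is what lets the OS control how much work the NIC receives.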
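The claim that connection handoff leaves the socket interface unchanged can be illustrated by a socket-layer send routine that dispatches on an ownership flag. Everything here is hypothetical: the struct, the function names, and the byte counters stand in for the real host TCP output path and for a bypass path that would forward data to the NIC as a handoff-interface message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical socket state: handoff only adds an ownership flag. */
struct sock {
    bool   offloaded;        /* true once the connection is on the NIC */
    size_t host_tcp_bytes;   /* bytes sent via host TCP (demo counter) */
    size_t nic_bytes;        /* bytes forwarded to the NIC (demo counter) */
};

/* Stand-in for the host TCP/IP output path. */
static void host_tcp_output(struct sock *so, size_t len)
{
    so->host_tcp_bytes += len;
}

/* Stand-in for the bypass path. ASSUMPTION: a real system would post a
 * send message to the NIC through the driver instead of counting bytes. */
static void bypass_to_nic(struct sock *so, size_t len)
{
    so->nic_bytes += len;
}

/* Socket-layer send: applications call the same entry point either way,
 * so the sockets API is unchanged; only the path below the socket differs. */
void sock_send(struct sock *so, size_t len)
{
    if (so->offloaded)
        bypass_to_nic(so, len);
    else
        host_tcp_output(so, len);
}

/* Handoff: a connection the OS has already established moves to the NIC. */
void sock_handoff(struct sock *so)
{
    so->offloaded = true;
}
```

Because the OS establishes the connection before handing it off, the same `struct sock` serves both phases; the application never sees which path carried its data.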
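The handoff-interface message types listed in the message-count figure can be captured as an enumeration. The sketch assumes, from the naming alone, that ch_nic_* messages go from the host OS to the NIC and ch_os_* messages go from the NIC to the OS; the poster does not state this. It also assumes that "messages per packet" means total messages divided by packets processed. The helper function names are hypothetical.

```c
#include <stddef.h>

/* Message types named in the poster's figure. ASSUMPTION: ch_nic_* go
 * host OS -> NIC and ch_os_* go NIC -> host OS. */
enum ch_msg {
    CH_NIC_HANDOFF, CH_NIC_RESTORE, CH_NIC_SEND, CH_NIC_RECVD,
    CH_NIC_CTRL, CH_NIC_FORWARD,
    CH_OS_RECV, CH_OS_ACK, CH_OS_CTRL, CH_OS_RESOURCE, CH_OS_RESTORE,
    CH_MSG_COUNT
};

enum ch_dir { TO_NIC, TO_OS };

/* Relies on every ch_nic_* value preceding CH_OS_RECV in the enum. */
enum ch_dir ch_msg_direction(enum ch_msg m)
{
    return m < CH_OS_RECV ? TO_NIC : TO_OS;
}

/* Total host/NIC messages divided by packets processed; 0 if no packets. */
double msgs_per_packet(const unsigned long counts[CH_MSG_COUNT],
                       unsigned long packets)
{
    unsigned long total = 0;
    if (packets == 0)
        return 0.0;
    for (size_t i = 0; i < CH_MSG_COUNT; i++)
        total += counts[i];
    return (double)total / (double)packets;
}
```

Under this reading, the poster's observation that socket operations are rarer than transmits and receives shows up as a messages-per-packet value well below one for offloaded connections.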
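The four components of the proposed standard firmware interface could be expressed as a table of function pointers that each NIC implements, so that firmware logic is written once against the table. The names and signatures below are hypothetical, not an existing API. The mock board at the end, with an array standing in for host memory, exists only to exercise the sketch on an ordinary PC.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical portable firmware API covering the four components. */
struct fw_ops {
    int   (*mac_send)(const void *frame, size_t len);              /* 1. MAC */
    int   (*mac_recv)(void *frame, size_t cap);
    int   (*dma_read)(void *dst, size_t host_off, size_t len);     /* 2. DMA */
    int   (*dma_write)(size_t host_off, const void *src, size_t len);
    int   (*msg_post)(const void *msg, size_t len);                /* 3. driver messages */
    void *(*mem_alloc)(size_t len);                                /* 4. memory */
    void  (*mem_free)(void *p);
};

/* Board-independent firmware logic: fetch len bytes of host data by DMA
 * and transmit them. Returns 0 on success, -1 on any failure. */
int fw_transmit_from_host(const struct fw_ops *ops, size_t host_off, size_t len)
{
    int rc = -1;
    void *buf = ops->mem_alloc(len);
    if (buf == NULL)
        return -1;
    if (ops->dma_read(buf, host_off, len) == 0 && ops->mac_send(buf, len) == 0)
        rc = 0;
    ops->mem_free(buf);
    return rc;
}

/* ---- Mock board: "host memory" is a plain array (testing only) ---- */
static unsigned char mock_host_mem[256];
static unsigned char mock_last_tx[256];
static size_t mock_last_tx_len;

static int mock_mac_send(const void *f, size_t len)
{
    if (len > sizeof mock_last_tx)
        return -1;
    memcpy(mock_last_tx, f, len);
    mock_last_tx_len = len;
    return 0;
}
static int mock_mac_recv(void *f, size_t cap) { (void)f; (void)cap; return 0; }
static int mock_dma_read(void *dst, size_t off, size_t len)
{
    if (off > sizeof mock_host_mem || len > sizeof mock_host_mem - off)
        return -1;                      /* out-of-range DMA request */
    memcpy(dst, mock_host_mem + off, len);
    return 0;
}
static int mock_dma_write(size_t off, const void *src, size_t len)
{
    if (off > sizeof mock_host_mem || len > sizeof mock_host_mem - off)
        return -1;
    memcpy(mock_host_mem + off, src, len);
    return 0;
}
static int mock_msg_post(const void *m, size_t len) { (void)m; (void)len; return 0; }

const struct fw_ops mock_ops = {
    mock_mac_send, mock_mac_recv, mock_dma_read, mock_dma_write,
    mock_msg_post, malloc, free
};
```

Porting `fw_transmit_from_host` to another NIC would then mean supplying a different `struct fw_ops`, not rewriting the function, which is the portability benefit the poster attributes to a standard API.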