Building Network-Centric Systems
Liviu Iftode

Before WWW, People Were Happy...
- E-mail, Telnet, Emacs, NFS over TCP/IP (CS.umd.EDU, CS.rutgers.EDU)
- Mostly local computing
- Occasional TCP/IP networking, with low expectations and mostly non-interactive traffic:
  - local area networks: file server (NFS)
  - wide area networks (the Internet): E-mail, Telnet, FTP
- Networking was not a major concern for the OS

One Exception: Cluster Computing
- Multicomputers, clusters of computers
- Cost-effective solution for high-performance distributed computing
- TCP/IP networking was the headache: large software overheads
- Software DSM: not a network-centric system :-(

The Great WWW Challenge
- Web browsing (e.g., http://www.Bank.com over TCP/IP)
- The World Wide Web made access over the Internet easy
- The Internet became commercial
- Dramatic increase of interactive traffic
- WWW networking creates a network-centric system, the Internet server:
  - performance: service more network clients
  - availability: be accessible all the time over the network
  - security: protect resources against network attacks

Network-Centric Systems
Networking dominates the operating system:
- Mobile Systems: mobility-aware TCP/IP (Mobile IP, I-TCP, etc.), disconnected file systems (Coda), adaptation-aware applications for mobility (Odyssey), etc.
- Internet Servers: resource allocation (Lazy Receiver Processing, Resource Containers), OS shortcuts (Scout, IO-Lite), etc.
- Pervasive/Ubiquitous Systems: TinyOS, sensor networks (Directed Diffusion, etc.), programmability (one.world, etc.)
- Storage Networking: network-attached storage (NASD, etc.), peer-to-peer systems (OceanStore, etc.), secure file systems (SFS, Farsite), etc.

Big Picture
- Research sparked by various OS-networking tensions
- Shift of focus from performance to availability and manageability
- Networking and storage I/O convergence
- Server-based and serverless systems
- TCP/IP and non-TCP/IP protocols
- Local-area, wide-area, ad hoc, and application/overlay networks
- Significant interest from industry

Outline
- TCP Servers
- Migratory TCP and Service Continuations
- Cooperative Computing, Smart Messages and Spatial Programming
- Federated File Systems
- Talk Highlights and Conclusions

Problem 1: TCP/IP is too Expensive
Breakdown of the CPU time for Apache (uniprocessor-based Web server):
- network processing: 71%
- user space: 20%
- other system calls: 9%

Traditional Send/Receive Communication
- Sender: the application calls send(a); the OS copies the data into a kernel send buffer (copy(a, send_buf)) and DMAs it to the NIC (DMA(send_buf, NIC))
- Receiver: the NIC raises an interrupt; the OS DMAs the packet into a kernel receive buffer (DMA(NIC, recv_buf)) and copies it into the application buffer (copy(recv_buf, b)) when the application calls receive(b)

A Closer Look
Finer breakdown of the CPU time for Apache:
- TCP send: 45%
- user space: 20%
- software interrupt processing: 11%
- other system calls: 9%
- hardware interrupt processing: 8%
- TCP receive: 7%
- IP send and IP receive: ~0%

Multiprocessor Server Performance Does Not Scale
[Figure: throughput (requests/s, 0-700) vs. offered load (connections/s, 300-750) for a uniprocessor and a dual processor. Apache 1.3.20 on a 1-way and a 2-way 300 MHz Pentium II SMP, with clients repeatedly accessing a static 16 KB file.]

TCP/IP-Application Co-Habitation
- TCP/IP "steals" compute cycles and memory from applications
- TCP/IP executes in kernel mode: mode-switching overhead
- TCP/IP executes asynchronously:
  - interrupt-processing overhead
  - internal synchronization on multiprocessor servers causes execution serialization
  - cache pollution
- Hidden "service work": TCP packet retransmission, TCP ACK processing, ARP request service
- Extreme cases can compromise server performance: receive livelocks, denial-of-service (DoS) attacks
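To make the copy and mode-switch costs above concrete, here is a minimal C sketch of the traditional send path using the standard sockets API; the comments mark the kernel-side steps from the "Traditional Send/Receive Communication" slide. The function itself is illustrative, not taken from the talk.

    /* Sketch: where the cycles go on a traditional socket send.
     * Plain POSIX sockets; the kernel-side work is marked in comments. */
    #include <sys/types.h>
    #include <sys/socket.h>

    int send_buffer(int sock, const char *buf, size_t len)
    {
        size_t sent = 0;
        while (sent < len) {
            /* send() is a system call: a user-to-kernel mode switch.
             * In the kernel: copy(a, send_buf), i.e. the data is copied
             * from buf into a socket buffer, TCP/IP processing runs, and
             * DMA(send_buf, NIC) hands the packets to the interface.
             * On the receiver the path reverses: a per-packet interrupt,
             * DMA(NIC, recv_buf), and a second copy(recv_buf, b) when
             * the application finally calls recv(). */
            ssize_t n = send(sock, buf + sent, len - sent, 0);
            if (n < 0)
                return -1;  /* the error path also crossed into the kernel */
            sent += (size_t)n;
        }
        return 0;
    }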
Two Solutions
- Replace TCP/IP with a lightweight transport protocol
- Offload some or all of TCP from the host to a dedicated computing unit (a processor, a computer, or an "intelligent" network interface)
Industry offers high-performance, expensive solutions:
- memory-to-memory communication: InfiniBand
- "intelligent" network interface: the TCP Offload Engine (TOE)
A cost-effective and flexible solution: TCP Servers

Memory-to-Memory (M-M) Communication
[Diagram: with TCP/IP, send/receive data passes through the OS and memory buffers on both sender and receiver; with Remote DMA, the NICs move data directly between the sender's and receiver's memory buffers.]

Memory-to-Memory Communication is Non-Intrusive
- RDMA_Write(a, b): the sender's NIC writes buffer a directly into the receiver's buffer b, and b is updated
- Sender: low overhead; receiver: zero overhead

TCP Server at a Glance
- A software offloading architecture using existing hardware
- Basic idea: dedicate one or more computing units exclusively to TCP/IP
- Compared to TOE:
  - tracks technology better: runs on the latest processors
  - flexible: adapts to changing load conditions
  - cost-effective: no extra hardware
- Isolates application computation from network processing
- Eliminates network interrupts and context switches
- Efficient resource allocation
- Additional performance gains (zero-copy) with an extended socket API
- Related work: very preliminary offloading solutions (Piglet, CSP), Sockets Direct Protocol, zero-copy TCP

Two TCP Server Architectures
- TCP Servers for multiprocessor servers: the server application and the TCP Server run on separate CPUs and communicate through shared memory
- TCP Servers for cluster-based servers: the server application and the TCP Server run on separate nodes and communicate through M-M communication

Where to Split TCP/IP Processing? (How Much to Offload?)
The send and receive paths from the application down to the wire; application processors run the layers above the chosen split, TCP Servers the layers below:

  SEND path                        RECEIVE path
  system call                      system call
  copy_from_application_buffers    copy_to_application_buffers
  TCP_send                         TCP_receive
  IP_send                          IP_receive
  packet_scheduler                 software_interrupt_handler
  setup_DMA                        interrupt_handler
  packet_out                       packet_in

Evaluation Testbed
- Multiprocessor server: 4-way 550 MHz Intel Pentium II system running the Apache 1.3.20 web server on Linux 2.4.9
- NIC: 3Com 996-BT Gigabit Ethernet
- Client program: sclients [Banga 97]

Comparative Throughput
[Figure: throughput (requests/s, 0-3,500) for a uniprocessor, a 4-processor SMP, an SMP with 1 TCP Server, and an SMP with 2 TCP Servers; clients issue file requests according to a web server trace.]

Adaptive TCP Servers
- Static TCP Server configuration:
  - too few TCP Servers can make network processing the bottleneck
  - too many TCP Servers degrade the performance of CPU-intensive applications
- Dynamic TCP Server configuration (a control-loop sketch follows at the end of this section):
  - monitor the TCP Server queue lengths and the system load
  - dynamically add or remove TCP Server processors

Next Target: Storage Networking
- The storage networking dilemma: TCP or not TCP?
  - M-M communication (InfiniBand)
  - iSCSI (SCSI over IP)
  - DAFS (Direct Access File System)
- Non-TCP/IP solutions require new wiring or tunneling over IP-based Ethernet networks
- TCP/IP solutions require TCP offloading

Future Work: TCP Servers & iSCSI
- Use TCP Servers to connect to SCSI storage using the iSCSI protocol over TCP/IP networks
[Diagram: the server application and the TCP Server & iSCSI engine run on separate CPUs over shared memory; the TCP Server reaches SCSI storage via iSCSI over TCP/IP.]
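A minimal sketch of the dynamic configuration policy from the "Adaptive TCP Servers" slide, written as a user-level control loop. The thresholds, the sampling interval, and the four hook functions are assumptions for illustration; the talk does not specify them.

    /* Sketch: control loop for adaptive TCP Server provisioning.
     * get_tcp_server_queue_len(), get_app_cpu_load(), add_tcp_server()
     * and remove_tcp_server() are hypothetical hooks into the system. */
    #include <unistd.h>

    #define QLEN_HIGH     128   /* backlog signaling a network bottleneck */
    #define APP_LOAD_HIGH 0.90  /* application CPUs saturated */
    #define MAX_TCP_SRV   2

    extern int    get_tcp_server_queue_len(void);
    extern double get_app_cpu_load(void);    /* 0.0 .. 1.0 */
    extern void   add_tcp_server(void);      /* dedicate one more CPU */
    extern void   remove_tcp_server(void);   /* return a CPU to the app */

    void adapt_loop(void)
    {
        int ntcp = 1;  /* start with one dedicated TCP Server CPU */
        for (;;) {
            int    qlen = get_tcp_server_queue_len();
            double load = get_app_cpu_load();

            if (qlen > QLEN_HIGH && ntcp < MAX_TCP_SRV) {
                add_tcp_server();            /* network is the bottleneck */
                ntcp++;
            } else if (qlen < QLEN_HIGH / 4 && load > APP_LOAD_HIGH &&
                       ntcp > 1) {
                remove_tcp_server();         /* give the CPU back to the app */
                ntcp--;
            }
            sleep(1);  /* sampling interval: an arbitrary choice */
        }
    }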
Problem 2: TCP/IP is too Rigid

Server vs. Service Availability
- The client is interested in service availability
- Adverse conditions may affect service availability:
  - internetwork congestion or failure
  - servers overloaded, failed, or under DoS attack
- TCP has one response: network delays => packet loss => retransmission
- TCP limits the OS solutions for service availability:
  - early binding of a service to a server
  - the client cannot switch to another server for sustained service after the connection is established

Service Availability through Migration
[Diagram: a client's connection migrates from Server 1 to Server 2.]

Migratory TCP at a Glance
- Migratory TCP migrates live connections among cooperating servers
- The migration mechanism is:
  - generic (not application-specific)
  - lightweight (fine-grained migration) and low-latency
- Migration can be triggered by the client or the server
- Servers can be geographically distributed (different IP addresses)
- Requires changes to the server application
- Totally transparent to the client application
- Interoperates with existing TCP
- Migration policies are decoupled from the migration mechanism

Basic Idea: Fine-Grained State Migration
[Diagram: the Server 1 process holds application state and per-connection state for connections C1-C6; the state of one connection moves to the Server 2 process.]

Migratory TCP (Lazy) Protocol
[Diagram: Server 1, client, Server 2.]

Non-Intrusive Migration
- Migrate state without involving the old server application (only the old server OS):
  - the old server exports per-connection state periodically
  - connection state and application state can go out of sync
  - upon migration, the new server imports the last exported state of the migrated connection
  - the OS uses the connection state to synchronize with the application
- Non-intrusive migration with M-M communication:
  - uses RDMA read to extract state from the old server with zero overhead
  - works even when the old server is overloaded or frozen

Service Continuation (SC)
[Diagram: a front-end server process holds the socket for connection C1 and SC pipes to two back-end server processes; the connection state, pipe state, and exported application state travel together as one Service Continuation.]
Front end:
  sc = create_cont(C1);
  p1 = pipe();
  associate(sc, p1);
  fork_exec(Process1);
  ...
  export(sc, state);
Back-end Process1:
  sc = open_cont(p1);
  ...
  export(sc, state);
Back-end Process2:
  sc = open_cont(p2);
  ...
  export(sc, state);
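The Service Continuation slide names the calls create_cont, associate, export, and open_cont. Here is a sketch of how a two-process server might use them; the types and signatures below are assumed, and only the call names come from the talk.

    /* Sketch: a two-process server using the SC API named on the slide.
     * sc_t, conn_t, and all signatures are hypothetical; only the call
     * names (create_cont, associate, export, open_cont) are from the talk. */
    #include <unistd.h>

    typedef int sc_t;
    typedef int conn_t;

    extern sc_t create_cont(conn_t c);                 /* new SC for c */
    extern int  associate(sc_t sc, int pipefd);        /* add pipe to SC */
    extern int  export(sc_t sc, const void *s, int n); /* export app state */
    extern sc_t open_cont(int pipefd);                 /* join an SC */

    struct app_state { long bytes_served; };

    void front_end(conn_t c1)
    {
        int p1[2];
        struct app_state state = { 0 };

        sc_t sc = create_cont(c1);  /* per-connection Service Continuation */
        pipe(p1);
        associate(sc, p1[1]);       /* the SC now spans the pipe as well */
        if (fork() == 0) {          /* stand-in for fork_exec(Process1) */
            sc_t sc2 = open_cont(p1[0]);       /* back end joins the SC */
            export(sc2, &state, sizeof state); /* periodic state export */
            _exit(0);
        }
        /* ... serve requests, exporting state at application-chosen
         * points; on migration the new server imports the last export. */
        export(sc, &state, sizeof state);
    }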
Related Work
- Process migration: Sprite [Douglis '91], Locus [Walker '83], MOSIX [Barak '98], etc.
- VM migration [Rosenblum '02, Nieh '02]
- Migration in web server clusters [Snoeren '00, Luo '01]
- Fault-tolerant TCP [Alvisi '00]
- TCP extensions for host mobility: I-TCP [Bakre '95], Snoop TCP [Balakrishnan '95], end-to-end approaches [Snoeren '00], MSOCKS [Maltz '98]
- SCTP (RFC 2960)

Evaluation
- Implemented SC and M-TCP in the FreeBSD kernel
- Integrated SC into real Internet servers: web, media streaming, transactional DB
- Microbenchmark: impact of migration on client-perceived throughput for a two-process server using TTCP
- Real applications: sustain web server throughput under load produced by increasing the number of client connections

Impact of Migration on Throughput
[Figure: effective throughput (KB/s, roughly 7,300-8,000) vs. migration period (2, 5, and 10 s, plus a no-migration baseline) for SC sizes of 1, 5, and 10 KB.]

Web Server Throughput
[Figure: throughput (replies/s, 0-900) and migrated connections (0-12,000) vs. offered load (300-1,700 connections/s) for M-Apache and Apache.]

Future Research: Use SC to Build Self-Healing Cluster-Based Systems

Problem 3: Computer Systems Move Outdoors
- Sensors, Linux watch, Linux camera, Linux car
- Massive numbers of computers will be embedded everywhere in the physical world
- Dynamic ad hoc networking
- How do we execute user-defined applications over these networks?

Outdoor Distributed Computing (vs. Indoor)
Traditional distributed computing has been indoor:
- target: performance and/or fault tolerance
- stable configuration, robust networking (TCP/IP or M-M)
- relatively small scale
- functionally equivalent nodes
- message passing or shared-memory programming

Outdoor Distributed Computing
- Target: collect/disseminate distributed data and/or perform collective tasks
- Volatile nodes and links
- Node equivalence is determined by physical properties (content-based naming)
- Data migration is not a good fit:
  - end-to-end transfers are expensive to perform
  - end-to-end control is too rigid for such a dynamic network

Cooperative Computing at a Glance
- Distributed computing with execution migration
- A Smart Message (SM) carries the execution state (and possibly the code) in addition to the payload:
  - the execution state is assumed to be small (explicit migration)
  - the code is usually cached (few applications)
- Nodes "cooperate" by allowing Smart Messages to execute on them and to use their memory to store "persistent" data (tags)
- Nodes do not provide routing; a Smart Message executes on each node of its path:
  - the application executes on target nodes (nodes of interest)
  - the routing executes on each node of the path (self-routing)
- During its lifetime, an application generates at least one, possibly multiple, Smart Messages

Smart vs. "Dumb" Messages
[Diagram: Mary's lunch (appetizer, entree, dessert) illustrates data migration vs. execution migration.]

Smart Messages
Example: an SM visits three nodes carrying the Hot tag and turns their water on.
Application:
  do {
    migrate(Hot_tag, timeout);
    Water_tag = ON;
    N = N + 1;
  } until (N == 3 or timeout);
Routing:
  migrate(tag, timeout) {
    do {
      if (NextHot_tag)
        sys_migrate(NextHot_tag, timeout);
      else {
        spawn_SM(Route_Discovery, Hot);
        block_SM(NextHot_tag, timeout);
      }
    } until (Hot_tag or timeout);
  }

Cooperative Node Architecture
- SM arrival -> admission manager -> virtual machine (SM execution, SM migration, scheduling) -> tag space -> OS & I/O
- Admission control for resource security
- Non-preemptive scheduling with timeout-kill
- Tags are created by SMs (limited lifetime) or are I/O tags (permanent)
- Global tag name space: {hash(SM code), tag name}
- Five protection domains, defined using hash(SM code), the SM source node ID, and the SM starting time
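A minimal sketch of the tag space described above, keyed by {hash(SM code), tag name}. The hash function (FNV-1a), the table layout, and the field names are assumptions; the talk specifies only the naming scheme, and the five protection domains are simplified away here.

    /* Sketch: a node's tag space keyed by {hash(SM code), tag name}.
     * Layout, field names, and the hash are assumptions. */
    #include <stdint.h>
    #include <string.h>

    #define MAX_TAGS 256

    struct tag {
        uint64_t code_hash;  /* hash of the owning Smart Message's code */
        char     name[32];   /* tag name, scoped by code_hash */
        void    *data;       /* "persistent" data left behind at this node */
        long     expires;    /* SM tags have a limited lifetime;
                                0 marks a permanent I/O tag */
    };

    static struct tag tag_space[MAX_TAGS];
    static int ntags;

    /* FNV-1a: an arbitrary stand-in for the talk's hash(SM code). */
    static uint64_t fnv1a(const void *buf, size_t len)
    {
        const unsigned char *p = buf;
        uint64_t h = 1469598103934665603ULL;
        while (len--) { h ^= *p++; h *= 1099511628211ULL; }
        return h;
    }

    /* Global name space lookup; a real node would also check the five
     * protection domains (code hash, source node ID, starting time). */
    struct tag *tag_lookup(const void *sm_code, size_t code_len,
                           const char *name)
    {
        uint64_t h = fnv1a(sm_code, code_len);
        for (int i = 0; i < ntags; i++)
            if (tag_space[i].code_hash == h &&
                strcmp(tag_space[i].name, name) == 0)
                return &tag_space[i];
        return NULL;
    }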
Related Work
- Mobile agents (D'Agents, Ajanta)
- Active networks (ANTS, SNAP)
- Sensor networks (Directed Diffusion, TinyOS, TAG)
- Pervasive computing (one.world)

Prototype Implementation
- 8 HP iPAQs running Linux
- 802.11 wireless communication
- Sun Java K Virtual Machine
- Geographic (simplified GPSR) and on-demand (AODV) routing
[Diagram: a user node reaches a node of interest through intermediate nodes.]

Completion time:
  Routing algorithm    Code not cached (ms)   Code cached (ms)
  Geographic (GPSR)    415.6                  126.6
  On-demand (AODV)     506.6                  314.7

Self-Routing
- There is no best routing outdoors: it depends on the application and on node-property dynamics
- Application-controlled routing is possible with Smart Messages (the execution state is carried in the message)
- When a migration times out, the application is upcalled on the current node to decide what to do next (see the sketch below)

Self-Routing Effectiveness (simulation)
- Geographic routing to reach the target regions
- On-demand routing within a region
- The application decides when to switch between the two
[Diagram: a starting node reaches nodes of interest among other nodes.]
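A sketch of the self-routing switch just described: geographic routing until the Smart Message reaches the target region, on-demand routing inside it, with the decision re-made in the timeout upcall. All function names and signatures here are assumptions standing in for the prototype's primitives.

    /* Sketch: application-controlled routing switch on migration timeout.
     * migrate_geo()/migrate_aodv() stand in for the prototype's geographic
     * and on-demand routing; every signature here is an assumption. */
    extern int migrate_geo(const char *tag, int timeout_ms);  /* 0 = arrived */
    extern int migrate_aodv(const char *tag, int timeout_ms); /* <0 = timeout */
    extern int in_target_region(void);

    /* Geographic routing toward the target region, on-demand routing
     * once inside it; on each timeout the SM is upcalled here. */
    int self_route(const char *tag, int timeout_ms, int max_tries)
    {
        for (int tries = 0; tries < max_tries; tries++) {
            int rc = in_target_region() ? migrate_aodv(tag, timeout_ms)
                                        : migrate_geo(tag, timeout_ms);
            if (rc == 0)
                return 0;  /* reached a node of interest carrying the tag */
            /* Timeout: execution resumes on whatever node the SM reached;
             * the next iteration picks the algorithm for that position. */
        }
        return -1;         /* give up; the application must tolerate this */
    }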
Next Target: Spatial Programming
- Smart Messages are too low-level for programming
- How do we describe distributed computing over dynamic outdoor networks of embedded systems, with limited knowledge of resource numbers, locations, etc.?
- Spatial Programming (SP) design guidelines:
  - space is a first-order programming concept
  - resources are named by their expected location and properties (spatial references)
  - reference consistency: spatial reference-to-resource mappings are consistent throughout the program
  - the program must tolerate resource dynamics
- SP can be implemented using Smart Messages (the spatial-reference mapping table is carried as payload)

Spatial Programming Example
Mobile sprinklers with temperature sensors on two hills (Left Hill, Right Hill). Program the sprinklers to water the hottest spot of the Left Hill; {Left_Hill:Hot} is the spatial reference for hot spots on the Left Hill:
  for (i = 0; i < 10; i++)
    if ({Left_Hill:Hot}[i].temp > Max_temp) {
      Max_temp = {Left_Hill:Hot}[i].temp;
      id = i;
    }
  {Left_Hill:Hot}[id].water = ON;
- What if there are fewer than 10 hot spots?
- Spatial reference consistency: {Left_Hill:Hot}[id] must still name the same sprinkler when the water is turned on

Problem 4: Manageable Distributed File Systems
- Most distributed file servers use TCP/IP for both client-server and intra-server communication
- Strong file consistency, file locking, and load balancing are difficult to provide
- File servers require significant human effort to manage: adding storage, moving directories, etc.
- Cluster-based file servers are cost-effective:
  - scalable performance requires load balancing
  - load balancing may require file migration
  - file migration is limited if file naming is location-dependent
- We need a scalable, location-independent, and easy-to-manage cluster-based distributed file system

Federated File System (FedFS) at a Glance
[Diagram: applications (A1, A2, A3) run over FedFS, a global file name space layered over a cluster of autonomous local file systems interconnected by an M-M network.]

Location-Independent Global File Naming
- Virtual Directory (VD): the union of local directories
  - volatile, created on demand (dirmerge)
  - contains information about files, including location (the homes of files)
  - assigned dynamically to nodes (managers)
  - supports location-independent file naming and file migration
- Directory Tables (DT): local caches of VD entries (analogous to a TLB)
[Diagram: the virtual directory /usr, holding file1 and file2, merges the local /usr directories of local file systems 1 and 2.]

Direct Access File System (DAFS) and Federated DAFS
[Diagrams of three configurations: (1) distributed NFS over FedFS: NFS clients reach NFS servers over TCP/IP, and the servers share files through FedFS over the M-M interconnect; (2) DAFS: applications with DAFS clients reach DAFS servers directly over M-M; (3) Federated DAFS: DAFS servers run FedFS over the M-M interconnect.]

Related Work
- Cluster-based file systems: Frangipani [Thekkath '97], PVFS [Carns '00], GFS, Archipelago [Ji '00], Trapeze (Duke)
- DAFS [NetApp '03; Magoutis '01, '02, '03]
- User-level communication in cluster-based network servers [Carrera '02]

Experimental Platform
- Eight-node server cluster: 800 MHz PIII, 512 MB SDRAM, 9 GB 10K RPM SCSI disk per node
- Client: dual-processor 300 MHz PII, 512 MB SDRAM
- Linux 2.4
- Servers and clients equipped with Emulex cLAN adapters (M-M network)

Workload I
- Postmark, a synthetic benchmark:
  - short-lived small files
  - a mix of metadata-intensive operations
- Postmark outline: create a pool of files; perform transactions (READ/WRITE paired with CREATE/DELETE); delete the created files
- Each Postmark client performs 30,000 transactions
- Clients distribute requests to servers using a hash function on pathnames (a sketch of this dispatch follows the Postmark results below)
- Files are physically placed on the node that receives the client requests

Postmark Throughput
[Figure: Postmark throughput (transactions/s, 0-30,000) vs. number of servers (1-8) for file sizes of 2, 4, 8, and 16 KB.]

Workload II
- Postmark performs only READ transactions (no create/delete operations)
- Federated DAFS does not control file placement: no client request is sent to the file's correct location

Postmark Read Throughput
[Figure: Postmark read throughput (transactions/s, 0-60,000) vs. number of servers for PostmarkRead and PostmarkRead-NoCache.]
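The Workload I slide says clients distribute requests using a hash function on pathnames. Here is a minimal sketch of such a dispatch; the particular hash (djb2) and the direct modulo mapping are assumptions, not necessarily what the experiments used.

    /* Sketch: distributing file requests across cluster nodes by hashing
     * the pathname, as in the Postmark setup above. The hash (djb2) and
     * NSERVERS are illustrative choices. */
    #include <stdio.h>

    #define NSERVERS 8  /* eight-node cluster from the experimental platform */

    static unsigned long djb2(const char *s)
    {
        unsigned long h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h;
    }

    int server_for(const char *path)
    {
        return (int)(djb2(path) % NSERVERS);
    }

    int main(void)
    {
        const char *paths[] = { "/pool/f000123", "/pool/f000124",
                                "/pool/f000125" };
        for (int i = 0; i < 3; i++)
            printf("%s -> server %d\n", paths[i], server_for(paths[i]));
        return 0;
    }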
Next Target: Federated DAFS over the Internet
[Diagram: applications with DAFS clients reach DAFS servers over M-M locally and over TCP/IP across the Internet; the DAFS servers run FedFS over their local file systems.]

Outline
- TCP Servers
- Migratory TCP and Service Continuations
- Cooperative Computing, Smart Messages and Spatial Programming
- Federated File Systems
- Talk Highlights and Conclusions

Talk Highlights
- Back to migration:
  - Service Continuations: service availability and self-healing clusters
  - Smart Messages: programming dynamic networks of embedded systems
- Exploit non-intrusive M-M communication: TCP offloading, state migration, federated file systems
- Network and storage I/O convergence: TCP Servers & iSCSI, Federated File Systems & M-M
- Programmability:
  - Smart Messages and Spatial Programming
  - extended server APIs: Service Continuations, TCP Servers, federated file system

Conclusions
- Network-centric systems are a very promising border-crossing systems research area
- Common issues span a large spectrum of systems and networks
- Tremendous potential to impact industry

Acknowledgements
- UMD students: Andrzej Kochut, Chunyuan Liao, Tamer Nadeem, Iulian Neamtiu and Jihwang Yeo
- Rutgers students: Ashok Arumugam, Kalpana Banerjee, Aniruddha Bohra, Cristian Borcea, Suresh Gopalakrisnan, Deepa Iyer, Porlin Kang, Vivek Pathak, Murali Rangarajan, Rabita Sarker, Akhilesh Saxena, Steve Smaldone, Kiran Srinivasan, Florin Sultan and Gang Xu
- Post-doc: Chalermek Intanagonwiwat
- Collaborations at Rutgers: EEL (Ulrich Kremer), DARK (Ricardo Bianchini), PANIC (Rich Martin and Thu Nguyen)
- Support: NSF ITR ANI-0121416 and CAREER CCR-013366