High Performance Distributed Computing

Chapter 1
Introduction: Basic Concepts

Objective of this chapter: Chapter one provides an introduction to distributed systems and how to characterize them. In addition, the chapter describes the evolution of distributed systems as well as the research challenges facing the design of general purpose high performance distributed systems.

Key Terms
Complexity, Grid structure, High performance distributed system.

1.1 Introduction

The last two decades spawned a revolution in the world of computing: a move away from central mainframe-based computing to network-based computing. Today, workstation servers are quickly reaching the levels of CPU performance, memory capacity, and I/O bandwidth once available only in mainframes, at a cost an order of magnitude below that of mainframes. Workstations are being used to solve computationally intensive problems in science and engineering that once belonged exclusively to the domain of supercomputers.

A distributed computing system is a system architecture that makes a collection of heterogeneous computers or workstations act and behave as a single computing system. In such a computing environment, users can uniformly access and name local or remote resources, and run processes from anywhere in the system, without being aware of which computers their processes are running on.

Many claims have been made for distributed computing systems. In fact, it is hard to find a desirable feature of a computing system that has not been claimed to be offered by some distributed system [Comer et al., 1991]. However, the recent advances in computing, networking, and software have made it feasible to achieve the following advantages:

• Increased Performance: The existence of multiple computers in a distributed system allows applications to be processed in parallel and thus improves application and system performance. For example, the performance of a file system can be improved by replicating its functions over several computers; file replication allows several applications to access that file system in parallel. Furthermore, file replication distributes the network traffic needed to access the file over different sites and thus reduces network contention and queuing delays.

• Sharing of Resources: Distributed systems enable efficient access to all the system resources. Users can share special purpose and sometimes expensive hardware and software resources such as database servers, compute servers, virtual reality servers, multimedia information servers, and print servers, just to name a few.

• Increased Extendibility: Distributed systems can be designed to be modular and adaptive, so that for certain computations the system will configure itself to include a large number of computers and resources, while in other instances it will consist of just a few resources. Furthermore, file system capacity and computing power can be increased incrementally rather than replacing all the system resources in order to acquire a higher performance and capacity system.

• Increased Reliability, Availability and Fault Tolerance: The existence of multiple computing and storage resources in a distributed system makes it attractive and cost-effective to introduce redundancy in order to improve the system's dependability and fault tolerance. The system can tolerate the failure of one computer by allocating its tasks to another available computer.
Furthermore, by replicating system functions, the system can tolerate one or more component failures.

• Cost-Effectiveness: Over the last decade, the performance of computers has been improving by approximately 50% per year while their cost has been dropping by half every year [Patterson and Hennessy, 1994]. Furthermore, the emerging high speed optical network technology will make the development of distributed systems attractive in terms of price/performance ratio when compared to parallel computers. The cost-effectiveness of distributed systems has contributed significantly to the failure of the supercomputer industry to dominate the high performance computing market.

These advantages cannot be achieved easily, because designing a general purpose distributed computing system is several orders of magnitude more complex than designing a centralized computing system. The design of such systems is a complicated process because of the many options that the designers must evaluate and choose from, such as the type of communication network and communication protocol, the type of host-network interface, the distributed system architecture (e.g., pool, client-server, integrated, hybrid), the type of system level services to be supported (distributed file service, transaction service, load balancing and scheduling, fault tolerance, security, etc.), and the type of distributed programming paradigm (e.g., data parallel model, functional parallel model, message passing or distributed shared memory). Below is a list of the main issues that must be addressed.

• Lack of a good understanding of distributed computing theory. The field is relatively new, and to overcome this we need to experiment with and evaluate all possible architectures in order to design general purpose reliable distributed systems. Our current approach to designing such systems is ad hoc, and we need to develop a system-engineering theory before we can master the design of such systems. Mullender compared the design of a distributed system to the design of a reliable national railway system, which took a century and a half to be fully understood and mature [Bagley, 1993]. Similarly, distributed systems (which have been around for approximately two decades) need to evolve through several generations of different design architectures before their designs, structures and programming techniques can be fully understood and mature.

• The asynchronous and independent behavior of the computers complicates the control software that aims at making them operate as one centralized computing system. If the computers are structured in a master-slave relationship, the control software is easier to develop and the system behavior is more predictable. However, this structure conflicts with the distributed system property that requires computers to operate independently and asynchronously.

• The use of a communication network to interconnect the computers introduces another level of complexity; distributed system designers must master not only the design of the computing systems and their software, but also the design of reliable communication networks, how to achieve efficient synchronization and consistency among the system processes and applications, and how to handle faults in a system composed of geographically dispersed heterogeneous computers. The number of computers involved in the system can vary from a few to hundreds or even hundreds of thousands.
In spite of these difficulties, there has been some success in designing special purpose distributed systems such as banking systems, on-line transaction systems, and point-of-sale systems. However, the design of a general purpose reliable distributed system that has the advantages of both centralized systems (accessibility, management, and coherence) and networked systems (sharing, growth, cost, and autonomy) is still a challenging task [Stankovic, 1984].

Kleinrock [Tanenbaum, 1988] makes an interesting analogy between human-made computing systems and the brain. He points out that the brain is organized and structured very differently from our present computing machines. Nature has been extremely successful in implementing distributed systems that are far cleverer and more impressive than any computing machines humans have yet devised. We have succeeded in manufacturing highly complex devices capable of high-speed computation and massive accurate memory, but we have not gained sufficient understanding of distributed systems and distributed applications; our systems are still highly constrained and rigid in their construction and behavior. The gap between natural and man-made systems is huge, and more research is required to bridge this gap and to design better distributed systems.

Figure 1.1 An Example of a Distributed Computing System.

The main objective of this book is to provide a comprehensive study of the design principles and architectures of distributed computing systems. We first present a distributed system design framework to provide a systematic design methodology for distributed systems and their applications. The design framework decomposes the design issues into several layers to enable us to better understand the architectural design issues and the technologies available to implement each component of a distributed system. In addition to addressing the design issues and technologies for distributed computing systems, we will also focus on those that will be viable for building the next generations of wide area distributed systems (e.g., Grid and Autonomic computing systems), as shown in Figure 1.1.

1.2 Characterization of Distributed Systems

Distributed systems have been referred to by many different names, such as distributed processing, distributed data processing, distributed multiple computer systems, distributed database systems, network-based computing, cooperative computing, client-server systems, and geographically distributed multiple computer systems [Hwang and Briggs, 1984]. Bagley [Kung, 1992] has reported 50 different definitions of distributed systems. Other researchers find it acceptable to have many different definitions of distributed systems and even warn against adopting one single definition [Liebowitz and Carson, 1985; Bagley, 1993]. Furthermore, many different methods have been proposed to define and characterize distributed computing systems and distinguish them from other types of computing systems. In what follows, we present the important characteristics and services that have been proposed to characterize and classify distributed systems.

• Logical Property. The distributed system is defined as a collection of logical units that are physically connected through an agreed protocol for executing distributed programs [Liebowitz and Carson, 1985].
The logical notion allows the system components to interact and communicate without knowing their physical locations in the system.

• Distribution Property. This approach emphasizes the distribution feature of a distributed system. The word "distributed" implies that something has been spread out or scattered over a geographically dispersed area. At least four physical components of a computing system can be distributed: 1) hardware or processing logic, 2) data, 3) the processing itself, and 4) the control. However, a classification using only the distribution property is not sufficient to define distributed systems, and many existing computing systems satisfy this property; consider a collection of terminals attached to a mainframe, or an I/O processor within a mainframe. A definition based solely on the physical distribution of some components of the system does not capture the essence of a distributed system. A proper definition must also take into consideration the component types and how they interact.

• Distributed System Components. Enslow [Comer, 1991] presents a "research and development" definition of distributed systems that identifies five components of such a system. First, the system has a multiplicity of general-purpose resource components, including both hardware and software resources, that can be assigned to specific tasks on a dynamic basis. Second, there is a physical distribution of the hardware and software resources of the system, which interact through a communications network. Third, a high-level operating system unifies and integrates the control of the distributed system. Fourth, system transparency permits services to be requested by name only. Lastly, cooperative autonomy characterizes the operation of both hardware and software resources.

• Transparency Property. Other researchers emphasize the transparency property of the system and the degree to which it looks like a single integrated system to users and applications. Transparency is defined as the technique used to hide the separation from both the users and the application programs so that the system is perceived as one single system rather than a collection of computers. The transparency property is provided by the software structure overlaying the distributed hardware. Tanenbaum and van Renesse used this property to define a distributed system as one that looks to its users like an ordinary centralized system, but runs on multiple independent computers [Bagley, 1993; Halsall, 1992]. The authors of the ANSA Reference Manual [Borghoff, 1992] defined eight different types of transparency:

1. Access Transparency: This property allows local and remote files and other objects to be accessed using the same set of operations.

2. Location Transparency: This property allows objects to be accessed without knowing their physical locations.

3. Concurrency Transparency: This property enables multiple users or distributed applications to run concurrently without conflict; the users do not need to write any extra code to enable their applications to run concurrently in the system.

4. Replication Transparency: This property allows several copies of files and computations to exist in order to increase reliability and performance. These replicas are invisible to the users and application programs. The number of redundant copies can be selected dynamically by the system, or the user can specify the required number of replicas.
5. Failure Transparency: This property allows the system to continue operating correctly in spite of component failures; that is, it enables users and distributed applications to run to completion despite failures in hardware and/or software components, without modification of their programs.

6. Migration Transparency: This property allows system components (processes, threads, applications, files, etc.) to move within the system without affecting the operation of users or application programs. Such migration is triggered by the system software in order to improve desired system and/or application goals (e.g., performance, fault tolerance, security).

7. Performance Transparency: This property provides the system with the ability to dynamically balance its load and schedule the user applications (processes) in a manner transparent to the users, in order to optimize system and/or application performance.

8. Scaling Transparency: This property allows the system to be expanded or shrunk without changing the system structure or modifying the distributed applications supported by the system.

In this book, we assume that the resources of a distributed computing system may include a wide range of computing resources such as workstations, PCs, minicomputers, mainframes, supercomputers, and other special purpose hardware units. The underlying network interconnecting the system resources can span LANs, MANs and even WANs, can have different topologies (e.g., bus, ring, full connectivity, random interconnect), and can support a wide range of communication protocols. In high performance distributed computing environments, computers communicate and cooperate with latency and throughput comparable to those experienced in tightly coupled parallel computers. Based on these properties, we define a distributed system as a networked (loosely coupled) system of independent computing resources with adequate software structure to enable the integrated use of these resources toward a common goal.

1.3 Evolution of Distributed Computing Systems

Distributed computing systems have been evolving for more than two decades, and this evolution can be described in terms of four generations of distributed systems: Remote Execution Systems (RES), Distributed Computing Systems (DCS), High Performance Distributed Systems (Grid computing systems), and Autonomic Computing. Each generation can be distinguished by the type of computers, the communications networks, the software environments and the applications that are typically used in that generation.

1.3.1 Remote Execution Systems (RES): First Generation

The first generation spans the 1970s, a time when the first computer networks were being developed. During this era, computers were large and expensive; even minicomputers cost tens of thousands of dollars. As a result, most organizations had only a handful of computers, which were operated independently from one another and were located in one centralized computing center. Computer network concepts were first introduced around the mid 1970s, and the initial computer network research was funded by the federal government. For example, the Defense Advanced Research Projects Agency (DARPA) funded many pioneering research projects in packet switching, including the ARPANET. The ARPANET used conventional point-to-point leased line interconnection.
Experiments with packet switching over radio networks and satellite communication channels were also conducted during this period. The transmission rates of these networks were slow (typically in the 2400 to 9600 bits per second (bps) range). Most of the software available to the user for information exchange provided terminal emulation and file transfer capabilities. Consequently, most of the applications were limited to remote login, remote job execution and remote data entry.

1.3.2 Distributed Computing Systems (DCS): Second Generation

This era spans approximately the 1980s, when significant advances occurred in the computing, networking and software resources used to design distributed systems. In this period, computing technology introduced powerful microcomputer systems capable of providing computing power comparable to that of minicomputers and even mainframes at a much lower price. This made microcomputers attractive for designing distributed systems with better performance, reliability, fault tolerance, and scalability than centralized computing systems. Likewise, network technology improved significantly during the 1980s, as demonstrated by the proliferation and availability of high-speed local area networks operating at 10 and 100 million bits per second (Mbps) (e.g., Ethernet and FDDI). These networks allowed dozens, even hundreds, of computers ranging from mainframes, minicomputers, and workstations to PCs to be connected, such that information could be transferred between machines in the millisecond range. Wide area network speeds were slower than LAN speeds, in the 56 Kbps to 1.54 Mbps range.

During this period, most computer systems were running the Berkeley UNIX operating system (referred to as BSD UNIX), developed by the University of California's Berkeley Software Distribution. The Berkeley software distribution became popular because it was integrated with the UNIX operating system and offered more than the basic TCP/IP protocols. For example, in addition to standard TCP/IP applications (such as ftp, telnet, and mail), BSD UNIX offered a set of utilities to run programs or copy files to or from remote computers as if they were local (e.g., rcp, rsh, rexec). Using Berkeley sockets, it was possible to build distributed application programs that ran over several machines. However, the users had to be aware of the activities on each machine; that is, where each machine was located and how to communicate with it. No transparency or support was provided by the operating system; the responsibility for creating, downloading and running the distributed application rested solely with the programmer.

However, between 1985 and 1990, significant progress in distributed computing software tools was achieved. In fact, many of the current message passing tools were introduced during this period, such as Parallel Virtual Machine (PVM), developed at Oak Ridge National Laboratory; ISIS, developed at Cornell University; and Portable Programs for Parallel Processors (P4), just to name a few. These tools have contributed significantly to the widespread use of distributed systems and their applications. With these tools, a large number of specialized distributed systems and applications were deployed in office, medical, engineering, banking and military environments. These distributed applications were commonly developed based on the client-server model.
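To illustrate the style of explicit, programmer-managed distributed programming described above, the following is a minimal sketch of a TCP client written against the Berkeley socket interface. It is not drawn from any particular system; the host name server.example.com and the port number 5000 are hypothetical placeholders, and error handling is kept to the bare minimum. A matching server would create a socket and then call bind(), listen() and accept() before reading the message.

/* A minimal Berkeley-socket TCP client (sketch only).
 * The host name and port number are hypothetical placeholders. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    const char *msg = "hello from a distributed client";
    struct hostent *hp;
    struct sockaddr_in server;
    int sock;

    /* Resolve the (hypothetical) server host name. */
    hp = gethostbyname("server.example.com");
    if (hp == NULL) {
        fprintf(stderr, "unknown host\n");
        return 1;
    }

    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port = htons(5000);   /* hypothetical port */
    memcpy(&server.sin_addr, hp->h_addr_list[0], hp->h_length);

    /* Create a stream (TCP) socket and connect it to the remote machine. */
    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0 || connect(sock, (struct sockaddr *)&server, sizeof(server)) < 0) {
        perror("connect");
        return 1;
    }

    /* The programmer decides explicitly what is sent, to which machine,
     * and when; the operating system provides no transparency. */
    write(sock, msg, strlen(msg));
    close(sock);
    return 0;
}

As the surrounding text notes, everything about the distribution of this program, which machines take part and how they find each other, is the programmer's responsibility; the tools introduced in the late 1980s were developed precisely to lift some of this burden.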
1.3.3 High Performance Distributed Systems: Third Generation

This generation spans the 1990s, the decade in which parallel and distributed computing were unified into one computing environment that we refer to in this book as the high performance distributed system. The emergence of high-speed networks and the evolution of processor technology and software tools have hastened the proliferation and development of high performance distributed systems and their applications. Computing technology has introduced processor chips capable of performing billions of floating point operations per second (Gigaflops) and is swiftly moving towards the trillion floating point operations per second (Teraflops) goal. A current parallel computer such as the IBM ASCI White Pacific computer at Lawrence Livermore National Laboratory in California can compute 7 trillion math operations a second. Comparable performance can now be achieved in high performance distributed systems. For example, a Livermore cluster containing 2,304 2.4-GHz Intel Xeon processors has a theoretical peak speed of 11 trillion floating-point operations per second. In an HPDS environment, the computing resources include several types of computers (supercomputers, parallel computers, workstations and even PCs) that collectively execute the computing tasks of one large-scale application.

Similarly, the use of fiber optics in computer networks has stretched transmission rates from 64 Kilobits per second (Kbps) in the 1970s to over 100 Gigabits per second (Gbps), as shown in Figure 1.2. This has resulted in a significant reduction in the transmission time of data between computers. For example, it took 80 seconds to transmit 1 Mbyte of data over a 100 Kbps network (8,000,000 bits at 100,000 bits per second), but it takes only 8 milliseconds to transmit the same amount of data over a 1 Gbps network. Furthermore, the current movement towards the standardization of terabit networks will make high-speed networks attractive in the development of high performance distributed systems.

Figure 1.2 Evolution of network technology (network bandwidth in Mbit/sec, from 10 Mbps Ethernet and token rings in the early 1980s, through 100 Mbps FDDI/DQDB and gigabit networks in the 1990s, to DWDM-based terabit networks in 2003 and beyond).

The software tools used to develop HPDS applications make the underlying computing resources, whether parallel or distributed, transparent to the application developers; the same parallel application can run in an HPDS without any code modification. Software tools generally fall into three groups on the basis of the service they provide to the programmer. The first class attempts to hide the parallelism from the user completely; these systems consist of parallelizing and vectorizing compilers that exploit the parallelism present in loops and have been developed mainly for vector computers. The second approach uses shared memory constructs as a means for applications to interact and collaborate in parallel. The third class requires the user to explicitly write parallel programs using message passing. During this period, many tools have been developed to assist users in developing parallel and distributed applications at each stage of the software development life cycle; for each stage, there exist tools to assist the users with the activities of that stage.
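As an illustration of the third class of tools mentioned above (explicit message passing), the sketch below uses MPI, a message-passing interface in the same family as the PVM and P4 tools discussed earlier; MPI is used here only as a representative example and is not otherwise covered in this chapter. Every process runs the same program and uses its rank to decide what to do, and data moves only through explicit send and receive calls.

/* A minimal message-passing sketch using MPI, shown only as a
 * representative of the explicit message-passing class of tools.
 * Run with two or more processes. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which process am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many processes?  */

    if (rank == 0) {
        /* Process 0 sends a value to process 1. */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Process 1 receives the value from process 0. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 of %d received %d\n", size, value);
    }

    MPI_Finalize();
    return 0;
}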
Potential applications include parallel and distributed computing, national multimedia information servers (e.g., national or international multimedia yellow pages servers), video-on-demand and computer imaging, just to name a few [Reed and Fujimoto, 1987]. The critical performance criteria for real-time distributed applications are extremely high bandwidth and strict requirements on the magnitude and variance of network delay. Another important class of applications that requires high performance distributed systems is the National Grand Challenge problems. These problems are characterized by massive data sets and complex operations that exceed the capacity of current supercomputers. Figure 1.3 shows the computing and storage requirements of candidate applications for high performance distributed systems.

Figure 1.3 Computing and Storage Requirements of HPDS Applications (memory capacity, ranging from 100 MB to 16 TB, plotted against system speed, from 10 Gflops in 1991 to hundreds of Tflops in 2005-2009, for applications such as 72-hour weather prediction, the Human Genome, fluid turbulence, ocean circulation, quantum chromodynamics, structural biology, pharmaceutical design, vehicle dynamics, crash/fire safety, semiconductor and superconductor modeling, 3D plasma modeling, and neuroscience applications).

1.3.4 Autonomic Computing Systems: Fourth Generation

The autonomic computing concept was introduced in the early 2000s by IBM [www.research.ibm.com/autonomic]. The basic approach is to build computing systems that are capable of managing themselves: systems that can anticipate their workloads and adapt their resources to optimize their performance. This approach is inspired by the human autonomic nervous system, which has the ability to configure, tune and even repair the body without any conscious human involvement. The resources of autonomic systems include a wide range of computing systems and wireless and Internet devices. The applications touch all aspects of our lives, such as education, business, government and defense. The field is still in its infancy, and it is expected to play an important role in defining the next era of computing.

Table 1.1 summarizes the main features that characterize the computing and network resources, the software support, and the applications associated with each distributed system generation.
Table 1.1 Evolution of Distributed Computing Systems

Remote Execution Systems (RES)
- Computing resources: Mainframes and minicomputers; centralized; few and expensive
- Networking resources: Packet switched networks; slow WANs (2400-9600 bps); few networks
- Software/Applications: Terminal emulation; remote login; remote data entry

Distributed Computing Systems (DCS)
- Computing resources: Workstations and PCs, mainframes, minicomputers; distributed; not expensive
- Networking resources: Fast LANs and MANs (100 Mbps; FDDI, DQDB, ATM); fast WANs (1.5 Mbps); large number of networks
- Software/Applications: Network File System (NFS); message-passing tools (PVM, P4, ISIS); on-line transaction systems (airline reservation, online banking)

High Performance Distributed Systems (HPDS)
- Computing resources: Workstations and PCs, parallel/supercomputers; fully distributed; explosive number
- Networking resources: High-speed LANs, MANs and WANs; ATM, Gigabit Ethernet; explosive number of networks
- Software/Applications: Fluid turbulence; climate modeling; video-on-demand; parallel/distributed computing

Autonomic Computing
- Computing resources: Computers of all types (PCs, workstations, parallel/supercomputers), cellular phones and Internet devices
- Networking resources: High-speed LANs, MANs and WANs; ATM, Gigabit Ethernet; wireless and mobile networks
- Software/Applications: Business applications, Internet services, scientific, medical and engineering applications

1.4 Promises and Challenges of High Performance Distributed Systems

The proliferation of high performance workstations and the emergence of high-speed (terabit) networks have attracted a great deal of interest in high performance distributed computing. The driving forces toward this end are (1) the advances in processing technology, (2) the availability of high speed networks, and (3) the increasing research directed toward the development of software support and programming environments for distributed computing. Further, with the increasing requirements for computing power and the diversity of those requirements, it is apparent that no single computing platform will meet them all. Consequently, future computing environments need to utilize the existing heterogeneous computing resources adaptively and effectively. Only high performance distributed systems offer the potential of achieving such an integration of resources and technologies in a feasible manner while retaining the desired usability and flexibility. Realization of this potential requires advances on a number of fronts: processing technology, network technology, and software tools and environments.

1.4.1 Processing Technology

Distributed computing relies to a large extent on the processing power of the individual nodes of the network. Microprocessor performance has been growing at a rate of 35-70 percent per year during the last decade, and this trend shows no indication of slowing down in the current decade. The enormous power of future generations of microprocessors, however, cannot be utilized without corresponding improvements in the memory and I/O systems. Research in main-memory technologies, high-performance disk arrays, and high-speed I/O channels is therefore critical to exploiting the advances in processing technology and to the development of cost-effective high performance distributed computing.

1.4.2 Networking Technology

The performance of distributed algorithms depends to a large extent on the bandwidth and latency of communication among the network nodes. Achieving high bandwidth and low latency involves not only fast hardware, but also efficient communication protocols that minimize the software overhead.
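A common first-order model of communication cost makes the interplay between these two quantities concrete: the time to move a message of m bytes is roughly T(m) = L + 8m/B, where L is the end-to-end latency in seconds and B the bandwidth in bits per second. The short sketch below evaluates this model; the latency and bandwidth figures used are illustrative assumptions, not measurements of any particular network.

/* Evaluate a simple communication cost model T(m) = L + 8m/B.
 * The latency and bandwidth figures below are illustrative only. */
#include <stdio.h>

static double transfer_time(double bytes, double latency_s, double bits_per_s)
{
    return latency_s + (8.0 * bytes) / bits_per_s;
}

int main(void)
{
    double latency = 50e-6;    /* assumed 50 microsecond end-to-end latency */
    double bandwidth = 1e9;    /* assumed 1 Gbps link                       */

    /* For small messages the fixed latency term dominates;
     * for large messages the bandwidth term dominates. */
    printf("1 KB : %.6f s\n", transfer_time(1e3, latency, bandwidth));
    printf("1 MB : %.6f s\n", transfer_time(1e6, latency, bandwidth));
    printf("1 GB : %.6f s\n", transfer_time(1e9, latency, bandwidth));
    return 0;
}

Because the latency term does not shrink as links get faster, raw bandwidth alone is not enough; this is why the latency reduction and latency hiding techniques discussed next matter as much as network speed.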
Developments in high-speed networks will, in the future, provide gigabit bandwidths over local area as well as wide area networks at moderate cost, thus increasing the geographical scope of high performance distributed systems. The problem of providing the required communication bandwidth for distributed computational algorithms is now relatively easy to solve, given the mature state of fiber optics and opto-electronic device technologies. Achieving the necessary low latencies, however, remains a challenge. Reducing latency requires progress on a number of fronts. First, current communication protocols do not scale well to a high-speed environment; to keep latencies low, it is desirable to execute the entire protocol stack, up to the transport layer, in hardware. Second, the communication interface of the operating system must be streamlined to allow direct transfer of data from the network interface to the memory space of the application program. Finally, the speed of light (approximately 5 microseconds per kilometer) poses the ultimate limit to latency. In general, achieving low latency requires a two-pronged approach:

1. Latency Reduction: Minimize protocol-processing overhead by using streamlined protocols executed in hardware and by improving the network interface of the operating system.

2. Latency Hiding: Modify the computational algorithm to hide latency by pipelining communication and computation.

These problems are now perhaps the most fundamental to the success of high-performance distributed computing, a fact that is increasingly being recognized by the research community.

1.4.3 Software Tools and Environments

The development of high performance distributed applications is a non-trivial process and requires a thorough understanding of the application and the architecture. Although an HPDS provides the user with enormous computing power and a great deal of flexibility, this flexibility implies increased degrees of freedom that have to be optimized in order to fully exploit the benefits of the distributed system. For example, during software development the developer is required to select the optimal hardware configuration for the particular application, the best decomposition of the problem on the selected hardware configuration, the best communication and synchronization strategy to be used, and so on. The set of reasonable alternatives that have to be evaluated in such an environment is very large, and selecting the best alternative among them is a non-trivial task. Consequently, there is a need for a set of simple and portable software development tools that can assist the developer in appropriately distributing the application computations to make efficient use of the underlying computing resources. Such a set of tools should span the software life cycle and must support the developer during each stage of application development, from the specification and design formulation stages, through the programming, mapping, distribution, scheduling, tuning and debugging stages, up to the evaluation and maintenance stages.

1.5 Summary

Distributed systems applications and deployment have been growing at a fast pace to cover many fields, such as education, industry, finance, medicine and the military.
Distributed systems have the potential to offer many benefits compared to centralized computing systems, including increased performance, reliability and fault tolerance, extensibility, cost-effectiveness, and scalability. However, designing distributed systems is more complex than designing a centralized system because of the asynchronous behavior and complex interaction of their components, their heterogeneity, and their use of a communication network for information exchange and interaction. Distributed systems provide designers with many options to choose from, and poor designs may lead to worse performance than centralized systems.

Many researchers have studied distributed systems and used different names and features to characterize them. Some researchers used the logical unit concept to organize and characterize distributed systems, while others used multiplicity, distribution, or transparency. However, there is a growing consensus to define a distributed system as a collection of resources and/or services interconnected by a communication network, where these resources and services collaborate to provide an integrated solution to an application or a service.

Distributed systems have been around for approximately three decades and have been evolving since their inception in the 1970s. One can describe their evolution in terms of four generations: Remote Execution Systems, Distributed Computing Systems, High Performance Distributed Systems, and Autonomic Computing. Autonomic computing systems and high performance distributed systems are the focus of this book. They efficiently and adaptively utilize a wide range of heterogeneous computing resources, networks, and software tools and environments. These systems will change their computing environment dynamically to provide the computing, storage, and connectivity required by the large scale applications encountered in business, finance, health care, and the scientific and engineering fields.

1.6 PROBLEMS

1. Many claims have been attributed to distributed systems since their inception. Enumerate these claims and then explain which of them can be achieved using current technology, which will be achieved in the near future, and which will not be possible at all.

2. What features or services can be used to define and characterize any computing system? Use these features or properties to compare and contrast computing systems built on the basis of:
• A single operating system in a parallel computer
• A network operating system
• A distributed system
• A high performance distributed system

3. Describe the main advantages and disadvantages of distributed systems when compared with centralized computing systems.

4. Why is it difficult to design general-purpose reliable distributed systems?

5. What are the main driving forces toward the development of high performance distributed computing environments? Describe four classes of applications that will be enabled by high performance distributed systems. Explain why these applications could not run on second-generation distributed systems.

6. What are the main differences between distributed systems and high performance distributed systems?

7. Investigate the evolution of distributed systems and study their characteristics and applications. Based on this study, can you identify the types of applications and any additional features associated with each generation of distributed systems as discussed in Section 1.3?
8. What are the main challenges facing the design and development of large-scale high performance distributed systems that comprise 100,000 or more resources and/or services? Discuss any potential techniques and technologies that can be used to address these challenges.

References

1. Mullender, S., Distributed Systems, First Edition, Addison-Wesley, 1989.
2. Mullender, S., Distributed Systems, Second Edition, Addison-Wesley, 1993.
3. Patterson, D.A., and Hennessy, J.L., Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann Publishers, 1994.
4. Liebowitz, B.H., and Carson, J.H., Multiple Processor Systems for Real-Time Applications, Prentice-Hall, 1985.
5. Umar, A., Distributed Computing, PTR Prentice-Hall, 1993.
6. Enslow, P.H., "What is a 'Distributed' Data Processing System?", IEEE Computer, January 1978.
7. Kleinrock, L., "Distributed Systems", Communications of the ACM, November 1985.
8. Lorin, H., Aspects of Distributed Computer Systems, John Wiley and Sons, 1980.
9. Tanenbaum, A.S., Modern Operating Systems, Prentice-Hall, 1992.
10. ANSA, ANSA Reference Manual, Release 0.03 (Draft), Alvey Advanced Network Systems Architectures Project, 24 Hills Road, Cambridge CB2 1JP, UK, 1997.
11. Bell, G., "Ultracomputers: A Teraflop Before Its Time", Communications of the ACM, pp. 27-47, August 1992.
12. Geist, A., PVM 3 User's Guide and Reference Manual, Oak Ridge National Laboratory, 1993.
13. Birman, K., and Marzullo, K., "ISIS and the META Project", Sun Technology, Summer 1989.
14. Birman, K., et al., ISIS User Guide and Reference Manual, Isis Distributed Systems, Inc., 111 South Cayuga St., Ithaca, NY, 1992.
15. Spragins, J.D., Hammond, J.L., and Pawlikowski, K., Telecommunications: Protocols and Design, Addison-Wesley, 1991.
16. McGlynn, D.R., Distributed Processing and Data Communications, John Wiley and Sons, 1978.
17. Tashenberg, C.B., Design and Implementation of Distributed-Processing Systems, American Management Associations, 1984.
18. Hwang, K., and Briggs, F.A., Computer Architecture and Parallel Processing, McGraw-Hill, 1984.
19. Halsall, F., Data Communications, Computer Networks and Open Systems, Third Edition, Addison-Wesley, 1992.
20. Danthine, A., and Spaniol, O., High Performance Networking, IV, International Federation for Information Processing, 1992.
21. Borghoff, U.M., Catalog of Distributed File/Operating Systems, Springer-Verlag, 1992.
22. LaPorta, T.F., and Schwartz, M., "Architectures, Features, and Implementations of High-Speed Transport Protocols", IEEE Network Magazine, May 1991.
23. Kung, H.T., "Gigabit Local Area Networks: A Systems Perspective", IEEE Communications Magazine, April 1992.
24. Comer, D.E., Internetworking with TCP/IP, Volume I, Prentice-Hall, 1991.
25. Tanenbaum, A.S., Computer Networks, Prentice-Hall, 1988.
26. Coulouris, G.F., and Dollimore, J., Distributed Systems: Concepts and Design, Addison-Wesley, 1988.
27. Bagley, "Don't have this one", October 1993.
28. Stankovic, J.A., "A Perspective on Distributed Computer Systems", IEEE Transactions on Computers, December 1984.
29. Andrews, G., "Paradigms for Process Interaction in Distributed Programs", ACM Computing Surveys, March 1991.
30. Chin, R.S., and Chanson, S.T., "Distributed Object-Based Programming Systems", ACM Computing Surveys, March 1991.
31. The Random House College Dictionary, Random House, 1975.
32. Shatz, S., Development of Distributed Software, Macmillan, 1993.
33. Jain, N., Schwartz, M., and Bashkow, T.R., "Transport Protocol Processing at GBPS Rates", Proceedings of the SIGCOMM Symposium on Communication Architectures and Protocols, August 1990.
34. Reed, D.A., and Fujimoto, R.M., Multicomputer Networks: Message-Based Parallel Processing, MIT Press, 1987.
35. Bach, M.J., The Design and Implementation of the UNIX Operating System, Prentice-Hall, 1986.
36. Ross, F.E., "An Overview of FDDI: The Fiber Distributed Data Interface", IEEE Journal on Selected Areas in Communications, pp. 1043-1051, September 1989.

Chapter 2
Distributed System Design Framework

Objective of this chapter: Chapter two presents a design methodology model for distributed systems that simplifies the design and development of such systems. In addition, we provide an overview of the design issues and technologies that can be used to build distributed systems.

Key Terms
Network, protocol, interface, distributed system design, WAN, MAN, LAN, LPN, DAN, circuit switching, packet switching, message switching, server model, pool model, integrated model, hybrid model.

2.1 Introduction

In Chapter 1, we reviewed the main characteristics and services provided by distributed systems and their evolution. It is clear from the previous chapter that there is a great deal of confusion about what constitutes a distributed system, its main characteristics and services, and its design. In this chapter, we present a framework that can be used to identify the design principles and the technologies needed to implement the components of any distributed computing system. We refer to this framework as the Distributed System Design Model (DSDM).

Generally speaking, the design process of a distributed system involves three main activities: 1) designing the communication network that enables the distributed system computers to exchange information, 2) defining the system structure (architecture) and the services that enable multiple computers to act and behave as a system rather than a collection of computers, and 3) defining the distributed programming techniques used to develop distributed applications. Based on this notion of the design process, the Distributed System Design Model can be described in terms of three layers (see Figure 2.1): 1) the Network, Protocol, and Interface (NPI) layer; 2) the System Architecture and Services (SAS) layer; and 3) the Distributed Computing Paradigms (DCP) layer. In this chapter, we describe the functionality and the design issues that must be taken into consideration during the design and implementation of each layer. Furthermore, we organize the book chapters into three parts, where each part corresponds to one layer in the DSDM.

Figure 2.1 Distributed System Design Model. The model has three layers: the Distributed Computing Paradigms layer (computation models: functional parallel and data parallel; communication models: message passing and shared memory), the System Architecture and Services (SAS) layer (architecture models and system level services), and the Network, Protocol and Interface layer (networks and communication protocols).

The communication network, protocol and interface (NPI) layer describes the main components of the communication system that will be used for passing control and information among the distributed system resources.
This layer is decomposed into three sub-layers: Network Types, Communication Protocols, and Network Interfaces. The distributed system architecture and services (SAS) layer defines the architecture and the system services (distributed file system, concurrency control, redundancy management, load sharing and balancing, security service, etc.) that must be supported in order for the distributed system to behave and function as a single-image computing system. The distributed computing paradigms (DCP) layer represents the programmer's (user's) perception of the distributed system. This layer focuses on the programming paradigms that can be used to develop distributed applications. Distributed computing paradigms can be broadly characterized based on their computation and communication models. Parallel and distributed computations can be described in terms of two paradigms: the functional parallel and data parallel paradigms. In the functional parallel paradigm, different computations are assigned to different computers. In the data parallel paradigm, all the computers run the same program, Single Program Multiple Data (SPMD), but each operates on a different data stream. One can also characterize parallel and distributed computing based on the technique used for inter-task communication, giving two main models: the message passing and distributed shared memory models. In the message passing paradigm, tasks communicate with each other by messages, while in distributed shared memory they communicate by reading from and writing to a global shared address space. In the following subsections, we describe the design issues and technologies associated with each layer in the DSDM.

2.2 Network, Protocol and Interface

The first layer (from the bottom up) in the distributed system design model addresses the issues related to designing the computer network, communication protocols, and host interfaces. The communication system represents the underlying infrastructure used to exchange data and control information among the logical and physical resources of the distributed system. Consequently, the performance and reliability of a distributed system depend heavily on the performance and reliability of its communication system. Traditionally, distributed computing systems relied entirely on local area networks to implement the communication system; wide area networks were not considered seriously because of their high latency and low bandwidth. However, emerging technology has changed that completely: WANs now operate at terabit per second (Tbps) transmission rates, as shown in Figure 1.2.

A communication system can be viewed as a collection of physical and logical components that jointly perform the communication tasks. The physical components (network devices) transfer data between the host memory and the communication medium. The logical components provide services for message assembly and/or disassembly, buffering, formatting, routing and error checking. Consequently, the design of a communication system involves defining the resources required to implement the functions associated with each component. The physical components determine the type of computer network to be used (LANs, MANs, WANs), the type of network topology (fully connected, bus, tree, ring, mixed, and random), the type of communication medium (twisted pair, coaxial cable, fiber optics, wireless, and satellite), and how the host accesses the network resources.
The logical components determine the type of communication service (packet switching, message switching, circuit switching), the type of information carried (data, voice, facsimile, image and video), the management technique (centralized and/or distributed), and the type of communication protocol. The NPI layer addresses the design issues and network technologies available to implement the communication system components using three sub-layers: 1) Network Type, which covers the design issues related to implementing the physical computer network; 2) Communication Protocols, which covers communication protocol design and its impact on distributed system performance; and 3) Host Network Interface, which covers the design issues and techniques used to implement computer network interfaces.

2.2.1 Network Type

A computer network is essentially any system that provides communication between two or more computers. These computers can be in the same room, or can be separated by several thousand miles. Computer networks that span large geographical distances are fundamentally different from those that span short distances. To help characterize the differences in capacity and intended use, communication networks are generally classified according to distance into five categories: 1) Wide Area Networks (WANs), 2) Metropolitan Area Networks (MANs), 3) Local Area Networks (LANs), 4) Local Peripheral Networks (LPNs), and 5) Desktop Area Networks (DANs).

Wide area networks (WANs): WANs are intended for use over large distances and may include several national and international private and/or public data networks. There are two types of WANs: packet switched and circuit switched networks. WANs used to operate at slower speeds (e.g., 1.54 Mbps) than LAN technologies and have high propagation delays. However, recent advances in fiber optic technology and wavelength division multiplexing, together with the wide deployment of fiber optics in the backbone network infrastructure, have made their transmission rates higher than those of LANs; in fact, WAN capacity is now approaching petabit per second (Pbps) rates.

Metropolitan area networks (MANs): MANs span intermediate distances and operate at medium-to-high speeds. As the name implies, a MAN can span a large metropolitan area and may or may not use the services of telecommunications carriers. MANs introduce less propagation delay than WANs, and their transmission rates range from 56 Kbps to 100 Mbps.

Local area networks (LANs): LANs are normally used to interconnect computers and different types of data terminal equipment within a single building, a group of buildings or a campus area. LANs provide the highest speed connections (e.g., 100 Mbps, 1 Gbps) between computers because they cover shorter distances than WANs and MANs. Most LANs use a broadcast communication medium in which each packet is transmitted to all the computers in the network.

Local peripheral networks (LPNs): LPNs can be viewed as a special type of LAN [Tolmie and Tanlawy, 1994; Stallings et al., 1994] covering the area of a room or a laboratory. An LPN is mainly used to connect the peripheral devices (disk drives, tape drives, etc.) with the computers located in that room or laboratory. Traditionally, input/output devices were confined to one computer system; however, the use of high speed networking standards (e.g., HIPPI and Fibre Channel) to implement LPNs has enabled remote access to input/output devices.
Desktop area networks (DANs): The DAN is another interesting concept, which aims at replacing the proprietary bus within a computer with a standard network connecting all the components (memory, network adapter, camera, video adapter, sound adapter, etc.). The DAN concept is becoming even more important with the latest developments in palm computing devices; palm computers do not need to have large amounts of memory or sound/video capabilities, since all of these can be provided by servers connected to the palm devices over a high speed communication link.

Network Topology

The topology of a computer network can be divided into five types: bus, ring, hub-based, fully connected and random, as shown in Figure 2.2.

Figure 2.2 Different Types of Network Topologies: (a) bus network, (b) ring network, (c) hub-based network, (d) fully connected network, (e) random network.

In a bus-based network, the bus is time-shared among the computers connected to it, and control of the bus is either centralized or distributed. The main limitation of the bus topology is its scalability; when the number of computers sharing the bus becomes large, contention increases significantly, leading to unacceptable communication delays. In a ring network, the computers are connected by point-to-point communication links that form a closed loop. The main advantages of the ring topology include a simplified routing scheme, fast connection setup, a cost proportional to the number of interfaces, and high throughput [Weitzman, 1980; Halsall, 1992]. However, the main limitation of the ring topology is its reliability, which can be improved by using double rings. In a hub-based or switch-based network, there is one central routing switch that connects an incoming message on one of its input links to its destination through one of the switch output links. This topology can be made hierarchical, where a slave switch acts as a master switch for another cluster, and so on. With the rapid deployment of switch-based networks (e.g., Gigabit Ethernet), this topology is expected to play an important role in designing high performance distributed systems. In a fully connected network, every computer can reach any other computer in one hop; however, the cost is prohibitively high, especially when the number of computers to be connected is large. The random topology is a combination of the other types, which leads to an ad-hoc topology.
However, the disadvantage of Circuit switching is the cost associated with circuit setup and release and the low utilization of network resources. Message Switching: In a message switching system, the entire message is transmitted along a predetermined path between source and destination computers. The message moves in a store-and-forward manner from one computer to another until it reaches its destination. The message size is not fixed and it could vary from few kilobytes to several megabytes. Consequently, the intermediate communication nodes should have enough storage capacity to store the entire message as being routed to its destination. Message switching could result in long delays when the network traffic is heavy and consist of many long messages. Furthermore, the resource utilization is inefficient and it provides limited flexibility to adjust to fluctuations in network conditions [Weitzman, 1980]. Packet Switching: In this approach, messages are divided into small fixed size pieces, called packets that are multiplexed onto the communications links. A packet, which usually contains only a few hundred bytes of data, is divided into two parts: data and header parts. The header part carries routing and control information that is used to identify the source computer, packet type, and the destination computer; this service is similar to the postal service. Users place mail packages (packets) into the network nodes (mailboxes) that identify the source and the destination of the package. The postal workers then use whatever paths they deem appropriate to deliver the package. The actual path traveled by the package is not guaranteed. Like the postal service, a packet-switched network uses best-effort delivery. Consequently, there is no guarantee that the packet will ever be delivered. Also there are typically several intermediate nodes between the source and the destination that will store and forward the packets. As a result the packets sent from one source may not take the same route to the destination, nor may they be delivered in the same transmission order, and they may be duplicated. The main 6 High Performance Distributed Systems advantage of Packet switching is that the communication links are shared by all the network computers and thus improve the utilization of the network resources. The disadvantage is that as network activity increases, each machine sharing the connection receives less of the total connection capacity, which results in a slower communication rate. Furthermore, there is no guarantee that the packets will be received in the same order of their transmission or without any error or duplication. The main difference between circuit switching and packet switching is that in circuit switching there is no need for intermediate network buffer. Circuit switching provides a fast technique to transmit large data, while packet switching is useful to transmit small data blocks between a random number of geographically dispersed users. Another variation of these services is the virtual circuit switching which combines both packet and circuit switching in its service. The communication is done by first establishing the connection, transferring data, and finally disconnecting the connection. However, during the transmission phase, the data is transferred as small packets with headers that define only the virtual circuit these packets are using. In this service, we have the advantages of both circuit switching and packet switching. 
2.2.2 Communication Protocols

A protocol is a set of precisely defined rules and conventions for communication between two parties. A communication protocol defines the rules and conventions that will be used by two or more computers on the network to exchange information. In order to manage the complexity of the communication software, a hierarchy of software layers is commonly used in its implementation. Each layer of the hierarchy is responsible for a well-defined set of functions that can be implemented by a specific set of protocols. The Open Systems Interconnection (OSI) reference model, which was proposed by the International Standards Organization (ISO), has seven layers as shown in Figure 2.3. In what follows, we briefly describe the functions of each layer of the OSI reference model from the bottom up [Jain, 1993].

Figure 2.3 The OSI reference model (the application, presentation and session layers form the application component; the transport layer forms the transport component; the network, data link and physical layers form the network component)

Physical Layer: This layer is concerned with transmitting raw bits over a communication channel. The physical layer recognizes only individual bits and cannot recognize characters or multi-character frames. The design issues here largely deal with mechanical, electrical and procedural interfaces and the physical transmission medium. The physical layer consists of the hardware that transmits sequences of binary data by analog or digital signaling, using electric signals, light signals or electromagnetic signals.

Data Link Layer: This layer defines the functional and procedural methods used to transfer data between two neighboring communication nodes. It includes mechanisms to deliver data reliably between two adjacent nodes, to group bits into frames, and to synchronize the data transfer in order to limit the flow of bits from the physical layer. In local area networks, the data link layer is divided into two sub-layers: the medium access control (MAC) sub-layer, which defines how to share the single physical transmission medium among multiple computers, and the logical link control (LLC) sub-layer, which defines the protocol used to achieve error control and flow control. LLC protocols can be either bit-oriented or character-oriented; however, most networks use bit-oriented protocols [Tanenbaum, 1988].

Network Layer: The network layer addresses the routing scheme used to deliver packets from the source to the destination. This routing scheme can be either static (the routing path is determined a priori) or dynamic (the routing path is determined based on network conditions). Furthermore, this layer provides techniques to prevent congestion and to remove it once it occurs in the network; congestion occurs when some nodes receive more packets than they can process and route. In wide area networks, where the source and destination computers could be interconnected by different types of networks, the network layer is responsible for internetworking, that is, converting packets from one network format to another. In a single local area network with a broadcast medium, the network layer is redundant and can be eliminated, since packets can be transmitted from any computer to any other computer in just one hop [Coulouris and Dollimore, 1988]. In general, the network layer provides two types of services to the transport layer: connection-oriented and connectionless services. The connection-oriented service uses circuit switching, while the connectionless service uses the packet switching technique.
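As a small illustration of the route computation that a static routing scheme at the network layer might perform, the sketch below runs Dijkstra's shortest-path algorithm over an invented topology with assumed link costs; real routing protocols add neighbor discovery, cost metrics and route distribution on top of this kind of computation.

```python
import heapq

def dijkstra(graph: dict[str, dict[str, float]], source: str) -> dict[str, float]:
    """Return the least-cost distance from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                      # stale queue entry
        for neighbor, cost in graph[node].items():
            candidate = d + cost
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist

if __name__ == "__main__":
    # Hypothetical topology: nodes A-D with symmetric link costs.
    topology = {
        "A": {"B": 1.0, "C": 4.0},
        "B": {"A": 1.0, "C": 2.0, "D": 5.0},
        "C": {"A": 4.0, "B": 2.0, "D": 1.0},
        "D": {"B": 5.0, "C": 1.0},
    }
    print(dijkstra(topology, "A"))   # least cost A -> D is 4.0, via B and C
```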
Transport Layer: This is an end-to-end layer that allows two processes running on two remote computers to exchange information. The transport layer provides the higher-level processes with efficient, reliable and cost-effective communication services. These services allow the higher-level layers to be developed independently of the underlying network-technology layers. The transport layer has several critical functions related to achieving reliable data delivery to the higher layer, such as detecting and correcting erroneous packets, delivering packets in order, and providing a flow control mechanism. Depending on the type of computer network being used, achieving these functions may or may not be trivial. For instance, operating over a packet-switching network with widely varying inter-packet delays presents a challenging task for efficiently delivering ordered data packets to the user; in such a network, packets can experience excessive delays, which makes deciding on the cause of a delay very difficult. The delay could be caused by a network failure or by network congestion. The transport protocol's task is to resolve this issue, which could by itself be a time-consuming task.

The session, presentation and application layers form the upper three layers of the OSI reference model. In contrast to the lower four layers, which are concerned with providing reliable end-to-end communication, the upper layers are concerned with providing user-oriented services. They take the error-free channel provided by the transport layer and add features that are useful to a wide variety of user applications.

Session Layer: This layer provides mechanisms for organizing and structuring dialogues between application layer processes. For example, the user can select the type of synchronization and control needed for a session, such as alternate two-way or simultaneous operation, the establishment of major and/or minor synchronization points, and techniques for starting data exchange.

Presentation Layer: The main task of this layer is the syntax used for representing data; it is not concerned with the semantics of the data. For example, if the two communicating computers use different data representation schemes, this layer's task is to transform data from the format used in the source computer into a standard data format before transmission. At the destination computer, the received data is transformed from the standard format into the format used by the destination computer. Data compression and encryption for network security are issues of this layer as well [Tanenbaum, 1988; Coulouris and Dollimore, 1988].

Application Layer: This layer supports end-user application processes. It contains service elements (protocols) to support application processes such as job management functions, file transfer protocols, mail services, programming language support, virtual terminals and virtual file systems, just to name a few.
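The layered structure described above can be pictured as each layer wrapping the data it receives from the layer above with its own header, and the receiving side stripping those headers in reverse order. The sketch below is an illustration only, not any real protocol stack: the header tags and field layout are invented for the example.

```python
def transport_encapsulate(payload: bytes, src_port: int, dst_port: int) -> bytes:
    header = f"TP|{src_port}|{dst_port}|".encode()
    return header + payload

def network_encapsulate(segment: bytes, src_addr: str, dst_addr: str) -> bytes:
    header = f"NET|{src_addr}|{dst_addr}|".encode()
    return header + segment

def link_encapsulate(packet: bytes, src_mac: str, dst_mac: str) -> bytes:
    header = f"DL|{src_mac}|{dst_mac}|".encode()
    return header + packet

def strip_header(frame: bytes, expected_tag: str) -> bytes:
    tag, _src, _dst, rest = frame.split(b"|", 3)
    assert tag.decode() == expected_tag
    return rest

if __name__ == "__main__":
    data = b"hello from the application layer"
    # Sender: application -> transport -> network -> data link
    frame = link_encapsulate(
        network_encapsulate(
            transport_encapsulate(data, 5000, 80), "10.0.0.1", "10.0.0.2"),
        "aa:bb:cc:01", "aa:bb:cc:02")
    # Receiver: data link -> network -> transport -> application
    received = strip_header(strip_header(strip_header(frame, "DL"), "NET"), "TP")
    assert received == data
```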
2.2.3 Network Interfaces

The main function of the host-network interface is to transmit data from the host to the network and to deliver the data received from the network to the host. Consequently, the host-network interface interacts with upper layer software to perform functions related to message assembly and disassembly, formatting, routing and error checking. With the advances in processing and memory technology, these communication functions can now be implemented in the hardware of the network interface. A tradeoff is usually made regarding how these functions are distributed between the host and the network interface: the more functions allocated to the network interface, the less load imposed on the host to perform the communication functions; however, the cost of the network interface increases. The network interface can be a passive device used for temporarily storing the received data. In this case, the network interface is under the control of the processor, which performs all the necessary functions to transfer the received message to the destination remote process. A more sophisticated network interface can execute most of the communication functions, such as assembling complete message packets, passing these packets to the proper buffer, performing flow control, managing the transmission and reception of message packets, and interrupting the host when the entire message has been received. In the coming chapters, we will discuss in more detail the design issues in host-network interfaces.

2.3 Distributed System Architectures and Services

The main issues addressed in this layer are related to the system architecture and the functions to be offered by the distributed system. The architecture of a distributed system identifies the main hardware and software components of the system and how they interact with each other to deliver the services provided by the system. In addition to defining the system architecture and how its components interact, this layer also defines the system services and functions that are required to run distributed applications. The architecture of a distributed system can be described in terms of several architectural models that define the system structure and how the components collaborate and interact with each other. The components of a distributed system must be independent and able to provide a significant service or function to the system users and applications. In what follows, we describe the architectural models and the system services and functions that should be supported by distributed systems.

2.3.1 Architectural Models

The distributed system architectural models can be broadly grouped into four models: the server model, the pool model, the integrated model, and the hybrid model [Coulouris and Dollimore, 1988].

Server Model

The majority of distributed systems that have been built so far are based on the server model (which is also referred to as the workstation or client/server model). In this model each user is provided with a workstation to run the application tasks. The need for workstations is primarily driven by the user requirements of a high-quality graphical interface and guaranteed application response time. Furthermore, the server model supports sharing data between users and applications (e.g., shared file servers and directory servers). The server model consists of workstations distributed across a building or a campus and connected by a local area network (see Figure 2.4). Some of the workstations could be located in offices, and thus be tied to a single user, whereas others may be located in public areas where they are used by different users.
In both cases, at any instant of time, a workstation is either sitting idle or has a user logged into it.

Figure 2.4 Server Model (workstations connected through a network to file, printing and computing servers)

In this architecture, we need communication software to enable the applications running on the workstations to access the system servers. The term server refers to application software, typically running on a fast computer, that offers a set of services. Examples of such servers include compute engines, database servers, authentication/authorization servers, gateway servers, and print servers. For example, the service offered by an authentication/authorization server is to validate user identities and authorize access to system resources. In this model, a client sends one or more requests to the server and then waits for a response. Consequently, distributed applications are written as a combination of clients and servers. The programming in this model is called synchronous programming. The server can be implemented in one of two ways: as a single server or as a concurrent server. If the server is implemented as a single thread of control, it can support only one request at a time; that is, a client request that finds its server busy must wait for all the earlier requests to complete before it can be processed. To avoid this problem, important servers are typically implemented as concurrent servers; the services are developed using multiple lightweight processes (which we refer to interchangeably as threads) in order to process several requests concurrently. In a concurrent server, after a request is received, a new thread (child thread) is created to perform the requested service, whereas the parent thread keeps listening on the same port for the next service request (a minimal sketch of such a server is given below). It is important to note that the client machine participates significantly in the computations performed in this model; that is, not all the computations are done at the server, and the workstations are not acting merely as input/output devices.
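The following sketch shows one way a concurrent server of the kind described above could be structured, assuming a simple TCP echo service on port 9000 (the port and the "service" itself are chosen purely for illustration). The parent thread keeps accepting new requests while a child thread serves each client.

```python
import socket
import threading

def handle_client(conn: socket.socket, addr) -> None:
    """Child thread: serve one client, then close the connection."""
    with conn:
        while True:
            data = conn.recv(1024)
            if not data:          # client closed the connection
                break
            conn.sendall(data)    # echo the request back as the "service"

def serve(host: str = "0.0.0.0", port: int = 9000) -> None:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        listener.bind((host, port))
        listener.listen()
        while True:
            conn, addr = listener.accept()             # parent thread waits for requests
            worker = threading.Thread(target=handle_client,
                                      args=(conn, addr),
                                      daemon=True)     # child thread serves the request
            worker.start()

if __name__ == "__main__":
    serve()
```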
Pool Model

An alternative approach to organizing distributed system resources is to construct a processor pool. The processor pool can be a rack full of CPUs or a set of computers located in a centralized location. The pool resources are dynamically allocated to user processes on demand. In the server model, the processing power of idle workstations cannot be exploited in a straightforward manner. In the processor pool model, however, a user process is allocated as many CPUs or computing resources as it needs, and when that process finishes, all its computing resources are returned to the pool so other processes can use them. There is no concept of ownership here; all the processors belong equally to every process in the system. Consequently, the processor pool model does not need the additional load balancing software that the server model requires to improve system utilization and performance, especially when the number of computing resources is large; when the number is large, the probability of finding computers idle or lightly loaded is typically high. In the pool model, programs are executed on a set of computers managed as a processor service. Users are provided with terminals or low-end workstations that are connected to the processor pool via a computer network, as shown in Figure 2.5.

Figure 2.5 Processor Pool Model (terminals connected through a network to a processor pool, supercomputers, servers, multicomputers, workstations and processor arrays)

The processor pool model provides better utilization of resources and increased flexibility when compared to the server model. In addition, programs developed for centralized systems are compatible with this model and can be easily adapted. Finally, processor heterogeneity can be easily incorporated into the processor pool. The main disadvantages of this model are the increased communication between the application program and the terminal, and the limited capabilities provided by the terminals. However, the wide deployment of high speed networks (e.g., Gigabit Ethernet) makes remote access to the processor pool resources (e.g., supercomputers, high speed specialized servers) attractive and cost-effective. Furthermore, the introduction of hand-held computers (palm devices, cellular phones, etc.) will make this model even more important; we can view the hand-held computers as terminals, while most of the computations and services (e.g., Application Service Providers) are provided by the pool resources.

Integrated Model

The integrated model brings many of the advantages of networked resources and centralized computing systems to distributed systems by allowing users to access different system resources in a manner similar to that used in a centralized, single-image, multi-user computing system. In this model each computer is provided with appropriate software so it can perform both the server and the client roles. The system software located in each computer is similar to the operating system of a centralized multi-user computing system, with the addition of networking software. In the integrated model, the set of computing resources forming the distributed system is managed by a single distributed operating system that makes them appear to the user as a single computer system, as shown in Figure 2.6. The individual computers in this model have a high degree of autonomy and run a complete set of standard software. A global naming scheme that is supported across the distributed system allows individual computers to share data and files without regard to their location. The computing and storage resources required to run user applications or processes are determined at runtime by the distributed operating system such that the system load is balanced and certain system performance requirements are achieved. However, the main limitation of this approach is the requirement that user processes across the whole system must interact using only one uniform software system (that is, the distributed operating system). As a result, this approach requires that the distributed operating system be ported to every type of computer available in the system. Further, existing applications must be modified to support and interoperate with the services offered by the distributed operating system. This requirement limits the scalability of this approach for developing distributed systems with a large number of heterogeneous logical and physical resources.

Figure 2.6 Integrated Model (workstations, servers, a supercomputer and terminals attached through a concentrator, all connected by a network)

Hybrid Model

This model can be viewed as a combination of two or more of the architectural models discussed above. For example, the server and pool models can be combined to organize the access to and use of the distributed system resources. The Amoeba system is an example of such a system.
In this model, users run interactive applications on their workstations to improve user response time, while other applications run on several processors taken from the processor pool. By combining these two models, the hybrid model gains several advantages: providing the computing resources needed for a given application, parallel processing of user tasks on the pool's processors, and the ability to access the system resources from either a terminal or a workstation.

2.3.2 System Level Services

The design of a distributed computing environment can follow two approaches: top-down or bottom-up. The first approach is desirable when the functions and services of a distributed system are well defined. It is typically used when designing special-purpose distributed applications. The second approach is desirable when the system is built using existing computing resources running traditional operating systems. The structure of existing operating systems (e.g., Unix) is usually designed to support a centralized time-sharing environment and does not support the distributed computing environment. An operating system is the software that provides the functions that allow resources to be shared between tasks, and provides a level of abstraction above the computer hardware that facilitates the use of the system by users and application programs. However, the required system-level services are greater in functionality than what normally exists in an operating system. Therefore, a new set of system-wide services must be added on top of the individual operating systems in order to run distributed applications efficiently. Examples of such services include the distributed file system, load balancing and scheduling, concurrency control, redundancy management, and security services, just to name a few. The distributed file system allows the distributed system users to transparently access and manipulate files regardless of their locations. Load scheduling and balancing involves distributing the load across the overall system resources such that the overall load of the system is well balanced. Concurrency control allows concurrent access to the distributed system resources as if they were accessed sequentially (serializable access). Redundancy management addresses the consistency and integrity issues of the system resources when some system files or resources are redundantly distributed across the system to improve performance and system availability. The security service involves securing and protecting the distributed system services and operations by providing the proper authentication, authorization, and integrity schemes. In the Part II chapters, we will discuss in detail the design and implementation issues of these services.

2.4 Distributed Computing Paradigms

In the first layer of the distributed system design model, we address the issues related to designing the communication system, while in the second layer, we address the system architecture and the system services to be supported by a distributed system. In the third layer, we address the programming paradigms and communication models needed to develop parallel and distributed applications. The distributed computing paradigms can be classified according to two models: the computation and communication models.
The computation model defines the programming model used to develop parallel and distributed applications, while the communication model defines the techniques used by processes or applications to exchange control and data information. The computation model describes the techniques available to users to decompose the tasks of a given distributed application and run them concurrently. In broad terms, there are two computation models: data parallel and functional parallel. The communication models can be broadly grouped into two types: message passing and shared memory. The underlying communication system can support either one or both of these paradigms. However, supporting one communication paradigm is sufficient to support the other; message passing can be implemented using shared memory and vice versa. The type of computation and communication paradigms used determines the type of distributed algorithms that can be used to run a given distributed application efficiently; what is good for a message passing model is not necessarily good when it is implemented using a shared memory model.

2.4.1 Computation Models

Functional Parallel Model

In this model, the computers involved in a distributed application execute different threads of control or tasks, and interact with each other to exchange information and synchronize the concurrent execution of their tasks. Different terms have been used in the literature to describe this type of parallelism, such as control parallelism and asynchronous parallelism. Figure 2.7(a) shows the task graph of a distributed application with five different functions (F1-F5). If this application is programmed based on the functional parallel model and run on two computers, one can allocate functions F1 and F3 to computer 1 and functions F2, F4 and F5 to computer 2 (see Figure 2.7(b)). In this example, the two computers must synchronize their executions such that computer 2 can execute function F5 only after functions F2 and F4 have been completed and computer 2 has received the partial results from computer 1. In other words, the parallel execution of these functions must be serializable; that is, the parallel execution of the distributed application produces results identical to the sequential execution of this application [Casavant, et al, 1996; Quinn, 1994].

Figure 2.7 (a) A task graph with five functions, (b) the Functional Parallel Model with the functions split between two computers sharing data, and (c) the SPMD Data Parallel Model with the shared data partitioned between the two computers

Another variation of the functional parallel model is the host-node programming model. In this model, the user writes two programs: the host program and the node program. The host program controls and manages the concurrent execution of the application tasks by downloading the node program, as well as the required data, to each computer. In addition, the host program receives the results from the node programs. The node program contains most of the compute-intensive tasks of the application. The number of computers that will run the node program is typically determined at runtime.
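The sketch below illustrates the functional parallel decomposition of Figure 2.7(b), using two local processes to stand in for the two computers. The functions F1-F5 are placeholder computations invented for the example, and the queue models the message carrying computer 1's partial result to computer 2, which must arrive before F5 can execute.

```python
from multiprocessing import Process, Queue

def f1(x): return x + 1
def f2(x): return x * 2
def f3(x): return x - 3
def f4(x): return x ** 2
def f5(a, b): return a + b          # needs results from both computers

def computer1(data, to_computer2: Queue) -> None:
    partial = f3(f1(data))          # executes F1 then F3
    to_computer2.put(partial)       # send partial result to computer 2

def computer2(data, from_computer1: Queue, result: Queue) -> None:
    own = f4(f2(data))              # executes F2 then F4
    partial = from_computer1.get()  # synchronize: wait for computer 1
    result.put(f5(own, partial))    # F5 runs only after both are done

if __name__ == "__main__":
    link, out = Queue(), Queue()
    p1 = Process(target=computer1, args=(10, link))
    p2 = Process(target=computer2, args=(10, link, out))
    p1.start(); p2.start()
    print("F5 result:", out.get())
    p1.join(); p2.join()
```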
In general, parallel and distributed applications developed based on the functional parallel model may lead to race conditions and load imbalance; this occurs because task completion depends on many variables, such as the task size, the type of computer used, the available memory size, and the current load on the communication system. Furthermore, the amount of parallelism that can be supported by this paradigm is limited by the number of functions associated with the application. The performance of a distributed application can be improved by decomposing the application functions into smaller functions; how far this can be taken depends on the type of application and the available computing and communication resources.

Data Parallel Model

In the data parallel model, which is also referred to as the synchronous model, the entire data set is partitioned among the computers involved in the execution of a distributed application such that each computer is assigned a subset of the whole data set [Hillis and Steele, 1986]. In this model, each computer runs the same program but operates on a different data set; this is referred to as Single Program Multiple Data (SPMD). Figure 2.7(c) shows how the distributed application shown in Figure 2.7(a) can be implemented using the data parallel model. In this case, every computer executes the five functions associated with the application, but each computer operates on a different data set. The data parallel model has been argued for favorably by some researchers because it can be used to solve a large number of important problems; it has been shown that the majority of real applications can be solved using the data parallel model [Fox, Williams and Messina, 1994]. Furthermore, it is easier to develop applications based on the data parallel paradigm than on the functional parallel paradigm. In addition, the amount of parallelism that can be exploited in the functional parallel model is fixed and independent of the size of the data set, whereas in the data parallel model, the data parallelism increases with the size of the data [Hatcher and Quinn, 1991]. Other researchers favor the functional parallel model, arguing that large scale applications can be mapped naturally onto the functional paradigm. In summary, we need to efficiently exploit both the functional and data parallelism in a given large distributed application in order to achieve a high performance distributed computing environment.
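The data parallel (SPMD) style described above can be sketched as follows: the same program runs on every worker, but each worker operates on its own partition of the data set. The partitioning scheme, the stand-in computation and the final reduction are chosen for illustration only.

```python
from multiprocessing import Pool

def node_program(partition: list[float]) -> float:
    # Every "computer" executes the same program on its own chunk of the data.
    return sum(x * x for x in partition)

def partition_data(data: list[float], workers: int) -> list[list[float]]:
    size = (len(data) + workers - 1) // workers
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = [float(i) for i in range(1_000)]
    chunks = partition_data(data, workers=4)
    with Pool(processes=4) as pool:
        partial_results = pool.map(node_program, chunks)    # run the SPMD tasks
    print("sum of squares:", sum(partial_results))          # combine the partial results
```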
2.4.2 Distributed Communication Models

Message Passing Model

The message passing model uses a micro-kernel (or a communication library) to pass messages between local and remote processes as well as between processes and the operating system. In this model, messages are the main technique for all interactions between a process and its environment, including other processes. Application developers need to be explicitly involved in writing the communication and synchronization routines required for two remote processes or tasks to interact and collaborate on solving one application. Depending on the relationship between the communicating processes, one can identify two types of message passing paradigms: peer-to-peer message passing and master-slave message passing. In peer-to-peer message passing, any process can communicate with any other process in the system; this type is usually what is meant by the term message passing. In the master-slave type, the communications are only between the master and the slave processes, as in the remote procedure call paradigm. In what follows, we briefly describe these two types of message passing.

In the peer-to-peer message passing model, two basic communication primitives are available to the users: SEND and RECEIVE. However, there are many different ways to implement the SEND and RECEIVE primitives, depending on the required type of communication between the source and destination processes: blocking or non-blocking, synchronous or asynchronous. The main limitation of this model is that programmers must consider many issues while writing a distributed program, such as synchronizing request and response messages, handling data representations (especially when heterogeneous computers are involved in the transfer), managing machine addresses, and handling system failures that could be related to communication network or computer failures [Singhal, 1994]. In addition, debugging and testing message passing programs is difficult because their executions are time-dependent and because of the asynchronous nature of the system.

The remote procedure call (RPC) mechanism has been used to alleviate some of the difficulties encountered in programming parallel and distributed applications. The procedure call mechanism within a program is a well-understood technique for transferring control and data between the calling and called programs. The RPC is an extension of this concept that allows a calling program on one computer to transfer control and data to the called program on another computer. The RPC system hides all the details related to transferring control and data between processes and gives them the illusion of calling a local procedure within a program. The remote procedure call model provides a methodology for communication between the client and server parts of a distributed application. In this model, the client requests a service by making what appears to be a procedure call. If the relevant server is remote, the call is translated into a message using the underlying RPC mechanism and then sent over the communication network. The appropriate server receives the request, executes the procedure and returns the result to the client.
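A minimal RPC sketch is shown below using Python's standard xmlrpc library; the procedure, service name and port are invented for the example, and the server is run in a background thread only so the sketch is self-contained. The point is that the client invokes what looks like a local procedure call, while the RPC machinery marshals the request into a message and ships it to the server.

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
import xmlrpc.client

def add(a: int, b: int) -> int:
    """The remote procedure executed on the server."""
    return a + b

def run_server(host: str = "localhost", port: int = 8000) -> SimpleXMLRPCServer:
    server = SimpleXMLRPCServer((host, port), allow_none=True, logRequests=False)
    server.register_function(add, "add")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

if __name__ == "__main__":
    server = run_server()
    # Client side: the proxy object makes the remote call look like a local one.
    proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
    print("2 + 3 =", proxy.add(2, 3))   # actually executed on the server
    server.shutdown()
```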
Shared Memory Model

In the message passing model, the communication between processes is controlled by a protocol and involves explicit cooperation between processes. In the shared memory model, communication is not explicitly controlled, and it requires the use of a global shared memory. The two forms of communication can be compared using the following analogies: message communication resembles the operation of a postal service in sending and receiving mail, and a simpler form of message communication can be achieved using a shared mailbox scheme. The shared memory scheme, on the other hand, can be compared to a bulletin board, such as those found in a grocery store or supermarket, where users post information such as ads for merchandise or help-wanted notices. The shared memory acts as a central repository for information that can be read or updated by anyone involved.

Figure 2.8 Distributed Shared Memory Model (a shared memory abstraction built over the per-computer memories of CPUs connected by a network)

Most distributed applications have been developed based on the message passing model. However, the current advances in networking and software tools have made it possible to implement distributed applications based on the shared memory model. In this approach, a global virtual address space is provided such that processes or tasks can use this address space to point to the location where shared data is stored or retrieved. In this model, application tasks or processes can access shared data by just providing a pointer or an address, regardless of where the data is stored. Figure 2.8 shows how a Distributed Shared Memory (DSM) system can be built using the physical memory systems available in each computer. The advantages of the DSM model include ease of programming, ease of transferring complex data structures, no need for data encapsulation as in the message passing model, and portability (programs written for multiprocessor systems can be ported easily to this environment) [Stumm and Zhou, 1990].

The main differences between the message passing and shared memory models can be highlighted as follows: 1) communication between processes using the shared memory model is simpler because the communicated data can be accessed by performing read operations as if the data were local; in a message passing system, a message must be passed from one process to another, and many other issues must be considered in order to transfer inter-process messages efficiently, such as buffer management and allocation, the routing scheme, flow control, and error control; and 2) a message passing system is scalable and can support a large number of heterogeneous computers interconnected by a variety of processor interconnect schemes, whereas the shared memory model is not as scalable, because the complexity of the system increases significantly when the number of computers involved in the distributed shared memory becomes large.

2.5 Summary

The distributed computing systems field is relatively new, and as a result there is no general consensus on what constitutes a distributed system and how to characterize and design such computing systems. In this chapter, we have presented the design issues of distributed systems in a three-layer design model: 1) the Network, Protocol, and Interface (NPI) layer, 2) the System Architecture and Services (SAS) layer, and 3) the Distributed Computing Paradigms (DCP) layer. Each layer defines the design issues and technologies that can be used to implement the distributed system components of that layer. The NPI layer addresses the main issues encountered during the design of the communication system. This layer is decomposed into three sub-layers: networks, communication protocols and network interfaces. Each sub-layer denotes one important communication component (subsystem) required to implement the distributed system's communication system. The SAS layer represents the designers', developers', and system managers' view of the system. It defines the main components of the system, the system structure or architecture, and the system level services required to develop distributed computing applications. Consequently, this layer is decomposed into two sub-layers: architectural models and system level services. The architectural models describe the structure that interconnects the main components of the system and how they perform their functions. These models can be broadly classified into four categories: the server model, the pool model, the integrated model, and the hybrid model.
The majority of distributed systems that are currently in use or under development are based on the server model (which is also referred to as the workstation or client/server model). The distributed system level services could be provided by augmenting the basic functions of an existing operating system. These services should support global system state or knowledge, inter-process communication, distributed file service, concurrency control, redundancy management, load balancing and scheduling, fault tolerance and security. The Distributed Computing Paradigms (DCP) layer represents the programmer (user) perception of the distributed system. It focuses on the programming models that can be used to develop distributed applications. The design issues of this layer can be classified into two models: the computation and communication models. The computation model describes the mechanisms used to implement the computational tasks associated with a given application. These mechanisms can broadly be described by two models: functional parallel and data parallel. The communication models describe how the computational tasks exchange information during the application execution. The communication models can be grouped into two types: Message Passing (MP) and Shared Memory (SM).

2.6 Problems

1. Explain the Distributed System Reference Model.
2. What are the main issues involved in designing a high performance distributed computing system?
3. With the rapid deployment of ATM-based networks, hub-based networks are expected to be widely used. Explain why these networks are attractive.
4. Compare the functions of the data link layer with those offered by the transport layer in the ISO OSI reference model.
5. Suppose you wanted to perform the task of finding all the primes in a list of numbers using a distributed system.
• Develop three distributed algorithms for finding the prime numbers in the list, one based on each of the following programming models: (i) functional parallel, (ii) data parallel, and (iii) remote procedure call.
• Choose one of the algorithms you described in part 1 and show how this algorithm can be implemented using each of the following two communication models: (i) message passing and (ii) shared memory.
6. Compare the distributed system architectural models by showing their advantages and disadvantages. For each architectural model, define the set of applications that are most suitable for that model.
7. You are asked to design a distributed system lab that supports the computing projects and assignments of computer engineering students. Show how the Distributed System Reference Model can be used to design such a system.

References

1. Liebowitz, B.H., and Carson, J.H., ``Multiple Processor Systems for Real-Time Applications'', Prentice-Hall, 1985.
2. Weitzman, Cay, ``Distributed Micro/Minicomputer Systems: Structure, Implementation, and Application'', Englewood Cliffs, N.J.: Prentice-Hall, 1980.
3. Halsall, Fred, ``Data Communications, Computer Networks and Open Systems'', 3rd ed., Addison-Wesley, 1992.
4. LaPorta, T.F., and Schwartz, M., ``Architectures, Features, and Implementations of High-Speed Transport Protocols'', IEEE Network Magazine, May 1991.
5. Mullender, S., ``Distributed Systems'', Second Edition, Addison-Wesley, 1993.
6. Coulouris, G.F., and Dollimore, J., ``Distributed Systems: Concepts and Design'', Addison-Wesley, 1988.
7. Hillis, W.D., and Steele, G., ``Data Parallel Algorithms'', Comm. ACM, 29:1170, 1986.
8. Hatcher, P.J., and Quinn, M.J., ``Data-Parallel Programming on MIMD Computers'', MIT Press, Cambridge, Massachusetts, 1991.
9. Singhal, Mukesh, ``Advanced Concepts in Operating Systems: Distributed, Database, and Multiprocessor Operating Systems'', McGraw-Hill, 1994.
10. IBM, ``Distributed Computing Environment: Understanding the Concepts'', IBM Corp., 1993.
11. Stumm, M., and Zhou, S., ``Algorithms Implementing Distributed Shared Memory'', Computer, Vol. 23, No. 5, May 1990, pp. 54-64.
12. Nitzberg, B., and Lo, V., ``Distributed Shared Memory: A Survey of Issues and Algorithms'', Computer, Aug. 1991, pp. 52-60.
13. Li, K., and Hudak, P., ``Memory Coherence in Shared Virtual Memory Systems'', ACM Trans. Computer Systems, Vol. 7, No. 4, Nov. 1989, pp. 321-359.
14. Li, K., and Schaefer, R., ``A Hypercube Shared Virtual Memory System'', 1989 Inter. Conf. on Parallel Processing, pp. 125-132.
15. Fleisch, B., and Popek, G., ``Mirage: A Coherent Distributed Shared Memory Design'', Proc. 14th ACM Symp. Operating System Principles, ACM, New York, 1989, pp. 211-223.
16. Bennett, J., Carter, J., and Zwaenepoel, W., ``Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence'', Proc. 1990 Conf. Principles and Practice of Parallel Programming, ACM Press, New York, N.Y., 1990, pp. 168-176.
17. Ramachandran, U., and Khalidi, M.Y.A., ``An Implementation of Distributed Shared Memory'', First Workshop on Experiences with Building Distributed and Multiprocessor Systems, Usenix Assoc., Berkeley, Calif., 1989, pp. 21-38.
18. Dubois, M., Scheurich, C., and Briggs, F.A., ``Synchronization, Coherence, and Event Ordering in Multiprocessors'', Computer, Vol. 21, No. 2, Feb. 1988, pp. 9-21.
19. Bennett, J.K., ``The Design and Implementation of Distributed Smalltalk'', Proc. of the Second ACM Conf. on Object-Oriented Programming Systems, Languages and Applications, Oct. 1987, pp. 318-330.
20. Katz, R., Eggers, S., Wood, D., Perkins, C.L., and Sheldon, R., ``Implementing a Cache Consistency Protocol'', Proc. of the 12th Annu. Inter. Symp. on Computer Architecture, June 1985, pp. 276-283.
21. Dasgupta, P., LeBlanc, R.J., Ahamad, M., and Ramachandran, U., ``The Clouds Distributed Operating System'', IEEE Computer, 1991, pp. 34-44.
22. Fleisch, B., and Popek, G., ``Mirage: A Coherent Distributed Shared Memory Design'', Proc. 14th ACM Symp. Operating System Principles, ACM, New York, 1989, pp. 211-223.
23. Lenoski, D., et al., ``The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor'', Proc. 17th Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 2047, 1990, pp. 148-159.
24. Bisiani, R., and Ravishankar, M., ``Plus: A Distributed Shared-Memory System'', Proc. 17th Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 2047, 1990, pp. 115-124.
25. Bennett, J., Carter, J., and Zwaenepoel, W., ``Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence'', Proc. 1990 Conf. Principles and Practice of Parallel Programming, ACM Press, New York, N.Y., 1990, pp. 168-176.
26. Cheriton, D.R., ``Problem-Oriented Shared Memory: A Decentralized Approach to Distributed Systems Design'', Proceedings of the 6th International Conference on Distributed Computing Systems, May 1986, pp. 190-197.
27. Bernabeu Auban, J.M., Hutto, P.W., Khalidi, M.Y.A., Ahamad, M., Appelbe, W.F., Dasgupta, P., LeBlanc, R.J., and Ramachandran, U., ``Clouds: A Distributed, Object-Based Operating System: Architecture and Kernel Implementation'', European UNIX Systems User Group Autumn Conference, EUUG, October 1988, pp. 25-38.
28. Armand, F., Herrmann, F., Gien, M., and Rozier, M., ``Chorus, a New Technology for Building UNIX Systems'', European UNIX Systems User Group Autumn Conference, EUUG, October 1988, pp. 1-18.
29. Delp, G., ``The Architecture and Implementation of Memnet: A High-Speed Shared Memory Computer Communication Network'', doctoral dissertation, University of Delaware, Newark, Del., 1988.
30. Zhou, S., et al., ``A Heterogeneous Distributed Shared Memory'', to be published in IEEE Trans. Parallel and Distributed Systems.
31. Fox, G.C., Williams, R.D., and Messina, P.C., ``Parallel Computing Works!'', Morgan Kaufmann, 1994.
32. Tolmie, D.E., and Tantawy, A. (ed.), ``High Performance Networks: Technology and Protocols'', Norwell, Massachusetts, Kluwer Academic Publishers, 1994.
33. Stallings, W., ``Advances in Local and Metropolitan Area Networks'', Los Alamitos, California, IEEE Computer Society Press, 1994.
34. Jain, B.N., ``Open Systems Interconnection: Its Architecture and Protocols'', New York, McGraw-Hill, 1993.
35. Casavant, T.L., et al. (ed.), ``Parallel Computers: Theory and Practice'', IEEE Computer Society Press, 1996.
36. Quinn, M.J., ``Parallel Computing: Theory and Practice'', New York, McGraw-Hill, 1994.
37. Tanenbaum, A.S., ``Computer Networks'', 2nd Edition, Prentice-Hall, 1988.

Chapter 3 Computer Communication Networks

Objective of this chapter: The performance of any distributed system is significantly dependent on its communication network. The performance of the computer network is even more important in high performance distributed systems: if the network is slow or inefficient, the distributed system's performance becomes slower than what can be achieved using a single computer. The main objective of this chapter is to briefly review the basic principles of computer network design and then focus on the high speed network technologies that will play an important role in the development and deployment of high performance distributed systems.

Key Terms
LAN, MAN, WAN, LPN, Ethernet, CSMA/CD, FDDI, DQDB, ATM, Infiniband, Wireless LAN

3.1 Introduction

Computer networking techniques can be classified based on the transmission medium (fiber, copper, wireless, satellite, etc.), the switching technique (packet switching or circuit switching), or distance. The most widely used classification is the one based on distance, which divides computer networks into four categories: Local Area Networks (LAN), Metropolitan Area Networks (MAN), Wide Area Networks (WAN), and Local Peripheral Networks (LPN). An LPN covers a relatively short distance (tens or hundreds of meters) and is used mainly to interconnect input/output subsystems to one or more computers. A LAN covers a building or a campus area (a few kilometers) and usually has a simple topology (bus, ring, or star). Most of the current distributed systems are designed using LANs that operate at 10 to 1000 million bits per second (Mbps). A MAN covers a larger distance than a LAN, typically a city or a region (around 100 kilometers). LANs are usually owned and controlled by one organization, whereas MANs typically use the services of one or more telecommunication providers. A WAN can cover a whole country or one or more continents, and it utilizes the network(s) provided by several telecommunications carriers.
In this chapter, we briefly review each computer network type (LAN, MAN, WAN and LPN) and then discuss in detail the high speed network technology associated with each computer network type. The high speed networks to be discussed in detail include Fiber Distributed Data Interface (FDDI), Distributed Queue Dual Bus (DQDB) and Asynchronous Transfer Mode (ATM) networks. A more detailed description of other types of computer networks can be found in texts that focus mainly on computer networks [Tanenbaum, 1988; Stallings, 1995; Strohl, 1991; Kalmanek, Kanakia and Keshav, 1990].

3.2 LOCAL AREA NETWORKS (LAN)

LANs typically support data transmission rates from 10 Mbps to gigabits per second (Gbps). Since LANs are designed to span short distances, they are able to transmit data at high rates with a very low error rate. The topology of a LAN is usually simple and can be either a bus, a ring or a star. The IEEE 802 LAN standards shown in Figure 3.1 define the bottom three layers of the OSI Reference Model: layer one is the physical layer, layer two is composed of the medium access control (MAC) sub-layer and the logical link control (LLC) sub-layer, and layer three is the network layer.

Figure 3.1 IEEE 802 LAN Standards (802.1 overview, architecture, management and bridging; 802.2 logical link control; 802.10 security and privacy; MAC and physical layer standards 802.3 CSMA/CD, 802.4 token bus, 802.5 token ring, 802.6 MAN, 802.9 integrated voice/data, 802.11 wireless, and future extensions; 802.7 broadband and 802.8 fiber optic technical advisory groups)

The main IEEE LAN standards include Ethernet, Token Ring and FDDI. Ethernet is by far the most widely used LAN technology, whereas FDDI is a mature high speed LAN technology.

3.2.1 ETHERNET

Ethernet (IEEE 802.3) is the name of a popular LAN technology invented at Xerox PARC in the early 1970s. Xerox used an Ethernet-based network to develop a distributed computing system in which users on workstations (clients) communicated with servers of various kinds, including file and print servers. The nodes of an Ethernet are connected to the network cable via a transceiver and a tap. The tap and transceiver make the physical and logical connection onto the Ethernet cable. The transceiver contains logic that controls the transmission and reception of serial data to and from the cable. Ethernet uses the Carrier Sense Multiple Access with Collision Detection (CSMA/CD) protocol to control access to the cable [Bertsekas, 1995; Dalgic, Chien, and Tobagi, 1994]. CSMA is a medium access protocol in which a node listens to the Ethernet medium before transmitting. The probability of a collision is reduced because the transmitting node transmits its message only after it has found the transmission medium to be idle. When a node finds that the medium is inactive, it begins transmitting after waiting a mandatory period to allow the network to settle. However, because of the propagation delay, there is a finite probability that two or more nodes on the Ethernet will simultaneously find the medium in the idle state. Consequently, two (or more) transmissions might start at the same time, which results in a collision. In CSMA/CD, collisions are detected by comparing the transmitted data with the data received from the Ethernet medium to see if the message on the medium matches the message being transmitted. The detection of a collision requires the collided nodes to retransmit their messages. Retransmission can occur at random or it can follow the exponential back-off rule, an algorithm that generates a random time interval determining when each collided station may retransmit its message.
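The sketch below illustrates the retransmission procedure with truncated binary exponential back-off in a highly simplified form; the toy "medium" and collision probability are invented, and the slot time shown is the one commonly associated with 10 Mbps Ethernet. The point is only how the random waiting interval grows with each successive collision.

```python
import random

SLOT_TIME = 51.2e-6      # 10 Mbps Ethernet slot time, in seconds
MAX_ATTEMPTS = 16        # give up after this many collisions

def backoff_delay(collisions: int) -> float:
    """Pick a random delay of k slot times, with k in [0, 2^min(collisions, 10) - 1]."""
    k = random.randint(0, 2 ** min(collisions, 10) - 1)
    return k * SLOT_TIME

def transmit(frame: str, medium_busy, collided) -> bool:
    """Attempt to send a frame; returns True on success, False if aborted."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        while medium_busy():          # carrier sense: wait for an idle medium
            pass
        if not collided():            # no collision detected: frame delivered
            return True
        delay = backoff_delay(attempt)
        print(f"collision #{attempt} on {frame!r}: backing off {delay:.6f} s")
        # A real adapter would now wait 'delay' seconds before retrying.
    return False                      # too many collisions: report failure

if __name__ == "__main__":
    # Toy environment: the medium is always idle and a collision happens
    # with probability 0.3 on each attempt.
    transmit("frame-1",
             medium_busy=lambda: False,
             collided=lambda: random.random() < 0.3)
```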
High Speed Ethernet Technologies

Recently, there has been increased interest in the industry in reviving Ethernet by introducing switched Fast Ethernet and Gigabit Ethernet. Fast Ethernet is similar to Ethernet, only ten times faster. Unlike other emerging high speed network technologies, Ethernet has been installed for over 20 years in business, government, and educational networks. Fast Ethernet uses the same medium access control (MAC) protocol used in Ethernet (the CSMA/CD protocol). This makes the transition from Ethernet to Fast Ethernet, as well as the internetworking between Ethernet and Fast Ethernet, straightforward. Fast Ethernet can work with unshielded twisted-pair cable and thus can be built upon the existing Ethernet wiring. This makes Fast Ethernet attractive when compared to other high speed networks such as FDDI and ATM, which require fiber optic cables and therefore make the upgrade of an existing legacy network to such high speed network technologies costly. In designing the Fast Ethernet MAC, and to make it interoperate easily with existing Ethernet networks, the duration of each transmitted bit is reduced by a factor of 10. Consequently, the packet speed is increased ten-fold compared to Ethernet, while the packet format and length, error control, and management information remain identical to those of Ethernet. However, the maximum distance between a computer and the Fast Ethernet hub/switch depends on the type of cable used and ranges between 100 and 400 m. This high speed network technology is attractive because it is the same Ethernet technology, but 10 or 100 times faster. Also, Fast Ethernet can be deployed as a switched or shared technology, as is Ethernet. However, its scalability and maximum distance are limiting factors when compared with fiber-based high speed network technologies (e.g., ATM technology).

Gigabit Ethernet (GigE) is a step further from Fast Ethernet. It supports Ethernet speeds of 1 Gbps and above and, like Ethernet, uses Carrier Sense Multiple Access with Collision Detection (CSMA/CD) as the access method. It supports both half-duplex and full-duplex modes of operation. It preserves the existing frame size of 64-1518 bytes specified by IEEE 802.3 for Ethernet. GigE offers speeds comparable to ATM at much lower cost and can support packets belonging to time-sensitive applications, in addition to video traffic. The IEEE 802.3z standard under development will be the standard for GigE. However, GigE can support a range of only 3-5 km.

3.2.2 FIBER DISTRIBUTED DATA INTERFACE (FDDI)

FDDI is a high speed LAN proposed in 1982 by the X3T9.5 committee of the American National Standards Institute (ANSI). The X3T9.5 standard for FDDI describes a dual counter-rotating ring LAN that uses a fiber optic medium and a token passing protocol [Tanenbaum, 1988; Ross, 1989; Jain, 1991]. The FDDI transmission rate is 100 Mbps. The need for such a high speed transmission rate has grown from the need for a standard high speed interconnection between computers and their peripherals. FDDI is suitable for front-end networks, typically within an office or a building, which provide interconnection between workstations, file servers, database servers, and low-end computers.
Also, the high throughput of the FDDI network makes it an ideal network for building a high performance backbone that bridges together several lower speed LANs (Ethernet LANs, token rings, token buses).

Figure 3.2 FDDI ring structure showing the primary and secondary loops

FDDI uses optical fiber with light emitting diodes (LEDs) transmitting at a nominal wavelength of 1300 nanometers. The total fiber path can be up to 200 kilometers (km) and can connect up to 500 stations separated by a maximum distance of 2 km. An FDDI network consists of stations connected by duplex optical fibers that form dual counter-rotating rings, as shown in Figure 3.2. One of the rings is designated as the primary ring and the other as the secondary ring. In normal operation, data is transmitted on the primary ring. The secondary ring is used as a backup to tolerate a single failure in the cable or in a station. Once a fault is detected, a logical ring is formed using both the primary and secondary rings to bypass the faulty segment or station, as shown in Figure 3.3.

Figure 3.3 FDDI Rings

FDDI Architecture and the OSI Model

The FDDI protocol is mainly concerned with only the bottom two layers of the OSI reference model: the physical layer and the data link layer, as shown in Figure 3.4. The physical layer is divided into the Physical Layer Protocol (PHY) sub-layer and the Physical Medium Dependent (PMD) sub-layer. The PMD sub-layer focuses on defining the transmitted and received signals, and specifying power levels and the types of cables and connectors to be used in FDDI. The physical layer protocol focuses on defining symbols, coding and decoding techniques, clocking requirements, link states and data framing formats. The data link layer is subdivided into a Logical Link Control (LLC) sub-layer (the LLC sub-layer is not part of the FDDI protocol specifications) and a Media Access Control (MAC) sub-layer. The MAC sub-layer provides the procedures needed for formatting frames, error checking, token handling, and how each station can address and access the network. In addition, a Station Management (SMT) function is included in each layer to provide control and timing for each FDDI station. This includes node configuration, ring initialization, connection management and error management. In what follows, we discuss the main functions and algorithms used to implement each sub-layer in the FDDI protocol.

Figure 3.4 FDDI and the OSI Model

Physical Medium Dependent (PMD) Sub-Layer

The Physical Medium Dependent sub-layer defines the optical hardware interface required to connect to the FDDI rings. This sub-layer deals with the optical characteristics such as the optical transmitters and receivers, the type of connectors to the media, the type of optical fiber cable, and an optional optical bypass switch. The PMD layer is designed to operate with multimode fiber optics at a wavelength of 1300 nanometers (nm). The distance between two stations is limited to 2 kilometers in order to guarantee proper synchronization and allowable data-dependent jitter. When a single mode fiber is used, a variation of the PMD sub-layer (SMF-PMD) should be used; the single mode fiber extends the distance between two stations to up to 60 kilometers. The PMD document aims to provide a link transmission with a bit error rate (BER) of 2.5 x 10^-10 at the minimum received power level, and better than 10^-12 when the power is 2 dB or more above the minimum received power level.
Physical (PHY) Sub-Layer

The PHY sub-layer provides the protocols and optical hardware components that support a link from one FDDI station to another. Its main functions are to define the coding, decoding, clocking, and data framing required to send and receive data on the fiber medium. The PHY layer supports duplex communication; that is, it provides simultaneous transmission and reception of data to and from the MAC sub-layer. The PHY sub-layer receives data from the MAC sub-layer at the data link layer. It then encodes the data into the 4B/5B code format before it is transmitted on the fiber medium, as shown in Figure 3.5. Similarly, the receiver receives the encoded data from the medium, determines symbol boundaries based on the recognition of a Start Delimiter, and then forwards the decoded symbols to the MAC sub-layer.

Data transmitted on the fiber is encoded in a 4-of-5 group code (the 4B/5B scheme), with each 5-bit group referred to as a symbol, as shown in Figure 3.5. With 5-bit symbols, there are 32 possible symbols: 16 data symbols, each representing 4 bits of ordered binary data; 3 used for starting and ending delimiters; 2 used as control indicators; and 3 used for line-state signaling, which is recognized by the physical layer hardware. The remaining 8 symbols are not used, since they violate code run length and DC balance requirements [Stallings, 1983]. The 4B/5B scheme is relatively efficient in bandwidth, since a 100 Mbps data rate is transmitted on the optical fiber at a 125 Mbps signaling rate. This is better than the Manchester encoding scheme used in Ethernet, where a 10 Mbps data rate is transmitted on the medium at 20 Mbps. After the data is encoded into 4B/5B symbols, it is also translated using a Non-Return to Zero Inverted (NRZI) code before it is sent onto the optical fiber. The NRZI coding scheme reduces the number of transitions in the transmitted data streams and thus reduces the complexity of the FDDI hardware components [Black-Emerging]. In the NRZI coding scheme, instead of sending a zero bit as a low logic level (the absence of light for an optical medium), a zero is transmitted as the absence of a transition from low to high or from high to low, while a one is transmitted as a transition from low to high or from high to low. The advantage of this technique is that it eliminates the need for defining a threshold level; a pre-defined threshold is susceptible to drift in the average bias of the signal. The disadvantage of NRZI encoding is the loss of the self-clocking property present in Manchester encoding. To compensate for this loss, a long preamble is used to synchronize the receiver to the sender's clock.

Figure 3.5 FDDI 4B/5B Code Scheme

The clocking method used in FDDI is point-to-point; all stations transmit using their local clocks. The receiving station decodes the received data by recognizing that a bit 1 is received when the current bit is the complement of the previous bit, and a bit 0 when the current bit is the same as the previous bit. By detecting the transitions in the received data, the receiver station can synchronize its local clock with the transmitter clock. An elasticity buffer (EB) function is used to adjust for the slight frequency difference between the recovered clock and the local station clock. The elasticity buffer is inserted between the receiver, which uses a variable frequency clock to track the clock of the previous transmitting station, and the transmitter at the receiver side, which runs on a fixed frequency clock. The elasticity buffer in each station is reinitialized during the preamble (PA), which precedes each frame or token. The transmitter clock has been chosen with 0.005% stability. With an elasticity buffer of 10 bits, frames of up to 4500 bytes in length can be supported without exceeding the limit of the elasticity buffer.
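The 4B/5B encoding followed by NRZI translation described above can be sketched as follows. The 16 data-symbol code words used here follow the commonly published 4B/5B table; treat the exact values as illustrative (the control, delimiter and line-state symbols are omitted) and consult the FDDI standard for the authoritative list.

```python
FOUR_B_FIVE_B = {
    0x0: "11110", 0x1: "01001", 0x2: "10100", 0x3: "10101",
    0x4: "01010", 0x5: "01011", 0x6: "01110", 0x7: "01111",
    0x8: "10010", 0x9: "10011", 0xA: "10110", 0xB: "10111",
    0xC: "11010", 0xD: "11011", 0xE: "11100", 0xF: "11101",
}

def encode_4b5b(data: bytes) -> str:
    """Map each 4-bit nibble to its 5-bit symbol (high nibble first)."""
    bits = []
    for byte in data:
        bits.append(FOUR_B_FIVE_B[byte >> 4])
        bits.append(FOUR_B_FIVE_B[byte & 0x0F])
    return "".join(bits)

def nrzi(bits: str, initial_level: int = 0) -> str:
    """NRZI: a '1' is sent as a transition, a '0' as no transition."""
    level, line = initial_level, []
    for bit in bits:
        if bit == "1":
            level ^= 1          # toggle the line level on a one
        line.append(str(level)) # a zero keeps the previous level
    return "".join(line)

if __name__ == "__main__":
    symbols = encode_4b5b(b"\xA5")        # one byte -> two 5-bit symbols
    print("4B/5B :", symbols)
    print("NRZI  :", nrzi(symbols))       # line levels actually put on the fiber
```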
The elasticity buffer in each station is reinitialized during the preamble (PA), which precedes each frame or token. The transmitter clock has been chosen with 0.005% stability. With an elasticity buffer of 10 bits, frames of up to 4500 bytes in length can be supported without exceeding the limit of the elasticity buffer.
Media Access Control (MAC) Sub-Layer
The MAC protocol is a timed token ring protocol similar to the IEEE 802.5 standard. The MAC sub-layer controls the transmission of data frames on the ring. The formats of the data and token frames are shown in Figure 3.6. The preamble field is a string of 16 or more non-data symbols that are used to re-synchronize the receiver's clock to the received frame. The frame control field contains information such as whether the frame is synchronous or asynchronous and whether 16 or 48 bit addresses are used. The ring network must support 16 and 48 bit addresses as well as a global broadcast feature to all stations. The frame check field is a 32-bit cyclic redundancy check (CRC) computed over the frame control, address and information fields. The frame status indicates whether the frame was copied successfully, an error was detected and/or the address was recognized. It is used by the source station to determine successful completion of the transmission.
Figure 3.6 Formats of FDDI Data and Token Frames (data frame: PA, SD, FC, DA, SA, INFORMATION, FCS, ED, FS; token: PA, SD, FC, ED)
The basic concept of a ring is that each station repeats the frame it receives to its next station [Ross, 1986; Stallings, 1983]. If the destination station address (DA) of the frame matches the MAC's address, then the frame is copied into a local buffer and the LLC is notified of the frame's arrival. MAC marks the Frame Status (FS) field to indicate three possible outcomes: 1) successful recognition of the frame address, 2) the copying of the frame into a local buffer, or 3) the detection of an erroneous frame. The frame propagates around the ring until it reaches the station that originally placed it on the ring. The transmitting station examines the FS field to determine the success of the transmission. The transmitting station is responsible for removing from the ring all its transmitted frames; this process is referred to as frame stripping. During the stripping phase, the transmitting station inserts IDLE symbols on the ring. If a station has a frame to transmit, it can do so only after the token has been captured. A token is a special frame which indicates that the medium is available for use, as shown in Figure 3.6. The FDDI protocol supports multiple priority levels to assure the proper handling of frames. If the priority of a station does not allow it to capture a token (its priority is less than the priority of the token), it must repeat the token to the next station. When a station captures the token, it removes it from the ring, transmits one or more frames depending on the Token Rotation Time (TRT) and Target Token Rotation Time (TTRT) as will be discussed later, and when it is completed, it issues a new token. The new token indicates the availability of the medium for transmission by another station.
Timed Token Protocol
The FDDI protocol is a timed token protocol that allows a station to have a longer period for transmission when previous stations do not hold the token too long; if they do not have any data frames to send, they relinquish the token immediately. During the initialization, a target token rotation time (TTRT) is negotiated, and the agreed value is stored in each station.
The actual token rotation time (TRT) is stored in each station and is reset each time the token arrives. The amount of traffic, both synchronous and asynchronous, that FDDI allows on the network is related to the following equation: TTRT ≥ TRT + THT, where TRT denotes the token rotation time, that is, the time since the token was last received; THT denotes the token holding time, that is, the time that the station has held onto the token; and TTRT denotes the target token rotation time, that is, the desired average for the token rotation time. Essentially, this equation states that on average, the token must circulate around the ring within a pre-determined amount of time. This property explains why the FDDI protocol is known as a ``timed token protocol''. The TTRT is negotiated and agreed upon by all the stations at initialization of the network. The determination of the TTRT that obtains the best performance has been the subject of many papers and is mainly determined by the desired efficiency of the network, the desired latency in accessing the network, and the expected load on the network [Agrawal, Chen, and Zhao, 1993]. TRT is constantly re-calculated by each station and is equal to the amount of time since the token was last received. THT is the amount of time that a station has held onto the token. A station that has the token and wants to transmit a message must follow two rules: 1) it must transmit any synchronous frames that are required to be transmitted; 2) asynchronous frames may be transmitted only if TTRT ≥ TRT + THT, before the token is released and put back on the ring. Synchronous traffic has priority over asynchronous traffic because of the deadlines that need to be met. In order to reserve bandwidth for asynchronous traffic, the amount of synchronous traffic allocated to each station is negotiated and agreed upon at network initialization. In addition, FDDI has an asynchronous priority scheme with up to 8 levels based upon the following inequality: Ti ≥ TRT + THT, where Ti denotes the time allocated to transmit asynchronous traffic of priority i (i can range from 1 to 8). FDDI also contains a multi-frame asynchronous mode, which supports a continuous dialogue between stations. Two stations may communicate multi-frame information by the use of a restricted token. If a station transmits a restricted token instead of a normal token, then only the station that received the last frame may transmit. If both stations continue to transmit only a restricted token, then a dedicated multi-frame exchange is possible. This feature only affects asynchronous communication. Synchronous communication is unaffected since all stations are still required to transmit any synchronous frames.
Logical Link Control (LLC) Layer
The Logical Link Control Layer is the means by which FDDI communicates with higher level protocols. FDDI does not define an LLC sub-layer but has been designed to be compatible with the standard IEEE 802.2 LLC format.
Station Management (SMT) Functions
The Station Management function monitors all the activities on the FDDI ring and provides control over all station functions. The main functions of the SMT include [Ross, 1986]:
Fault Detection/Recovery
The FDDI protocol contains several techniques to detect, isolate and recover from network failures. The recovery mechanisms can be grouped into two categories: protocol-related failures and physical failures.
Connection Management
This involves controlling the bypass switch in each FDDI node, initializing valid PHY links, and positioning MACs on the appropriate FDDI ring.
Frame Handling
This function assists in network configuration. SMT uses a special frame, the Next Station Address (NSA) frame, to configure the nodes on the FDDI rings.
Synchronous Bandwidth Management
The highest priority in FDDI is given to synchronous traffic, where fixed units of data are to be delivered at regular time intervals. Delivery is guaranteed with a delay not exceeding twice the TTRT. The bandwidth required for synchronous traffic is assigned first and the remaining bandwidth is allocated for the asynchronous traffic. In what follows, we discuss the protocol used to initialize the TTRT interval. For proper operation of FDDI's timed token protocol, every station must agree upon the value of the target token rotation time. This initialization of the network is accomplished through the use of claim frames. If a station wants to change the value of the TTRT, it begins to transmit claim frames with a new value of TTRT. Each station that receives a claim frame must do one of two things: 1) if the value of TTRT in the claim frame is smaller than its current value, then it adopts the new TTRT and relays the claim frame; 2) if the value of TTRT in the claim frame is greater than its current value, then it transmits a new claim frame with its own smaller TTRT. Initialization is complete when a station has received its own claim frame. This means that all stations now have the same value of TTRT. The station that received its own claim frame is now responsible for initializing the network to an operational state. The FDDI protocol guarantees that the maximum delay that can be incurred in transmitting synchronous traffic is double the value of TTRT. Consequently, if a station needs the delay to be less than an upper bound (DELAYMAX), it attempts to set the TTRT to be equal to half of this upper bound, i.e., TTRT = DELAYMAX/2. A station can be connected to one or both rings, and that connectivity determines the type of protocol functions to be supported by each station. There are three basic types of stations: Dual Attachment Station (DAS), Concentrator, and Single Attachment Station (SAS). The DAS type requires two duplex cables, one to each of the adjacent stations. The concentrator is a special DAS that provides connection to the ring for several low-end Single Attachment Stations. An SAS node is connected to only one ring and, as a result, the dual-ring fault tolerance cannot be supported by SAS nodes.
FDDI-II Architecture and OSI Model
One main limitation of the FDDI synchronous protocol is that although on average frames will reach their destination at a periodic rate defined by TTRT, there is a possibility that a frame may reach its destination with an elapsed time greater than TTRT. This will occur under heavy network loading. For example, assume that one station is required to send some synchronous traffic and, when it receives the token, the TRT is equal to TTRT. In this case, no asynchronous frames can be sent, but the station is still required to transmit its synchronous frame. As a result, the token's TRT will be greater than TTRT and this condition may cause a glitch in an audio or video signal, which must be transmitted at a periodic rate. This limitation of the FDDI synchronous protocol has led to the development of FDDI-II.
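Before turning to FDDI-II, the timed-token rules and the TTRT claim process described above can be pulled together in a small sketch. The function and variable names below are hypothetical simplifications used only to illustrate the logic; they are not drawn from the standard.

    # Illustrative sketch of the timed-token transmission rule and TTRT claim resolution.
    def may_send_asynchronous(TTRT, TRT, THT):
        """A station may send asynchronous frames only while TTRT >= TRT + THT."""
        return TTRT >= TRT + THT

    def resolve_claim(local_ttrt, claimed_ttrt):
        """Claim process: every station adopts the smallest TTRT it sees.
        Returns the new local TTRT and whether the station relays the claim or bids again."""
        if claimed_ttrt < local_ttrt:
            return claimed_ttrt, "relay claim"   # adopt the smaller value and relay the frame
        return local_ttrt, "send own claim"      # bid again with the smaller local value

    # A station that needs a synchronous delay bound of DELAYMAX bids TTRT = DELAYMAX / 2,
    # since delivery is guaranteed within twice the TTRT.
    DELAYMAX = 0.020                             # assumed 20 ms end-to-end requirement
    print(resolve_claim(0.050, DELAYMAX / 2))    # (0.01, 'relay claim')
    print(may_send_asynchronous(TTRT=0.010, TRT=0.006, THT=0.003))   # True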
FDDI-II adds to FDDI a circuit switching capability so that it can handle the integration of voice, video and data over an FDDI network. In FDDI-II, the bandwidth is allocated to circuit-switched data in multiples of 6.144 Mbps isochronous channels. The term isochronous refers to the essential characteristics of a time scale or a signal such that the time intervals between consecutive significant instants either have the same duration or durations that are multiples of the shortest duration [Teener, 1989]. The number of isochronous channels can be up to 16, using a maximum of 98.304 Mbps, where each channel can be flexibly allocated to form a variety of highways whose bandwidths are multiples of 8 Kbps (e.g., 8 Kbps, 64 Kbps, 1.536 Mbps or 2.048 Mbps). Consequently, the synchronous and asynchronous traffic may have only 1.024 Mbps of bandwidth when all 16 isochronous channels are allocated. Isochronous channels may be dynamically assigned and de-assigned on a real-time basis, with any unassigned bandwidth allocated to the normal FDDI traffic (synchronous and asynchronous).

Parameters                FDDI (ANSI X3T9.5)         Token Ring (IEEE 802.5)      Ethernet (IEEE 802.3)
Data Rate                 100 Mbps                   4 or 16 Mbps                 10 Mbps
Overall Length            100 Km                     1.2 Km                       2.5 Km
Nodes                     500                        96                           1024
Distance between Nodes    2 Km                       0.46 Km                      0.5 Km
Packet Size (max)         4500 Octets                8191 Octets                  1514 Octets
Medium                    Fiber                      Twisted Pair / Fiber         Coaxial Cable
Medium Access             Dual-ring token passing    Single-ring token passing    CSMA / CD

Table 3.1: Comparison of Ethernet, FDDI and Token Ring

FDDI-II represents a modification to the original FDDI specification such that an additional sub-layer (Hybrid Ring Controller - HRC) has been added to the Data Link Layer. The HRC allows FDDI-II to operate in an upwardly compatible hybrid mode that not only provides the standard FDDI packet transmission capability but also provides an isochronous transport mode. The function of the HRC is to multiplex the data packets. This divides the FDDI-II data stream into multiple data streams, one for each of the wide band channels that has been allocated. More detailed information about the FDDI-II circuit switched data format and how bandwidth is allocated dynamically to isochronous, synchronous and asynchronous traffic can be found in [Ross, 1986].
Copper FDDI (CDDI)
A cost-effective alternative to fiber optic FDDI is a standard that replaces the fiber optic cables with copper. The 100 Mbps copper FDDI (CDDI) standard would use the same protocol as FDDI except that its transmission medium would be the commonplace unshielded twisted-pair or shielded twisted-pair copper wiring. The main advantage of using copper is that copper wiring, connectors, and transceivers are much cheaper. The main tradeoff in using copper wiring is that the maximum distance that could be traversed between nodes would be limited to possibly 50 or 100 meters before electromagnetic interference becomes a problem. This maximum distance is not a severe limiting factor since the CDDI network would be used mainly for communication within a small LAN that is physically located in one room or laboratory. The CDDI network could then interface to the larger FDDI network through a concentrator station. In this case, the FDDI network acts as a backbone network spanning large distances interconnecting smaller CDDI LANs with a great savings in cost.
3.3 Metropolitan Area Networks
The DQDB is emerging as one of the leading technologies for high-speed metropolitan area networks.
DQDB is a media access control (MAC) protocol, which is being standardized as the IEEE 802.6 standard for MANs [Stallings, 1995]. DQDB consists of two 150 Mbps contra-directional buses with two head nodes, one on each bus, that continuously send fixed-length time slots (53 octets) down the buses. The transmission on the two buses is independent and hence the aggregate bandwidth of the DQDB network is twice the data rate of the bus. The cycle period of the DQDB network is equal to 125 microseconds, which has been chosen to support isochronous services, that is, voice services that require an 8 kHz sampling frequency. The DQDB protocol is divided into three layers: the first layer from the bottom corresponds to the physical layer of the OSI reference model, the second layer corresponds to the medium access sublayer, and the third layer corresponds to the data-link layer as shown in Figure 3.7. DQDB protocols support three types of services: connection-less, connection-oriented and isochronous services. The main task of the convergence sublayer within a DQDB network is to map user services into the underlying medium-access service. The connection-less service transmits frames of length up to 9188 octets. Using fixed length slots of 52 octets, DQDB provides the capability to perform frame segmentation and reassembly. The connection-oriented service supports the transmission of 52-octet segments between nodes interconnected by a virtual channel connection. The isochronous service provides a similar service to the connection-oriented service, but for users that require a constant inter-arrival time.
DQDB MAC Protocol
The DQDB standard specifies the Medium Access Control and the physical layers. Each bus independently transfers MAC cycle frames of duration 125 microseconds; each frame contains a frame header and a number of short, fixed-length slots. The frames are generated by the head node of each bus and flow downstream, passing each node before being discarded at the end of the bus. There are two types of slots: Queued Arbitrated (QA) slots and Pre-Arbitrated (PA) slots. QA slots are used to transfer asynchronous segments and PA slots are used to transfer isochronous segments. In what follows, we focus on how the distributed queue algorithm controls the access to the QA slots.
Figure 3.7: Functional Block Diagram of a DQDB Node
The DQDB MAC protocol acts like a single first-in-first-out (FIFO) queue. At any given time, the node associated with the request at the top of the queue is allowed to transmit in the first idle slot on the required bus. However, this single queue does not physically exist; instead it is implemented in a distributed manner using the queues available in each node. This can be explained as follows. Each head of a bus continuously generates slots, each of which contains in its header a BUSY (BSY) bit and a REQUEST (REQ) bit. The busy bit indicates whether or not a segment occupies the slot, while the REQ bit is used for sending requests for future segment transmission. The nodes on each bus count the slots that have the request bit set and the idle slots that pass by, so that they can determine their position in the global distributed queue and consequently determine when they can start transmitting their data.
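As a quick check on these figures, the slot-carrying capacity of one bus over a 125-microsecond cycle can be estimated with a back-of-the-envelope calculation (the result is derived from the numbers above, not quoted from the standard):

    # Approximate number of 53-octet slots carried on one DQDB bus per 125-microsecond cycle.
    bus_rate_bps = 150e6
    cycle_time_s = 125e-6
    slot_bits = 53 * 8
    print(bus_rate_bps * cycle_time_s / slot_bits)   # about 44 slots per cycle per bus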
Several studies [Stallings, 1987] have shown that the DQDB MAC access protocol is not fair because the node waiting time depends on its position with respect to the slot generators. As a result, several changes have been proposed to make the DQDB protocol more fair [22]. Later, we discuss one approach, the Bandwidth Balancing Mechanism (BWB), to address the unfairness issue in DQDB. The DQDB access mechanism associated with one bus can be implemented using two queue buffers and two counters. Without loss of generality, we name bus A the forward bus and bus B the reverse bus. We will focus our description on segment transmission on the forward bus, the procedure for transmission on the reverse bus being the same. To implement the DQDB access mechanism on the forward bus, each node contains two counters - a Request counter (RC) and a Down counter (DC) - and two queues, one for each bus. Each node can be in one of two states: idle, when there is no segment to transmit, or count down.
Idle State: When a node is in the idle state, the node keeps count of the outstanding requests from its downstream nodes using the RC counter. The RC counter is increased by one for each request received on the reverse bus and decreased by one for each empty slot on the forward bus; each empty slot on the forward bus will be used to transmit one segment by downstream nodes. Hence, the value of the Request counter (RC) reflects the number of outstanding requests that have been reserved by the downstream nodes.
Figure 3.8 DQDB MAC Protocol Implementation
Count Down State: When the node becomes active and has a segment to transmit, the node transfers the value of the RC counter to the DC counter and resets the RC counter to zero. The node then sends a request on the reverse bus by setting REQ to 1 in the first slot whose REQ bit equals zero. The DC counter is decreased by one for every empty slot on the forward bus until it reaches zero. Immediately after this event, the node transmits into the first empty slot on the forward bus.
Priority Levels
DQDB supports three levels of priority that can be implemented by using separate distributed queues, and two counters for each priority level. This means that each node will have six counters - a Request Counter and a Down Counter for each priority level. Furthermore, the segment format will have three request bits, one for each priority level. In this case, a node that wants to transmit on bus A at a given priority level will set the request bit corresponding to this priority level in the first slot on bus B in which that bit has not already been set. The Down Counter (DC) is decremented with every free slot passing on bus A, but is incremented for every request on bus B with a higher priority than the counter's priority. The Request Counter (RC) is incremented only when a passing request slot has the same priority level; the higher priority requests have already been accounted for in the Down Counter [Stallings, 1987].
DQDB Fairness
Several research results have shown that DQDB is unfair and that the DQDB unfairness depends on the medium capacity and the bus length [Conti, 1991]. The unfairness in DQDB can result in unpredictable behavior at heavy load. One approach to improve the fairness of DQDB is to use the bandwidth balancing mechanism (BWB).
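The idle and count-down behaviour just described can be summarized in a short sketch of a single node's counters for the forward bus; the class below is a simplified illustrative model, not an implementation of the IEEE 802.6 state machines. (The bandwidth balancing mechanism introduced above is discussed after the sketch.)

    # Simplified model of one DQDB node's access logic for the forward bus.
    class DQDBNode:
        def __init__(self):
            self.rc = 0          # Request counter: outstanding downstream requests
            self.dc = None       # Down counter: None while the node has nothing to send

        def request_seen_on_reverse_bus(self):
            # Requests accumulate in RC; while counting down they are kept for the next access.
            self.rc += 1

        def segment_queued(self):
            """Node becomes active: move RC into DC and issue a request on the reverse bus."""
            self.dc = self.rc
            self.rc = 0
            return "set REQ bit in the first reverse-bus slot with REQ = 0"

        def empty_slot_on_forward_bus(self):
            if self.dc is None:
                self.rc = max(0, self.rc - 1)   # idle: the empty slot serves a downstream request
                return "let the slot pass"
            if self.dc > 0:
                self.dc -= 1                    # count down the requests queued ahead of this node
                return "let the slot pass"
            self.dc = None
            return "transmit the segment in this slot"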
In this mechanism, whenever the DC counter reaches zero and the station transmits in the next empty slot, it sends a signal to the bandwidth balancing machine (BWB). The BWB machine uses a counter to count the number of segments transmitted by its station. Once this counter reaches a given threshold, referred to as BWB-MOD, the counter is cleared and the Request counter (RC) is incremented by one. That means that this station will skip one empty slot, which can then be used by other downstream stations that are further away from the slot generator on the forward bus, thus improving DQDB fairness. The value of BWB-MOD can vary from 0 to 16, where the value 0 means the bandwidth balancing mechanism is disabled [Conti, 1991].
Discussion
A MAN is optimized for a larger geographical area than a LAN, ranging from several blocks of buildings to entire cities. As with local area networks, MANs can also depend on communication channels of moderate-to-high data rates. IEEE 802.6 is an important standard covering this type of network as well as LANs. It offers several transmission rates that initially start at 44.7 Mbps and later expand to speeds ranging from 1.544 Mbps to 155 Mbps. DQDB is different from FDDI and token ring networks because it uses a high speed shared medium that supports three types of traffic: bursty, asynchronous and synchronous. Furthermore, the use of fixed-length packets, which are compatible with ATM, provides efficient and effective support for small and large packets and for isochronous data.
3.4 Wide Area Networks (WANs)
The trend for transmission of information generated from facsimile, video, electronic mail, data, and images has sped up the conversion from analog-based systems to high speed digital networks. The Integrated Services Digital Network (ISDN) has been recommended as a wide area network standard by CCITT and is expected to handle a wide range of services that cover future applications of high speed networks. There are two types of ISDN: Narrowband ISDN (N-ISDN) and Broadband ISDN (B-ISDN). The main goal of N-ISDN is to integrate the various services that include voice, video and data. B-ISDN supports high data rates (hundreds of Mbps). In this section, we discuss the architecture and the services offered by these two types of networks.
3.4.1 Narrowband ISDN (N-ISDN)
The CCITT standard defines an ISDN network as a network that provides end-to-end digital connectivity to support voice and non-voice services (data, images, facsimile, etc.). The network architecture recommendations for ISDN should support several types of networks: packet switching, circuit switching, non-switched, and common-channel signaling. ISDN can be viewed as a digital bit pipe in which multiple sources are multiplexed into this digital pipe. There are several communication channels that can be multiplexed over this pipe, as follows:
• B channel: It operates at a 64 Kbps rate and is used to provide circuit switched, packet switched and semi-permanent circuit interconnections. It is used to carry digital data, digitized voice and mixtures of lower-rate digital traffic.
• D channel: It operates at 16 Kbps and is used for two purposes: for signaling purposes in conjunction with circuit-switched calls on associated B channels, and as a pipe to carry packet-switched or slow-speed telemetry information.
For the H channels, three higher-rate channel speeds are identified: the H0 channel that operates at 384 Kbps, the H11 channel that operates at 1.536 Mbps, and the H12 channel that operates at 1.92 Mbps.
These channels are used for providing higher bit rates for applications such as fast facsimile, high-speed data, high quality audio and video. Two combinations of these channels have been standardized: the basic access rate and the primary access rate. The basic access consists of 2B+D channels, providing 192 Kbps (including 48 Kbps of overhead). Typical applications which use this access mode are those addressing most individual users, including homes and small offices, such as simultaneous use of voice and data applications, teletext, facsimile, etc. These services could use either a single multifunctional terminal or several terminals. Usually a single physical link is used for this access mode. The customer can use all or parts of the two B channels and the D channel. Most present day twisted pair loops will support this mode. The primary access mode is intended for higher data rate communication requirements, which typically fall under the category of nB+D channels. In this mode, the user can use all or part of the B channels and the D channel. This primary access rate service is provided using time division multiplexed signals over four-wire copper circuits or other media. Each B channel can be switched independently; some B channels may be permanently connected depending on the service application. The H channels can also be considered to fall into this category.
Network Architecture and Channels
ISDN Reference Model
ISDN provides users with full network support by adopting the seven layers of the OSI reference model. However, ISDN services are confined to the bottom three layers (physical, data link and network layers) of the OSI reference model. Consequently, ISDN offers three main services [Stallings, 1993]: Bearer Services, Teleservices, and Supplementary Services. Bearer Services offer information transfer without alteration in real time. This service corresponds to the OSI's network service layer. There are various types of bearer services depending on the type of application sought. Typical applications include speech and audio information transfer. Teleservices combine data transportation (using bearer services) and information processing. These services can be considered to be more user friendly services and use terminal equipment. Typical applications are telephony, telefax and other computer to computer applications. These correspond to all the services offered by the several layers of the OSI reference model. Supplementary Services are a mixture of one or more bearer services or teleservices for providing enhanced services, which include Direct-Dial-In, Conference Calling and Credit-card Calling. A detailed description of these services can be found in [Stallings, 1987].
• Physical layer: This layer defines two types of interfaces depending on the type of access, namely the Basic interface (basic access) and the Primary interface (primary access).
• Data link layer: This layer has different link-access protocols (LAP) depending on the channel used for the link, namely LAP-B (balanced, for the B channel) and LAP-D (for the D channel). Apart from these link access techniques, frame-relay access is also a part of the protocol definition.
• Network layer: This layer includes separate protocols for packet switching (X.25), circuit switching, semi-permanent connection and channel signaling.
ISDN User-Network Interfaces
A key aspect of ISDN is that a small set of compatible user-network interfaces can support a wide range of user applications, equipment and configurations.
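Before describing the user-network reference configuration, the access rates quoted above can be reconstructed from the channel rates. The primary-rate figure below assumes the common North American 23B+D arrangement with a 64 Kbps D channel, which is an assumption for illustration rather than something stated above.

    # Composition of the ISDN access rates from the channel rates (values in Kbps).
    B, D_BASIC, D_PRIMARY = 64, 16, 64

    basic_payload = 2 * B + D_BASIC          # 2B+D = 144 Kbps of user and signaling capacity
    basic_line_rate = basic_payload + 48     # plus 48 Kbps of framing overhead = 192 Kbps

    primary_payload = 23 * B + D_PRIMARY     # 23B+D = 1536 Kbps (assumed arrangement)

    print(basic_payload, basic_line_rate, primary_payload)   # 144 192 1536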
The number of user-network interfaces is kept small to maximize user flexibility and to reduce cost. To achieve this goal, ISDN standards define a reference model showing the functional groups and the reference points between the groups. Functional groups are sets of functions needed in ISDN user access arrangements. Specific functions in the functional groups may be performed in one or multiple pieces of actual equipment. Reference points are conceptual points for dividing the functional groups. In specific implementations, reference points may in fact represent a physical interface between two functional groups. The functional groups can be classified into two types of devices: Network Termination (NT1, NT2 and NT12) and Terminal Equipment (TE1 and TE2). Network Termination 1 (NT1) provides functions similar to those offered by the physical layer of the OSI Reference Model. Network Termination 2 (NT2) provides functions equivalent to those offered by layers 1 through 3 of the OSI reference model (e.g., protocol handling, multiplexing, and switching). These functions are typically executed by equipment such as PBXs, LANs, terminal cluster controllers and multiplexers. Network Termination 1,2 (NT12) is a single piece of equipment that combines the functionality of NT1 and NT2. Terminal Equipment (TE) provides functions undertaken by such terminal equipment as digital telephones, data terminal equipment and integrated voice/data workstations. There are two types of TEs, namely TE1 and TE2. TE1 refers to devices that support the standard ISDN interface, while TE2 devices are those which do not directly support ISDN interfaces. Such non-ISDN equipment requires a Terminal Adapter (TA) to connect to an ISDN facility. The reference points define the interfaces between the functional groups and include the Rate (R), System (S), Terminal (T), and User (U) reference points. The R reference point is the functional interface between a non-ISDN terminal and the terminal adapter. The S reference point is the functional interface seen by each ISDN terminal. The T reference point is the functional interface seen by the users of NT1 and NT2 equipment. The U reference point defines the functional interface between the ISDN switch and the network termination equipment (NT1). Standardization of this reference point is essential, especially when NT1s and the Central Office modules are manufactured by different vendors. It is a generally accepted fact that ISDN can not only be used as a separate entity, but also as a tributary network, and can play an important role in hybrid networks. So applications that have been traditionally provided by different networking schemes can now be provided in conjunction with ISDN. Some typical video applications in enterprise-wide networks include video-telephony over 2B+D circuit-switched networks, video conferences over public H0 and H11 links, and reconfigurable private video-conference networks over channel-switched/permanent H0 and H11 links. Medical imaging over 23B+D networks is another of the many ISDN applications.
3.4.2 Broadband Integrated Services Digital Network (B-ISDN)
With the explosive growth of network applications and services, it has been recognized that ISDN's limited bandwidth cannot deliver the required bandwidth for these emerging applications. Consequently, in 1985 the majority of the delegates within CCITT COM XVIII agreed that there was a need for a broadband ISDN (B-ISDN) that allows total integration of broadband services.
And since then, the original ISDN is referred as Narrow-band ISDN (N-ISDN). The selected transfer mode for B-ISDN has changed several times since its inception. So far two types of transfer modes have been used for digital data transmission: Synchronous Transfer Mode (STM) and Asynchronous Transfer Mode (ATM). STM is suitable for traffic that has severe real time requirements (e.g., voice and video traffic). This mode is based on circuit switching service in which the network bandwidth is divided into periodic slots. Each slot is assigned to a call according to the peak rate of the call. However, this protocol is rigid and does not support bursty traffic. The size of data packets transmitted on a computer network varies dynamically depending on the current activity of the system. Furthermore, some traffic on a data communication network is time insensitive. Therefore, the STM is not selected for B-ISDN and the ATM, a packet switching technique, is selected. ATM technology divides voice, data, image, and video into short packets, and transmits these packets by interleaving them across an ATM link. The packet transmission time is equal to the slot length. In ATM the slots are allocated on demand, while for STM periodic slots are allocated for every call. In ATM, therefore, no bandwidth is consumed unless information is actually transmitted. Another important parameter is whether the packet size should be fixed or variable. The main factors that need to be taken into consideration when we compare fixed packet size vs. variable packet sizes are the transmission bandwidth efficiency, the switching performance (i.e. the switching speed, and the switch's complexity) and the delay. Variable packet length is preferred to achieve high transmission efficiency. Because with fixed packet length, a long message has to be divided into several data packets. And each data packet is transmitted with overhead. Consequently, the total transmission efficiency would be low. However, with variable packet length, a long message can be transmitted with only one overhead. Since the speed of switching depends on the functions to be performed, with fixed packet length, the header processing is simplified, and therefore the processing time is reduced. Consequently, from switching point of view, fixed packet length is preferable. From delay perspective, the packets with fixed small size result in minimal functionalities at intermediate switches and take less time in queue memory management; As a result, fixed size packets reduce the experienced delays in the overall network. For broadband network, with large bandwidth, the transmission efficiency is not as critical as high-speed throughput and low latency. The gain in the transmission efficiency brought by the variable packet length strategy is traded off for the gain in the speed and the complexity of switching and the low latency brought by the fixed packet length strategy. In 1988, the CCITT decided to use fixed size cells in ATM. Another important parameter that the CCITT needed to determine, once it decided to adopt fixed size cells, is the length of cells. Two options were debated in the choice of the cell length, 32 bytes and 64 bytes. The choice is mainly influenced by the overall network delay and the transmission efficiency. The overall end-toend delay has to be limited in voice connections, in order to avoid echo cancellers. For a short cell length like 32 bytes, voice connections can be supported without using echo cancellers. 
However, for 64-byte cells, echo cancellers need to be installed. From this point of view, Europe was more in favor of 32 bytes so that echo cancellers could be eliminated. But a longer cell length increases transmission efficiency, which was an important concern to the US and Japan. Finally, a compromise of 48 bytes was reached in the CCITT SG XVIII meeting of June 1989 in Geneva. In summary, ATM network traffic is transmitted in fixed cells with 48 bytes as payload and another 5 bytes as a header for routing through the network. The network bandwidth is allocated on demand, i.e., asynchronously. The cells of different types of traffic (voice, video, imaging, data, etc.) are interleaved on a single digital transmission pipe. This allows statistical multiplexing of the different types of traffic when the burst rate exceeds the available bandwidth for a certain traffic type. An ATM network is highly flexible and can support high-speed data transmission as well as real-time voice and video applications.
3.5 ATM
3.5.1 Virtual Connections
Fundamentally, ATM is a connection-oriented technology, different from connection-less LAN technologies. Before data transmission takes place in an ATM network, a connection needs to be established between the two ends using a signaling protocol. Cells can then be routed to their destinations with minimal information required in their headers. The source and destination IP addresses, which are necessary fields of a data packet in a connection-less network, are not required in an ATM network. The logical connections in ATM are called virtual connections.
Figure 3.9 Relation between Transmission Path, VPs and VCs
Two layers of virtual connections are defined by CCITT: virtual channel (VC) connections (VCC) and virtual path (VP) connections (VPC). One transmission path contains several VPs, as shown in Figure 3.9, and some of them could be permanent or semi-permanent. Furthermore, each VP contains bundles of VCs. By defining VPC and VCC, a virtual connection is identified by two fields in the header of an ATM cell: the Virtual Path Identifier (VPI) and the Virtual Channel Identifier (VCI). The VPI/VCI only have local significance per link in the virtual connection. They are not addresses and are used just for multiplexing and switching packets from different traffic sources. Hence ATM does not have the overhead associated with LANs and other packet switched networks, where packets are forwarded based on headers and addresses that vary in location and size, depending on the protocol used. Instead, an ATM switch only needs to perform a mapping between the VPI/VCI of a cell on the input link and an appropriate VPI/VCI value on the output link.
• Virtual Channel Connection: A virtual channel connection is a logical end-to-end connection. It is analogous to a virtual circuit in an X.25 connection. It is the concatenation of virtual channel links, which exist between two switching points. A virtual channel has traffic usage parameters associated with it, such as cell loss rate, peak rate, bandwidth, quality of service and so on.
• Virtual Path Connection: A virtual path connection is meant to contain bundles of virtual channel connections that are switched together as one unit. The use of virtual paths can simplify the network architecture and increase network performance and reliability, since the network deals with fewer aggregated entities.
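The local, per-link significance of the VPI/VCI can be illustrated with a toy switching table; the port numbers and label values below are made up for illustration and are not taken from any signaling standard.

    # Toy ATM switch: map (input port, VPI, VCI) to (output port, new VPI, new VCI).
    vc_switch_table = {
        (1, 5, 32): (3, 8, 44),
        (2, 5, 33): (3, 8, 45),
    }

    def switch_cell(in_port, vpi, vci):
        out_port, new_vpi, new_vci = vc_switch_table[(in_port, vpi, vci)]
        # Only the local labels are rewritten; no end-to-end address is carried in the cell.
        return out_port, new_vpi, new_vci

    print(switch_cell(1, 5, 32))   # (3, 8, 44)

In a pure VP switch the table would be indexed by (input port, VPI) only and the VCI would be carried through unchanged, which is exactly the distinction drawn in Figure 3.10 below.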
Figure 3.10: Switching in ATM: a) VP Switching, b) VP/VC Switching
The VPI/VCI fields in an ATM cell can be used to support two types of switching: VP switching and VP/VC switching. In a VP switch, the VPI field is used to route the cells in the ATM switch while the VCI values are not changed, as shown in Figure 3.10.
3.5.2 B-ISDN Reference Model
The ATM protocol reference model consists of the higher layers, the ATM layer, the ATM Adaptation Layer (AAL), and the Physical layer. The ATM reference/stack model differs from the OSI (Open System Interconnection) model in its use of planes as shown in Figure 3.11. The portion of the architecture used for user-to-user or end-to-end data transfer is called the User Plane (U-Plane). The Control Plane (C-Plane) performs call connection control. The Management Plane (M-Plane) performs functions related to resources and parameters residing in its protocol entities.
Figure 3.11 Layers of ATM
ATM is connection-oriented, and it uses out-of-band signaling. This is in contrast with the in-band signaling mode of the OSI protocols (X.25), where control packets are inter-mixed with data packets. So during virtual channel connection setup, only the control plane is active. In the OSI model, the two planes are merged and are indistinguishable.
ATM Layers
The ATM layers are shown in Figure 3.11. We briefly discuss each of the layers.
Physical Layer
The Physical Layer provides the transport of ATM cells between two ATM entities. Based on its functionalities, the Physical Layer is segmented into two sublayers, namely the Physical Medium Dependent (PMD) sublayer and the Transmission Convergence (TC) sublayer. This sub-layering separates transmission from physical interfacing, and allows ATM interfaces to be built on a variety of physical interfaces. The PMD sublayer is device dependent. Its typical functions include bit timing and physical medium details such as connectors. The TC sublayer generates and recovers transmission frames. The sending TC sublayer performs the mapping of ATM cells to the transmission system. The receiving TC sublayer receives a bit stream from the PMD, extracts the cells and passes them on to the ATM layer. It generates and checks the HEC (header error control) field in the ATM header, and it also performs cell rate decoupling through deletion and insertion of idle cells.
Synchronous Optical Network (SONET)
SONET (Synchronous Optical NETwork), also known internationally as the Synchronous Digital Hierarchy (SDH), is a physical layer transmission standard of B-ISDN (Broadband Integrated Services Digital Network). SONET is a set of physical layers, originally proposed by Bellcore, for specifying standards for optic fiber based transmission line equipment. It defines a set of framing standards, which dictate how bytes are transmitted across the links, together with ways of multiplexing existing line frames (T1, T3, etc.) into SONET. The lowest SONET rate, called STS-1, defines 8 kHz frames of 9 rows by 90 bytes. The first 3 bytes of each row are used for Operation, Administration and Management (OAM) purposes and the remaining 87 bytes are used for data. This gives a data rate of 51.84 Mbps. The next standard rate is STS-3, with 9 bytes per row for OAM and 261 bytes per row for data, providing a 155.52 Mbps data rate.
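The STS-1 and STS-3 rates quoted above follow directly from the frame geometry (8,000 frames per second), and the higher rates listed next follow the same pattern:

    # SONET line rates from the frame geometry: 9 rows x (N x 90) bytes, 8000 frames per second.
    def sts_rate_mbps(n):
        return 9 * (n * 90) * 8 * 8000 / 1e6

    print(sts_rate_mbps(1), sts_rate_mbps(3), sts_rate_mbps(12))   # 51.84 155.52 622.08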
There are other higher speed SONET standards available: STS-12 - 622.08 Mbps, STS-24 - 1.244 Gbps, STS-48 - 2.488 Gbps and so on (STS-N - N x 51.84 Mbps). The capabilities of SONET are mapped onto a 4-layer hierarchy, namely Photonic (responsible for conversion between electrical and optical signals and the specification of the physical layer), Section (functionalities between repeater and multiplexer), Line (functionalities between multiplexers) and Path (end-to-end transport functions).
Figure 3.12 ATM Protocol Stack
ATM Cell Format
The ATM Layer performs multiplexing and de-multiplexing of cells from different connections (identified by different VPIs/VCIs) onto a single cell stream. It extracts cell headers from received cells and adds cell headers to the cells being transmitted. Translation of VCI/VPI may be required at ATM switches. Figure 3.13 (a) shows the ATM cell format. Cell header formats for the UNI (User-Network Interface) and the NNI (Network-Network Interface) are shown in Figures 3.13 (b) and (c), respectively. The functions of the various fields in the ATM cell headers are as follows:
• Generic Flow Control (GFC): This is a 4-bit field used only across the UNI to control traffic flow across the UNI and alleviate short term overload conditions, particularly when multiple terminals are supported across a single UNI.
• Virtual Path Identifier (VPI): This is an 8-bit field across the UNI and 12 bits across the NNI. For idle cells or cells with no information the VPI is set to zero, which is also the default value for the VPI. The use of non-zero values of the VPI across the NNI is well understood (for trunking purposes); however, the procedures for accomplishing this are under study.
Figure 3.13 ATM Cell Format: a) ATM cell format, b) ATM cell header at UNI, c) ATM cell header at NNI
• Virtual Channel Identifier (VCI): The 16-bit VCI is used to identify the virtual circuit in a UNI or an NNI. The default value for the VCI is zero. Typically VPI/VCI values are assigned symmetrically; that is, the same values are reserved for both directions across a link.
• Payload Type Identifier (PTI): This is a 3-bit field for identifying the payload type as well as for identifying the control procedures. When bit 4 in the octet is set to 0, it means it is a user cell. For user cells, if bit 3 is set to 0, it means that the cell did not experience any congestion in the relay between two nodes. Bit 2 for a user cell is used to indicate the type of user cell. When bit 4 is set to 1, it implies the cell is used for management functions such as error indications across the UNI.
• Cell Loss Priority (CLP): This field is used to provide guidance to the network in the event of congestion. The CLP bit is set to 1 if a cell can be discarded during congestion. The CLP bit can be set by the user or by the network. An example of the network setting it is when the user exceeds the committed bandwidth and the link is under-utilized.
• Header Error Check (HEC): This is an 8-bit Cyclic Redundancy Code (CRC) computed over all fields in the ATM cell header. It is capable of detecting all single bit errors and certain multiple bit errors. It can also be used to correct single bit errors, but this is not mandatory.
ATM Adaptation Layer
The AAL Layer provides the proper interface between the ATM Layer and the higher layers. It enhances the services provided by the ATM Layer according to the requirements of specific applications: real-time, constant bit rate or variable bit rate. Accordingly, the services provided by the AAL Layer can be grouped into four classes. The AAL Layer has five types of protocols to support the four classes of traffic patterns. The corresponding relation between the class of service and the type of AAL protocol is as follows.
• Type 1: Supports Class A applications, which require constant bit rate (CBR) services with a time relation between source and destination. Error recovery is not supported. Examples include real-time voice messages, video traffic and some current data video systems.
• Type 2: Supports Class B applications, which require variable bit rate (VBR) services with a time relation between source and destination. Error recovery is also not supported. Examples are teleconferencing and encoded image transmission.
• Type 3: Supports Class C applications, which are connection-oriented (CO) data transmission applications. A time relation between source and destination is not required. It is intended to provide services to applications that use a network service like X.25.
• Type 4: Supports Class D applications, which are connection-less (CL) data transmission applications. A time relation between source and destination is not required. Current datagram networking applications like TCP/IP or TP4/CLNP belong to Class D. Since the protocol formats of AAL type 3 and type 4 are similar, they have been merged into AAL type 3/4.
• Type 5: This type was developed to reduce the overhead related to AAL type 3/4. It supports connection-oriented services more efficiently. It is often referred to as the 'Simple and Efficient AAL', and it is used for Class C applications.
The AAL layer is further divided into two sublayers: the convergence sublayer (CS) and the segmentation-and-reassembly sublayer (SAR). The CS is service dependent and provides the functions needed to support specific applications using AAL. The SAR sublayer is responsible for packing information received from the CS into cells for transmission and unpacking the information at the other end. The services provided by the ATM and AAL Layers are shown in Figure 3.12. An important characteristic of ATM traffic is its burstiness, meaning that a traffic source may generate cells at a near-peak rate for a very short period of time and immediately afterwards become inactive, generating no cells. Such a bursty traffic source does not require continuous allocation of bandwidth at its peak rate. Since an ATM network supports a large number of such bursty traffic sources, statistical multiplexing can be used to gain bandwidth efficiency, allowing more traffic sources to share the bandwidth. But if a large number of traffic sources become active simultaneously, severe network congestion can result. In an ATM network, congestion control is performed by monitoring the connection usage. This is called source policing.
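One common way to realize the source policing just described is a leaky-bucket style monitor at the network entry point. The sketch below is a simplified illustration of that idea; the class, its parameter names and the exact update rule are assumptions for illustration, not the Generic Cell Rate Algorithm defined for ATM.

    # Simplified leaky-bucket monitor for policing a connection's cell arrivals.
    class LeakyBucketPolicer:
        def __init__(self, increment, limit):
            self.increment = increment   # per-cell cost, derived from the contracted rate
            self.limit = limit           # burst tolerance
            self.bucket = 0.0
            self.last_arrival = 0.0

        def conforming(self, arrival_time):
            # Drain the bucket for the time elapsed since the last cell, then charge this cell.
            self.bucket = max(0.0, self.bucket - (arrival_time - self.last_arrival))
            self.last_arrival = arrival_time
            if self.bucket + self.increment > self.limit:
                return False             # contract violation: the cell could be tagged (CLP=1) or discarded
            self.bucket += self.increment
            return True

    policer = LeakyBucketPolicer(increment=1.0, limit=3.0)   # contracted spacing of one time unit per cell
    print([policer.conforming(t) for t in (0.0, 0.1, 0.2, 0.3, 5.0)])   # [True, True, True, False, True]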
Every virtual connection (VPC or VCC) is associated with a traffic contract which defines traffic characteristics such as the peak bit rate, the mean bit rate, and the duration of the burst time. The network monitors all connections for possible contract violations. It is also a preventive control strategy. Preventive control does not wait until congestion actually occurs. It tries to prevent the network from reaching an unacceptable level of congestion by controlling traffic flow at the entry points to the network.
3.6 Peripheral Area Networks (PAN)
The current advances in computer and networking technology are changing the design of the networks that interconnect computers with their peripherals. The use of distributed computing systems allows users to transparently share and access remote computing and peripheral resources available across the network. Hence the complexity of the Local Peripheral Network (LPN) continues to grow [Cummings, 1990]. Furthermore, the increased processing power of computers has also led to a significant increase in input/output bandwidth and in the number of required channels. Applications performing intensive scientific, multimedia or database work demand an increase in the input/output bandwidth of computers and peripherals. The current input/output peripheral standards cannot meet the required input/output bandwidth. Even the cost of cabling and connections represents a significant portion of the total system cost. The specifications of these standards are as follows:
• Small Computer Systems Interface (SCSI): This interface comes with two variants - 1) a base SCSI, designed to support low end systems, with a speed of 8-16 Mbps, and 2) a differential SCSI, designed to support middle systems, which connects 8 units over a distance of 25 meters with a speed of 32 Mbps.
• Intelligent Peripheral Interface (IPI): Designed to support middle systems, IPI connects a channel to eight control units over a distance of 75 meters with a speed of 48-80 Mbps.
• IBM Block Mux (OEMI): Designed to support high end systems. A channel can connect up to 7 units over a distance of 125 meters with a speed of 24-58 Mbps.
• High Performance Parallel Interface (HIPPI): Designed to meet the needs of supercomputing applications, the HIPPI channel can deliver 800 Mbps over a 32-bit parallel bus whose length can be up to 25 meters.
• InfiniBand Architecture (IBA): IBA defines a System Area Network (SAN) for connecting multiple independent processor platforms (i.e., host processor nodes), I/O platforms, and I/O devices. The IBA SAN is a communications and management infrastructure supporting both I/O and inter-processor communications (IPC) for one or more computer systems. An IBA system can range from a small server with one processor and a few I/O devices to a massively parallel supercomputer installation with hundreds of processors and thousands of I/O devices. Furthermore, the internet protocol (IP) friendly nature of IBA allows bridging to an internet, intranet, or connection to remote computer systems.
The Fiber Channel (FC) is a new standard prepared by the ANSI X3T9.3 committee that aims at providing an efficient LPN network that can operate at speeds of gigabits per second. FC is designed to provide a general transport vehicle supporting all the existing peripheral standards mentioned above. This is achieved through the use of bridges which enable data streams from existing protocols to be supported within the FC sub-network.
In this case, the FC provides a replacement of the physical interface layer, thereby offering various benefits including improved distance and speed. In this section, we'll focus on the main features of the Fiber Channel and HIPPI standards because of their importance to the development of High Performance Distributed Systems.
3.6.1 Fiber Channel Standard
The Fiber Channel standard (FCS) has a five-layered structure to reduce the interdependency between the functional areas. This layered approach allows changes in technology to improve the implementation of one layer without affecting the design of other layers. For example, this is clearly illustrated at the FC-1 to FC-0 boundary, where the encapsulated data stream can be transmitted over a choice of multiple physical interfaces and media. The functions performed by each layer are outlined below:
• FC-4: Defines the bridges between existing channel protocols (IPI-3, SCSI, HIPPI, Block Mux, etc.) and FCS. These bridges provide a continuity of system evolution and provide a means of protecting the customer's investment in hardware and software while at the same time enabling the use of FCS capabilities.
• FC-3: Defines the set of communication services which is common across all nodes. These services are available to all protocol bridges defined in the FC-4 layer.
• FC-2: Defines the single frame protocol on which FCS communication is based. It also defines the control and data functions which are contained within the frame format.
• FC-1: Defines the encoding and decoding scheme which is associated with the transmission frame stream. It specifies the special transmission sequences which are required to enable communication between the physical interfaces.
• FC-0: Defines the physical interface that supports the transmission of data through the FC network. This includes specifications for the fiber, connections, and transceivers. These specifications are based on a variety of media, each designed to meet a range of users, from low to high end implementations.
3.6.2 Infiniband Architecture
IBA defines a switched communications fabric allowing many devices to communicate concurrently with high bandwidth and low latency in a protected, remotely managed environment. An endnode can communicate over multiple IBA ports and can utilize multiple paths through the IBA fabric. The multiplicity of IBA ports and paths through the network is exploited for both fault tolerance and increased data transfer bandwidth. IBA hardware off-loads from the CPU much of the I/O communications operation. This allows multiple concurrent communications without the traditional overhead associated with communication protocols. The IBA SAN provides its I/O and IPC clients zero processor-copy data transfers, with no kernel involvement, and uses hardware to provide highly reliable, fault tolerant communications.
Figure 3.14 Infiniband Architecture
System Area Network
An IBA System Area Network consists of processor nodes and I/O units connected through an IBA fabric made up of cascaded switches and routers. IBA handles the data communications for I/O and IPC in a multi-computer environment. It supports the high bandwidth and scalability required for I/O. It caters to the extremely low latency and low CPU overhead required for IPC. With IBA, the OS can provide its clients with communication mechanisms that bypass the OS kernel and directly access IBA network communication hardware, enabling efficient message passing operation.
IBA is well suited to the latest computing models and will be a building block for new forms of I/O and cluster communication. IBA allows I/O units to communicate among themselves and with any or all of the processor nodes in a system. Thus an I/O unit has the same communications capability as any processor node. An IBA network is subdivided into subnets interconnected by routers as illustrated in Figure 3.15. Endnodes may attach to a single subnet or multiple subnets. Figure 3.15 IBA Network Components An IBA subnet is composed of endnodes, switches, routers, and subnet managers interconnected by links. Each IBT device may attach to a single switch or multiple switches and/or directly with each other. The semantic interface between the message, data service and the adapter is referred to as IBA verbs. Verbs describe the functions necessary to configure, manage, and operate a host channel adapter. These verbs identify the appropriate parameters that need to be included for each particular function. Verbs are not an API, but provide the framework for the OSV to specify the API. IBA is architrected as a first order network and as such it defines the host behavior (verbs) and defines memory operation such that the channel adapter can be located as close to the memory complex as possible. It provides independent direct access between consenting consumers regardless of whether those consumers are I/O drivers and I/O controllers or software processes communicating on a peer to peer basis. IBA provides both channel semantics (send and receive) and direct memory access with a level of protection that prevents access by non participating consumers. The foundation of IBA operation is the ability of a consumer to queue up a set of instructions that the hardware executes. This facility is referred to as a work queue. Work queues are always created in pairs, called a Queue Pair (QP), one for send operations and one for receive operations as shown in Figure 3.16. In general, the send work queue holds instructions that cause data to be transferred between the consumer’s memory and another consumer’s memory, and the receive work queue holds instructions about where to place data that is received from another consumer. The other consumer is referred to as a remote consumer even though it might be located on the same node. Figure 3.16 Consumer Queuing Model The architecture provides a number of IBA transactions that a consumer can use to execute a transaction with a remote consumer. The consumer posts work queue elements (WQE) to the QP and the channel adapter interprets each WQE to perform the operation. For Send Queue operations, the channel adapter interprets the WQE, creates a request message, segments the message into multiple packets if necessary, adds the appropriate routing headers, and sends the packet out the appropriate port. The port logic transmits the packet over the link where switches and routers relay the packet through the fabric to the destination. When the destination receives a packet, the port logic validates the integrity of the packet. The channel adapter associates the received packet with a particular QP and uses the context of that QP to process the packet and execute the operation. If necessary, the channel adapter creates a response (acknowledgment) message and sends that message back to the originator. Reception of certain request messages cause the channel adapter to consume a WQE from the receive queue. 
When a receive WQE is consumed in this way, a completion queue entry (CQE) corresponding to the consumed WQE is placed on the appropriate completion queue, which causes a work completion to be issued to the consumer that owns the QP. The devices in an IBA system are classified as switches, routers, channel adapters, repeaters, and the links that interconnect them. The management infrastructure includes subnet managers and general service agents. IBA provides Queue Pairs (QPs). The QP is the virtual interface that the hardware provides to an IBA consumer, and it provides a virtual communication port for the consumer. The architecture supports up to 2^24 (about 16 million) QPs per channel adapter, and the operation on each QP is independent of the others. Each QP provides a high degree of isolation and protection from other QP operations and other consumers. Thus a QP can be considered a private resource assigned to a single consumer. The consumer creates this virtual communication port by allocating a QP and specifying its class of service. IBA supports the services shown in Table 3.2.

Table 3.2 Service Types supported by QP

Discussion
Gigabit networks represent a change in kind, not just degree. Substantial progress must be made in the areas of protocols, high-speed computer interfaces and networking equipment. The challenge of the 90's will be resolving these problems as well as providing the means for disparate networking approaches to communicate with each other smoothly and efficiently. Ultimately, however, it seems that we are moving towards a single, public networking environment built on an infrastructure of gigabit-speed fiber optic links, most probably defined at the physical layer by an FDDI or SONET standard. The ATM protocol supports multimedia information such as voice, video, and data in one integrated networking environment. InfiniBand supports connections among many nodes and processors spanning many networks and serves applications requiring different QoS.

3.7 Wide Area Networks (WANs)
WANs are built to provide communication solutions for organizations or people who need to exchange digital information between two distant places. Since the distance is large, the local telecommunication company is involved; in fact, WANs are usually maintained by the country's public telecommunication companies (PTTs, such as AT&T and Sprint), which offer different communication services. The main purpose of a WAN is to provide reliable, fast and safe communication between two or more places (nodes) with low delays and at low prices. WANs enable an organization to have one integral network between all its departments and offices, even if they are not all in the same building or city, providing communication between the organization and the rest of the world. In principle, this task is accomplished by connecting the organization (and all the other organizations) to the network nodes by different types of communication strategies and applications. Since WANs are usually developed by the PTT of each country, their development is influenced by each PTT's own strategies and policies. The basic WAN service that PTTs have offered for many years is the leased line. A leased line is a point-to-point connection between two places, implemented by different transmission media (usually through PSTN trunks), which creates one link between its nodes. An organization whose network is based on such lines has to connect each pair of offices with its own line, meaning that each office needs as many lines as the number of offices it communicates with (a counting example follows below).
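To see how quickly the leased-line approach grows, note that fully interconnecting n offices with dedicated point-to-point lines requires one line per pair of offices, a standard counting argument:

```latex
% Leased lines needed to fully interconnect n offices with
% dedicated point-to-point links (one line per pair of offices):
\[
  L(n) \;=\; \binom{n}{2} \;=\; \frac{n(n-1)}{2},
  \qquad \text{e.g. } L(5) = 10, \quad L(20) = 190.
\]
```

This quadratic growth in lines (and tariffs) is one of the reasons the packet-switched WANs described next became attractive.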
The Packet Switched WAN appeared in the 1960s and defined the basis for all communication networks today. The principle in a Packet Switched Data Network (PSDN) is that the data between the nodes is transferred in small packets. This principle enables the PSDN to allow one node to be connected to more than one other node through one physical connection. That way, a fully connected network between several nodes can be obtained by connecting each node through one physical link. Another advantage of packet switching was the efficient use of resources by sharing the network bandwidth among the users (instead of dividing it). The first packet-switched communication networks were based on the X.25 packet switching protocol. X.25 networks became the de facto standard for non-permanent data communication and were adopted by most PTTs. X.25 networks enabled cheaper communication, since their tariff was based on the communication time and the amount of data transferred. X.25 networks used the PTTs' transmission networks more efficiently, since the bandwidth was released at the end of the connection or when no data was transmitted. Another advantage of X.25 was that it allowed easy implementation of international connections, enabling organizations to be connected to data centers and services throughout the world. By the 1980s, X.25 networks were the main international channel for commercial data communication. Today, to meet high-speed demands, WANs rely on technologies such as ATM (B-ISDN), Frame Relay, SONET and SDH. We have already discussed ATM, SONET and SDH in the previous section on MANs.

3.8 Wireless LANs
3.8.1 Introduction
Wireless LANs provide many convenient facilities that are not provided by traditional LANs, such as mobility, relocation, ad hoc networking and coverage of locations that are difficult to wire. Wireless LANs were not of much practical use until the recent past, due to many technological and economic reasons such as high prices, low bandwidth, transmission power requirements, infrastructure and licensing. These concerns have been adequately addressed over the last few years; the popularity of wireless LANs is increasing rapidly, and a new standard, namely IEEE 802.11, attempts to standardize these efforts.

3.8.2 IEEE 802.11
IEEE 802.11 defines a number of services that wireless LANs are required to provide in order to offer functionality equivalent to that of wired LANs. The services specified in this standard are:
1. Association: Establishes an initial association between a station and an access point. A station's identity and address must be known and confirmed before the station can start transmitting.
2. Re-association: Enables an established association to be transferred from one access point to another, allowing a mobile station to move from one cell to another.
3. Disassociation: A notification from either a station or an access point that an existing association is terminated. A mobile station has to notify the access point before it shuts down; however, access points have a capability to protect themselves against stations that shut down without any notification.
4. Authentication: Used to establish the identity of the stations to each other. In a wired LAN, the physical connection always conveys the identity of the other station. Here, an authentication scheme has to be used to establish the proper identity of the stations. Though the standard does not specify any particular authentication scheme, the methods used can range from relatively insecure handshaking to a public-key encryption scheme.
5. Privacy: Used to prevent broadcast messages from being read by users other than the intended recipients. The standard provides for an optional use of encryption to provide a high level of privacy.
The 802.11 standard also specifies three kinds of physical media for wireless LANs:
- Infrared at 1 and 2 Mbps, operating at a wavelength between 850 and 950 nm
- Direct-sequence spread spectrum operating in the 2.4-GHz ISM band; up to seven channels, each with data rates of 1 or 2 Mbps, can be used
- Frequency-hopping spread spectrum operating in the 2.4-GHz ISM band

3.8.3 Classification
Wireless LANs are classified according to the transmission techniques used, and all current wireless LAN products fall into one of the three following categories:
1. Infrared LANs: In this case, an individual cell of an IR LAN is limited to a very small area, such as a single room. This is because infrared rays cannot penetrate opaque walls.
2. Spread Spectrum LANs: These make use of spread-spectrum technology for transmission. Most of the networks in this category operate in bands where no licensing is required from the FCC.
3. Narrowband Microwave: These LANs operate at very high microwave frequencies, and FCC licensing is required.

3.8.4 Applications of Wireless LAN Technology
1. Nomadic Access
Nomadic access refers to the wireless link between a LAN hub (access point) and a mobile terminal (a notebook or laptop) equipped with an antenna. This gives access to users who are on the move and wish to access the main hub from different locations.
2. Ad-Hoc Networking
An ad-hoc network is a network set up temporarily to meet some immediate need, such as conferences and demonstrations. No infrastructure is required for an ad-hoc network, and with the help of wireless technologies, a collection of users within range of each other may dynamically configure themselves into a temporary network.
3. LAN Extension
A wireless LAN saves the cost of installing LAN cabling and eases the task of relocation and other modifications and extensions to the existing network structure.
4. Cross-Building Interconnect
Point-to-point wireless links can be used to connect LANs in nearby buildings, independent of whether the buildings themselves use wired or wireless LANs. Though this is not a typical use of wireless LANs, it is usually included as an application for the sake of completeness.

Summary
Computer networks play an important role in the design of high performance distributed systems. We have classified computer networks into four types: Peripheral Area Networks, Local Area Networks, Metropolitan Area Networks and Wide Area Networks. For each class, we discussed the main network technologies that can be used to build high performance distributed systems. These networks include HIPPI, Fiber Channel, FDDI, DQDB, and ATM. Gigabit networks represent a change in kind, not just degree. Substantial progress must be made in the areas of protocols, high speed computer interfaces and networking equipment before they can be widely used. The challenge of the 90's will be resolving these problems as well as providing the means for disparate networking approaches to communicate with each other smoothly and efficiently. Ultimately, however, it seems that we are moving toward a single, public networking environment built on an infrastructure of gigabit-speed fiber optic links, most probably defined at the physical layer by a merged Fiber Channel/SONET standard.
ATM protocols will become a popular format for sharing multimedia voice, video, and data in one integrated networking environment. Further, ATM has the potential to implement all classes of computer networks and provide the required current and future communication services.

Problems
1. In some cases FDDI synchronous transmission causes a "glitch" in an audio or video signal that must be transmitted at a periodic rate. Describe a scenario where this glitch can occur and suggest a solution to this problem.
2. Suppose source computer A wants to send a file of size 10 Kbytes to destination computer B, and communication has to take place over a HIPPI channel. Explain how this exchange would take place, with regard to data framing and physical layer control signal sequencing.
3. Discuss the following: the appropriate place of the FDDI protocol in the OSI reference model; the use of claim frames in the ring initialization process; and the guarantee of maximum delay on the ring provided by the TTRT (target token rotation time) used by the FDDI protocol.
4. How can the Target Token Rotation Time affect the performance of an FDDI network?
5. What are the advantages of using copper as the transmission medium for FDDI?
6. What are the differences between FDDI-II and FDDI? How is the former an improvement over the latter?
7. What is the most cost-efficient configuration for a large FDDI network? Why?
8. What are the differences between the Ethernet, FDDI and Token Ring standards for forming Local Area Networks?
9. The DQDB MAC protocol is biased towards the nodes that are close to the slot generator. Explain this scenario and describe one technique to make the DQDB protocol more fair.
10. ATM mainly uses virtual connections for establishing a communication path between nodes. Describe the relationship between virtual connection, virtual path, and virtual channel. Based on this relationship, how are the VPI and VCI identifiers used in performing cell switching in ATM switches? Do you think the sizes of the VPI and VCI fields are large enough to hold the required switching information?
11. What are the advantages of having a large packet size? What are the advantages of having a small packet size? Why does ATM prefer smaller packet sizes?
12. What are the different classes supported by the ATM Adaptation Layer? On what basis are the classes divided, and how?
13. ISDN offers three main services. Describe these services and their applications to real-life examples.
14. What is SONET? Describe its capabilities and limitations.
15. What are the main characteristics of HIPPI? Discuss the functions of HIPPI-FP (framing protocol) and HIPPI-LE (link encapsulation).

References
1. Tanenbaum, A. S., "Computer Networks", Prentice Hall, 1988.
2. Stallings, W., "Local & Metropolitan Area Networks", Macmillan, 1995.
3. Nim K. Cheung, "The Infrastructure for Gigabit Computer Networks", IEEE Communications Magazine, April 1992, page 60.
4. F. E. Ross, "An Overview of FDDI: The Fiber Distributed Data Interface", IEEE Journal on Selected Areas in Communications, pp. 1043-1051, September 1989.
5. CCITT COM XVIII-R1-E, February 1985.
6. J.-Y. Le Boudec, "The Asynchronous Transfer Mode: A Tutorial", Computer Networks and ISDN Systems 24 (1992), North-Holland, pp. 279-309.
7. Spragins, J. D., Hammond, J. L., and Pawlikowski, K., "Telecommunications Protocols and Design", Addison Wesley, 1991.
8. Brett Glass, "The Light at The End of The LAN", BYTE, pp. 269-274, July 1989.
9. Floyd Ross, "FDDI - Fiber, Farther, Faster", Proceedings of the 5th Annual Conference on Local Computer Networks, April 8-10, 1986.
10. Marjory Johnson, "Reliability Mechanisms of the FDDI High Bandwidth Token Ring Protocol", Proceedings of the 10th Annual Conference on Local Computer Networks, October 1985.
11. Michael Teener, "FDDI-II Operation and Architectures", Proceedings of the 14th Annual Conference on Local Computer Networks, October 1989.
12. William Stallings, "FDDI Speaks", BYTE, Vol. 18, No. 4, April 1993, page 199.
13. ANSI X3.148-1988, American National Standards Institute, Inc., "Fiber Distributed Data Interface (FDDI) - Token Ring Physical Layer Protocol (PHY)".
14. ANSI X3.139-1987, American National Standards Institute, Inc., "Fiber Distributed Data Interface (FDDI) - Token Ring Media Access Control (MAC)".
15. M. Tangemann and K. Sauer, "Performance Analysis of the Timed Token Protocol of FDDI and FDDI-II", IEEE Journal on Selected Areas in Communications, Vol. 9, No. 2, February 1991.
16. Saunders, "FDDI Over Copper: Which Cable Works Best?", Data Communications, November 21, 1991.
17. Wilson, "FDDI Chips Struggle toward the Desktop", Computer Design, February 1991.
18. J. Strohl, "High Performance Distributed Computing in FDDI Networks", IEEE LTS, May 1991.
19. Stallings, "Handbook of Computer-Communications Standards", Vol. 2, Local Network Standards, Macmillan Publishing Company, 1987.
20. William Stallings, "Networking Standards: A Guide to OSI, ISDN, LAN, and MAN Standards", Addison-Wesley, 1993.
21. Jain, "Performance Analysis of FDDI Token Ring Networks: Effect of Parameters and Guidelines for Setting TTRT", IEEE LTS, May 1991.
22. Fdida and H. Santoso, "Approximate Performance Model and Fairness Condition of the DQDB Protocol", in High-Capacity Local and Metropolitan Area Networks, edited by G. Pujolle, NATO ASI Series, Vol. F72, pp. 267-283, Springer-Verlag, Berlin Heidelberg, 1991.
23. Mark J. Karol and Richard D. Gitlin, "High Performance Optical Local and Metropolitan Area Networks: Enhancements of FDDI and IEEE 802.6 DQDB", IEEE Journal on Selected Areas in Communications, Vol. 8, No. 8, October 1990, pp. 1439-1448.
24. Marco Conti, Enrico Gregori, and Luciano Lenzini, "A Methodology Approach to an Extensive Analysis of DQDB Performance and Fairness", IEEE Journal on Selected Areas in Communications, Vol. 9, No. 1, January 1991, pp. 76-87.
25. Fiber Channel Standard, XT/91-062.
26. American National Standard for Information Systems, Fiber Channel: Fabric Requirements (FC-FG), FC-GS-92-001/R1.5, May 1992.
27. Cummings, "New Era Dawns for Peripheral Channels", Laser Focus World, September 1990, pp. 165-174.
28. Cummings, "Fiber Channel - the Next Standard Peripheral Interface and More", FDDI, Campus-wide and Metropolitan Area Networks, SPIE Vol. 1362, 1990, pp. 170-177.
29. J. Cypser, "Communications for Cooperating Systems: OSI, SNA, and TCP/IP", Addison-Wesley, Reading, 1991.
30. Doorn, "Fiber Channel Communications: An Open, Flexible Approach to Technology", High Speed Fiber Networks and Channels, SPIE Vol. 136, 1991, pp. 207-215.
31. Savage, "Impact of HIPPI and Fiber Channel Standards on Data Delivery", Tenth IEEE Symposium on Mass Storage, May 1990, pp. 186-187.
32. American National Standard for Information Systems, Fiber Channel: Physical and Signaling Interface (FC-PH), FC-P-92-001/R2.25, May 1992.
33. Martin De Prycker, "Asynchronous Transfer Mode: Solution for Broadband ISDN", 2nd edition, Ellis Horwood, 1993.
34. ITU-T 1992 Recommendation I.362, "B-ISDN ATM Adaptation Layer (AAL) Functional Description", Study Group XVIII, Geneva, June 1992.
35. ITU-T 1992 Recommendation I.363, "B-ISDN ATM Adaptation Layer (AAL) Specification", Study Group XVIII, Geneva, June 1992.
36. Bae et al., "Survey of Traffic Control Schemes and Protocols in ATM Networks", Proceedings of the IEEE, February 1991, pp. 170-188.
37. G. Gallassi, G. Rigolio, and L. Fratta, "ATM: Bandwidth Assignment and Bandwidth Enforcement Policies", Proceedings of IEEE GLOBECOM 1989, pp. 49.6.1-49.6.6.
38. Jacquet and Muhlethaler, "An Analytical Model for the High Speed Protocol DQDB", High Capacity Local and Metropolitan Area Networks, Springer-Verlag, 1990.
39. Bertsekas and Gallager, "Data Networks", Prentice Hall, 1995.
40. Pahlavan, K. and Levesque, A., "Wireless Information Networks", New York, Wiley, 1995.
41. Davis, P. T. and McGuffin, C. R., "Wireless Local Area Networks - Technology, Issues and Strategies", McGraw Hill, 1995.
42. InfiniBand Architecture Specification, Volume 1, Release 1.1, InfiniBand Trade Association.
43. H. J. Chao and N. Uzun, "An ATM Queue Manager Handling Multiple Delay and Loss Priorities", IEEE/ACM Transactions on Networking, Vol. 3, No. 6, December 1995.
44. C. R. Kalmanek, H. Kanakia, and S. Keshav, "Rate Controlled Servers for Very High-Speed Networks", Proceedings of IEEE GLOBECOM, 1990.
45. I. Dalgic, W. Chien, and F. A. Tobagi, "Evaluation of 10BASE-T and 100BASE-T Ethernets Carrying Video, Audio and Data Traffic", INFOCOM '94, Vol. 3, 1994, pp. 1094-1102.
46. Gopal Agrawal, Biao Chen, and Wei Zhao, "Local Synchronous Capacity Allocation Schemes for Guaranteeing Message Deadlines with the Timed Token Protocol", IEEE INFOCOM '93, Volume 1, 1993.

Chapter 4 High Speed Communication Protocols

Objective of this chapter: The main objective of this chapter is to review the basic principles of designing High Speed communication protocols. We will discuss the techniques that have been proposed to develop High Speed transport protocols and to alleviate the slow-software/fast-transmission bottleneck associated with High Speed networks.

Key Terms
QoS, TCP, IP, UDP, Active Network, SIMD, MISD, Parallelism

With the recent developments in computer communication technologies, it is now possible to develop High Speed computer networks that operate at data rates in the gigabit-per-second range. These networks have enabled high performance distributed computing and multimedia applications. The requirements of these applications are more diverse and dynamic than those of traditional data applications (e.g., file transfer). The existing implementations of standard communication protocols do not efficiently exploit the high bandwidth offered by high speed networks and consequently cannot provide applications with the high-throughput, low-latency communication services they require. Moreover, these standard protocols neither provide the flexibility needed to select the appropriate service that matches the needs of a particular application, nor do they support guaranteed Quality of Service (QoS). This has intensified the efforts to develop new high performance communication protocols that can utilize the enormous bandwidth offered by High Speed networks.

4.1 Introduction
When networks operated at several Kbps, computers had sufficient time to receive, process and store incoming packets.
However, with high speed networks operating at gigabit per second (Gbps) or even at terabit per second (Tbps) rates, the existing standard communication protocols are not able to keep up with the network speed. Protocol processing time becomes a significant portion of the total time to transmit a packet. For example, a receiving unit has 146 milliseconds (146×10^-3 seconds) to process a packet (of size 1024 bytes) when the network operates at 56 Kbps, while the same receiving unit has only 8 microseconds (8×10^-6 seconds) to process packets of the same size when the network operates at 1 Gbps [Port, 1991].
In addition to the limited time a computer has to process incoming packets transmitted at Gbps rates, new applications are emerging (e.g., interactive multimedia, visualization, and distributed computing) that require protocols to offer a wide range of communication services with varying quality of service requirements. Distributed computing increases the burden on the communication protocols with additional requirements such as providing low latency and high throughput transaction-based communications and group communication capabilities. Distributed interactive multimedia applications involve transferring video, voice, and data within the same application, where each media type has different quality of service requirements. For example, full motion video requires high throughput, low delay, and a modest error rate, whereas bulk data transfer requires high throughput and error-free transmission. The current implementations of standard transport protocols are unable to exploit high-speed networks efficiently, and furthermore cannot adequately meet the needs of the emerging network-based applications. Many researchers have been looking for ways to solve these problems. The proposed techniques followed one or more of three approaches: improve the implementation of standard transport protocols, introduce new adaptive protocols, or implement the protocols using specialized hardware. In this chapter we first present the basic functions of transport protocols such as connection management, acknowledgment, flow control and error handling, followed by a discussion of the techniques proposed to improve the latency and throughput of communication protocols.

4.2 Transport Protocol Functions
The transport layer is the fourth layer of the ISO OSI reference model. The transport layer performs many tasks such as managing the end-to-end connection between hosts, detecting and correcting errors, delivering packets in a specified order to higher layers, and providing flow control [Port, 1991]. Transport protocols are required to utilize the network communication services, whether they are reliable or not, and provide communication services that meet the quality of service requirements of the transport service users. Depending on the disparity between the network services and the required transport services, the design and implementation of the transport protocol functions vary substantially. The main functions of a transport protocol are: 1) connection management, which involves initiating and terminating a transport layer association; 2) acknowledgment of received data; 3) flow control to govern the transmission rate in order to avoid overruns at the receiver and/or to avoid congesting the network; and 4) error handling in order to detect and correct errors that occur during data transmission, if required by the transport service users.
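Before turning to the individual functions, it is worth making the per-packet time budget quoted in the introduction explicit, since it motivates most of the optimizations discussed below. The short Python fragment reproduces that arithmetic; the function name is ours, chosen only for illustration.

```python
# Per-packet processing budget = time the packet occupies the link:
# packet size in bits divided by the link rate in bits per second.
def packet_budget_seconds(packet_bytes, link_bps):
    return packet_bytes * 8 / link_bps

PACKET_BYTES = 1024  # packet size used in the example above
print(f"56 Kbps: {packet_budget_seconds(PACKET_BYTES, 56e3) * 1e3:.1f} ms per packet")
print(f"1 Gbps : {packet_budget_seconds(PACKET_BYTES, 1e9) * 1e6:.1f} microseconds per packet")
# prints roughly 146.3 ms and 8.2 microseconds, matching the figures quoted above
```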
4.2.1 Connection Management
Connection management mechanisms address the techniques that can be used for establishing, controlling and releasing the connections required to transfer data between two or more entities. They also deal with synchronizing the state of the communicating entities, where the state information is defined as the collection of data needed to represent the state of the transmission (e.g., the amount of data already received, the rate of transmission, the packets already acknowledged, etc.). Transport protocols can be classified, based on the amount of state information they maintain, into two types: connectionless and connection-oriented transport protocols. In connectionless transport protocols, data are exchanged in datagrams without the need to establish connections. The transport system pushes the datagrams into the network, which then attempts to deliver the datagrams to their destinations. There is no guarantee that datagrams will be delivered or that they will be delivered in the order of their transmission. Connectionless systems provide "best-effort" delivery. In this service, there is no state information that needs to be retained at either the sender or the receiver, and therefore the transport service is not reliable. In connection-oriented transport protocols, a connection is established between the communicating entities. Data is then transferred through the connection. The connection is released when all the data has been transmitted. During the transfer, the state of the entities is synchronized to maintain flow and error control, which ensures the reliability of the transport system. The main functions associated with connection management can be described in terms of the techniques used to achieve signaling, connection setup and release, selection of transport services, multiplexing, control information, packet formats, and buffering [Doer, 1990].

Signaling
The connection management functions can vary depending on the signaling protocol. The signaling protocol is used to set up and release connections and to exchange state information. The signaling information can be transferred in two ways: on the same connection used for data transfer (in-band signaling) or on a separate connection (out-of-band signaling). In-band signaling increases the load on the receiving unit because, for every incoming packet, it needs to determine whether the received packet is a data packet or a control packet. Out-of-band signaling separates the data and control information and sends them on different connections. Consequently, the transport protocol can support additional services (billing or security) without directly impacting the performance of the data transfer service.

Connection Setup and Release
Connection setup and release can follow two techniques: handshake (explicit) schemes or implicit (timer-based) schemes. In explicit connection setup, handshake control packets are exchanged between the source and destination nodes to establish the connection before the data is transferred. This means that there is extra overhead in terms of packet exchanges and delay. The handshaking can follow either a two-way or a three-way protocol. In addition to exchanging messages according to the handshake protocol, information about the connection must be maintained for a given period of time to ensure that both sides of a connection close reliably. In implicit connection setup, no separate packets are exchanged to establish the connection.
Instead, the connection is established implicitly by the first data packet sent, and later the connection is released upon the expiration of a timer. As in handshake schemes, state information must be maintained at both ends of the connection for a predetermined period of time after the last successful exchange of data. Determining the time period to hold state information is a difficult task, especially in networks that experience unbounded network delays.

Selecting Transport Service
The transport protocol defines several mechanisms in order to adjust its services according to the underlying network services and the applications using the transport service. Most protocols (e.g., TCP, OSI/TP4, XTP) negotiate the parameters (maximum packet size, timeout values, retry counters and buffer sizes) during the setup of a connection. Other protocols (e.g., TCP, OSI/TP4, XTP, VMTP) update some parameters (e.g., flow control parameters and sequence numbers) continuously during the data transfer phase in connection-oriented service mode. In addition to this, some transport protocols support additional operation modes (e.g., no error control mode, block error control mode) that can be selected dynamically during the data transfer phase. In general, flexibility in choosing the protocol mechanisms and parameters, and in modifying the parameters or changing the mode of operation, is an important property for supporting the emerging high performance network applications that vary significantly in their quality of service requirements.

Multiplexing
Multiplexing is defined as the ability to carry several data connections of one layer over a single connection provided by the lower layer [Doer, 1990]. Transport protocols can be classified based on their ability to map several transport connections over one single network connection. For example, TCP, OSI/TP4 and XTP support multiplexing at the transport layer. In connection-oriented networks, the virtual circuit is used to distinguish between the multiple transport connections that share the same network connection. In datagram networks, transport layer packets are identified by the association of the source and destination addresses. These packets share one single network connection. Other protocols such as Datakit and APPN [Doer, 1990] do not support multiplexing, and there is a one-to-one correspondence between transport and network connections. Multiplexing provides a cost-effective way to share the network resources, but it incurs extra overhead in demultiplexing the packets at the receiver side and in making sure the shared resources are used fairly by all the transport layer connections or packets. Furthermore, in High Speed broadband networks (e.g., ATM), the benefit of multiplexing at the transport layer is questionable.

Control Information
The transport protocol entities at both ends exchange some control information to synchronize their states, acknowledge correctly or incorrectly received packets, perform flow control, etc. The control information can all be packed into one control message that is transmitted during the exchange of control state information. Another approach is to use separate messages to pass information about some control state variables, or to put them in the packet headers. Some protocols exchange control information periodically, independent of other events related to the transport connection. This significantly simplifies the error recovery scheme and facilitates parallel processing.
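As a concrete illustration of the last point, the sketch below models a receiver that emits its complete control state on a timer, independently of packet arrivals. The class and field names are invented for this example and do not correspond to any particular protocol; it is meant only to show how periodic, timer-driven state reports decouple control traffic from data events.

```python
import threading
import time

class PeriodicStateReporter:
    """Toy receiver that reports its complete control state on a fixed
    timer, independently of when data packets arrive -- the periodic
    state-exchange idea described above, not any real protocol."""

    def __init__(self, report_interval=0.5, window=64 * 1024):
        self.highest_in_order = 0        # highest byte received in sequence
        self.window = window             # bytes the receiver can still accept
        self.report_interval = report_interval

    def on_data(self, seq, length):
        if seq == self.highest_in_order:          # in-order arrival
            self.highest_in_order += length
            self.window -= length

    def control_state(self):
        # the complete state a periodic control packet would carry
        return {"ack": self.highest_in_order, "window": self.window}

    def report_loop(self, duration=2.0):
        deadline = time.time() + duration
        while time.time() < deadline:             # timer-driven, not event-driven
            print("control packet:", self.control_state())
            time.sleep(self.report_interval)

rx = PeriodicStateReporter()
reporter = threading.Thread(target=rx.report_loop, daemon=True)
reporter.start()
for seq in range(0, 4096, 1024):                  # data arrives while reports tick
    rx.on_data(seq, 1024)
    time.sleep(0.3)
reporter.join()
```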
Packet Formats
The packet format defines the layout of the packet fields that carry control information and user data. The size and the alignment of packet fields have a significant impact on the performance of processing these packets. In the 1970s, when networks were slow, the main bottleneck was the packet transmission delay. That led to the use of complex packet formats with variable field lengths in order to reduce the amount of information transmitted in the network. This increased the complexity of packet processing, which was then acceptable because computers had plenty of time to decode and process these packets. However, with high speed networks, care must be taken to use fixed-size fields and to allocate them properly in order to reduce packet decoding time and allow parallel processing of the packet fields. For example, the XTP protocol makes most protocol header fields 32 bits wide. Furthermore, the placement of the fields can significantly impact the performance of packet processing. One can put the information that defines the packet in the header to simplify decoding and multiplexing, whereas the checksum field should be placed after the data being protected so that it can be computed in parallel with data transmission or reception.

Buffer Management
Buffering at the transport layer involves writing data to a buffer as it is received from the network, formatting the data for the user by sorting packets and separating data from headers, moving the data to a space accessible by the user, and populating buffers for outgoing packets. In general, protocol implementations should minimize the number of memory read and write instructions, since memory operations consume a large percentage of the total processing time [Clar, 1989]. One optimization is to calculate the checksum as the data is being transferred from the network to the user space. This requires that the CPU move the data, rather than a DMA controller. This optimization has been shown to reduce the processing time required to calculate the checksum and perform the memory-to-memory copy by 25% [Clar, 1989].

4.2.2 Acknowledgment
Acknowledgment is used to confirm the successful receipt of user data and can be sent either as explicit control messages or within the header of a transmitted message. For example, in request-response protocols (e.g., VMTP), the response to each request is used as an acknowledgment of that request. In general, acknowledgments are generated at the receiver in response to explicit events (sender-dependent acknowledgment) or implicit events (sender-independent acknowledgment). In the sender-dependent acknowledgment scheme, the receiver generates an acknowledgment either after each received packet or after a certain number of packets, depending on the availability of resources and how acknowledgments are used in the flow and error control schemes, as will be discussed in the next two subsections. In the sender-independent acknowledgment scheme, the receiver periodically acknowledges received messages under timer control. This scheme simplifies the receiver's tasks since it eliminates the processing time required to determine whether or not acknowledgments need to be generated when messages arrive at the receiver side.

4.2.3 Flow Control
Flow control mechanisms are needed to ensure that the receiving entity, within its resources and capabilities, is able to receive and handle the incoming packets.
Flow control plays an important role in high speed networks because packets could be transmitted at rates that overwhelm a receiving node. The node has no option except to discard packets, which consequently leads to a severe degradation in performance. The transmission rate must be limited by the sustainable network transmission rate and by the rate at which the transport protocol can receive, process, and forward data to its users. Exceeding either of these two rates will lead to congestion in the network or to data loss caused by overrunning the receiver. The main goal of the flow control scheme is to ensure that these restrictions on the transmission rate are not violated. The restriction imposed by the receiver can be met by enforcing an end-to-end flow control scheme, whereas the network transmission rate restriction can be met by controlling the access to the network at the network layer interface. The access control used by network nodes to protect their resources against congestion can be either explicit or implicit. In an explicit scheme, as in the XTP protocol, the transmitter initially uses default parameters for its transmission. Later, the receiver modifies the flow control parameters. Furthermore, the flow control parameters can also be modified by the network nodes in order to protect themselves against congestion. In other protocols, an implicit network access scheme is used: round trip delays are taken as an indication of network congestion. When this delay exceeds a certain limit, the transmitter significantly reduces its transmission rate. The transmission rate is then increased slowly back to its normal transmission rate according to an adaptive algorithm (e.g., the slow-start algorithm in the TCP protocol). The two main methods for achieving end-to-end flow control are window flow control and rate flow control. In window flow control, the receiver specifies the maximum amount of data that it can receive at one time, until the next update of the window size. The window size is based on many factors such as the available buffer size at the receiver and the maximum round trip network delay. The transmitter stops sending packets when the number of outstanding (unacknowledged) packets is equal to the window size. The window size is updated with the receipt of acknowledgments from the receiver, which consequently allows the transmitter to resume its transmission of packets. Window updates can be either cumulative (the window update value is added to the current window) or absolute (the window update specifies a new window from a reference point in the packet number space). In the absolute window update scheme, it is possible for the receiver to reduce the size of the window whenever required by the flow control scheme. In rate control, timers are used to control the transmission rate. The transmitter needs to know the transmission rate and the burst size (the maximum amount of data that may be sent in a burst of packets). The transmitter sends data based on state variables that specify the rate at which the data can be transmitted. A variant of the rate control scheme is to use the interpacket time (the minimum amount of time between two successive packet transmissions) instead of the transmission rate and the burst size. Some protocols (e.g., XTP and NETBLT) combine the two schemes: they initially use rate control for transmission as long as the window remains open, and switch to the window-based scheme once the window is closed.
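A minimal sketch of rate-based pacing, using the interpacket-time variant described above, is shown below. The function and parameter names are ours, and the pacing is deliberately simplistic (it ignores the time spent actually emitting the burst); it is meant only to illustrate how a transmission rate and a burst size translate into pauses between bursts.

```python
import time

def rate_paced_send(packets, rate_bps, burst_packets, packet_bytes=1024):
    """Toy rate-based sender: emit at most `burst_packets` back to back,
    then pause long enough that the long-run rate stays near `rate_bps`.
    The interpacket time is packet_bytes * 8 / rate_bps seconds."""
    interpacket = packet_bytes * 8 / rate_bps
    sent_in_burst = 0
    for i, _ in enumerate(packets):
        print(f"send packet {i}")
        sent_in_burst += 1
        if sent_in_burst == burst_packets:           # burst credit exhausted:
            time.sleep(interpacket * burst_packets)  # pause to honor the rate
            sent_in_burst = 0

# Pace 10 dummy 1-Kbyte packets at 80 kbit/s in bursts of two packets.
rate_paced_send([b"x" * 1024] * 10, rate_bps=80e3, burst_packets=2)
```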
Rate control is believed to be more suitable for high speed networks: it has low overhead, since it exchanges little control information (only when the transmission rate needs to be adjusted), the rate can be adjusted to match the network speed, and it is independent of the acknowledgment scheme used in the protocol [Doer, 1990].

4.2.4 Error Handling
Error handling mechanisms are required to recover from lost, damaged or duplicated data. If the underlying network does not meet the transport user's requirement of reliable data transmission, then the transport protocol must provide error detection, error reporting once errors are detected, and error correction.
Error Detection: Sequence numbers, length fields, and checksums can be used for error detection. Sequence numbers are used to detect missing, duplicate, and out-of-sequence packets. These sequence numbers can be either packet-based (as in OSI/TP4, NETBLT, VMTP) or byte-based (as in TCP, XTP). A length field can be used to verify the complete delivery of data; it is placed either in the header or in the trailer. The VMTP protocol uses a slightly different scheme to indicate the length of the packet: it uses a bitmap to indicate the group of fixed-size blocks that are present in any given packet. Checksums are used at the receiver to detect packets corrupted during transmission. Checksums can be applied to the entire packet or only to headers or trailers. For example, TCP checksums the whole transport protocol data unit (TPDU), while XTP, VMTP and NETBLT apply the checksum to the header, with an optional data checksum.
Error Reporting: To speed up recovery from errors, the transmitter should be notified once errors are detected. The error reporting techniques include negative or positive acknowledgment and selective reject. A negative acknowledgment (NACK) is used to identify the point from which data is lost or missing. In a selective reject, more information is provided to indicate all the data that the receiver is missing.
Error Correction: Error recovery in most protocols is done by retransmission of the erroneous data. However, protocols differ in the technique used to trigger retransmission (such as timeouts or error reports) and in the amount of information kept at the receiver side; some protocols discard out-of-sequence data, while others temporarily buffer it until the missing data arrives so that the receiver can perform resequencing. Two schemes are normally used in error correction: Positive Acknowledgment with Retransmission (PAR) and Automatic Repeat Request (ARQ). In PAR, the receiving entity acknowledges only correct data. Damaged and lost data will not be reported. Instead, the transmitter will time out and retransmit the unacknowledged data using a go-back-n scheme. In go-back-n, all packets starting from the unacknowledged (lost or corrupted) packet are retransmitted. Other protocols rely on the information provided by error reporting and can thus perform selective retransmission. This method can be used to recover from any error type and simplifies the receiving entity's tasks. In Automatic Repeat Request (ARQ), the receiving entity informs the transmitter of the status of the received data, and based on this information the transmitter determines which data needs to be retransmitted. Two types of retransmission can be used: go-back-n or selective retransmission.
In selective retransmission, only the lost (or damaged) data is retransmitted. The receiving entity must buffer all packets that are received after the damaged (or lost) data and then re-sequence them after it receives the retransmitted data. With the go-back-n scheme, the transmitter begins retransmission of data from the point where the error is first detected and reported. The advantage of the go-back-n scheme is that the resulting protocol is simple, since the receiver sends only a NACK when it detects errors or missing data and does not have to buffer out-of-sequence data. However, the main disadvantage of this scheme is the bandwidth wasted in retransmitting successfully received data, because of the gap between what has been received correctly at the receiver side and what has been transmitted at the sender side. As will be discussed later, the cost of this scheme can be very significant in high speed networks with large propagation delays; during the propagation delay period, the sender can transmit a huge amount of data that is still in flight in the network. Retransmission of successfully received data can be avoided by using selective retransmission. This scheme is effective if the receiver has enough buffer memory to store data corresponding to two or three times the bandwidth-delay product [Bux, 1988]. However, this scheme is not effective if the network error rate is high.

4.3 Standard Transport Protocols
The most commonly used transport protocols include the TCP/IP suite and the OSI ISO TP4 protocol. The OSI transport protocol [Tann, 1988] became an ISO international standard in 1984 after a number of years of intense activity and development. To handle different types of data transfers and the wide variety of networks that might be available to provide network services, five classes have been defined for the ISO TP protocol. These are labeled classes 0, 1, 2, 3, and 4. Class 0 provides service for teletex terminals. Classes 2, 3, and 4 are successively more complex, providing more functions to meet specified service requirements or to overcome problems and errors that might be expected to arise as the network connections become less reliable. The Transmission Control Protocol (TCP) [Tann, 1988; Jain, 1990a], originally developed for use with the ARPANET during the late 1960s and the 1970s, has been adopted as the transport protocol standard by the U.S. Department of Defense (DoD). Over the years it has become a de facto standard in much of the U.S. university community after being incorporated into BSD Unix. The Internet Protocol (IP) was designed to support the interconnection of networks using an Internet datagram service, and it has been designated as a DoD standard. Figure 4-1 shows the TCP/IP protocol suite and its relationship to the OSI reference model.

Figure 4-1. The TCP/IP protocol suite and the OSI reference model (the OSI application, presentation, and session layers correspond to the application process; the transport layer to TCP; the network layer to IP; and the data link and physical layers to the communication network).

4.3.1 Internet Protocol (IP)
The ARPANET network layer provides a connectionless and unreliable delivery system. It is based on the idea of Internet datagrams that are transparently transported from the source host to the destination host, possibly traversing several networks. The IP layer is unreliable because it does not guarantee that IP datagrams ever get delivered or that they are delivered correctly.
Reliable transmission can be provided using the services of the upper layers. The IP protocol breaks up messages from the transport layer into datagrams or packets of up to 64 Kbytes. Each packet is then transmitted through the Internet, possibly being fragmented into smaller units. When all the pieces arrive at the destination machine, they are reassembled to reconstruct the original message before delivery to the transport layer. An IP packet consists of a header part and a data part. The header has a 20-byte fixed part and a variable-length optional part. The header format is shown in Figure 4-2.

Figure 4-2. The Internet Protocol (IP) header (version, IHL, type of service, total length, identification, flags, fragment offset, time to live, protocol, header checksum, source address, destination address, options, padding).

The fields of the IP packet are described below:
Version: This field indicates the protocol version used with the packet.
IHL: The Internet header length, needed since the header length is not fixed.
Type of service: Packets can request special processing with various combinations of reliability and speed. For example, for video data, high throughput with low delay is far more important than receiving correct data with low throughput, while for text file transfer, correct transmission is more important than speedy delivery.
Total length: This 16-bit field indicates the length of the packet, including both header and data. It allows packets to be up to 65,535 bytes in size.
Identification: When a packet is fragmented, this field allows the destination host to reassemble the fragments as in the original packet. All the fragments of a packet contain the same identification value.
Flags: A 3-bit field used to allow or prevent gateways from fragmenting the packet.
Fragment offset: When a gateway fragments a packet, the gateway sets the flags field for each fragment except the last fragment and updates the fragment offset field to indicate the fragment's position in the packet. All fragments except the last one must be a multiple of 8 bytes.
Time to live (TTL): The 8-bit TTL field is a counter used to limit packet lifetimes. Since different packets can reach the destination host through different routes, this field prevents packets from looping in the network indefinitely.
Protocol: This field identifies the type of data and is used to demultiplex the packet to the higher level protocol software.
Header checksum: This field provides error detection for only the header portion of the packet. It is useful since the header may change (for example, when a packet is fragmented). It uses 16-bit arithmetic to compute the one's complement of the sum of the header.
Source address and Destination address: Every protocol suite defines some type of addressing that identifies networks and hosts. As illustrated in Figure 4-3, an Internet address occupies 32 bits and indicates both a network ID and a host ID. It is represented as four decimal numbers separated by dots (e.g., 128.32.1.11). There are four different formats.

Figure 4-3. Internet address formats (class A: 7-bit netid, 24-bit hostid; class B: 14-bit netid, 16-bit hostid; class C: 21-bit netid, 8-bit hostid; class D: 28-bit multicast address).

If a network has a large number of hosts, a class A address can be used, because it assigns 24 bits to the host ID and can thus accommodate up to about 16 million hosts. Class C addresses allow many more networks but far fewer hosts per network.
Class D addresses are reserved for multicast. Any organization with an Internet address of any class can subdivide the available host ID space to provide subnetworks (see Figure 4-4). This adds a subnetworking level to the Internet address hierarchy: a subnetid and a hostid within the hostid field of the higher level of the hierarchy.

Figure 4-4. Class B Internet address with subnetting (the 16-bit hostid is split into an 8-bit subnetid and an 8-bit hostid).

4.3.2 Transmission Control Protocol (TCP)
The Transmission Control Protocol provides a connection-oriented, reliable, full-duplex, byte-stream transport service to applications. A TCP transport entity accepts arbitrarily long messages from the upper layer (user processes), breaks them up into pieces (segments) not exceeding 64 Kbytes, and sends each piece as a separate datagram. Since the IP layer provides an unreliable, connectionless delivery service, TCP has the logic necessary to provide a reliable virtual circuit for a user process. TCP manages connections, sequencing of data, end-to-end reliability, and flow control. Each segment contains a sequence number that identifies the data carried in the segment. Upon receipt of a segment, the receiving TCP returns an acknowledgment to the sender. If the sender receives an acknowledgment within a time-out period, the sender transmits the next segment. If not, the sender retransmits the segment, assuming that the data was lost. TCP has only one Transport Protocol Data Unit (TPDU) header format, which is at least 20 bytes; it is shown in Figure 4-5.

Figure 4-5. TCP data unit format (source port, destination port, sequence number, piggyback acknowledgment, header length, flags, window, checksum, urgent pointer, options, data).

Figure 4-6. Network ID, host ID, and port number in an Ethernet internetwork.

The fields of the TPDU are described below:
Source port and Destination port: These two fields identify the end points of the connection. Each host assigns a unique 16-bit number to each port. Figure 4-6 shows the relation of network ID, host ID and port number, while Figure 4-7 shows the encapsulation of TCP data into an Ethernet frame.

Figure 4-7. Encapsulation at each layer for TCP data on an Ethernet (Ethernet header, IP header, TCP header, data, Ethernet trailer).

Sequence number: This field gives the sequential position of the first byte of a segment.
Acknowledgment (ACK): This field is used for positive acknowledgment and flow control; it informs the sender how much data has been received correctly, and how much more the receiver can accept. The acknowledgment number is one more than the sequence number of the last byte that arrived correctly in order at the receiver (i.e., the next byte expected).
TCP header length: This field plays the same role as the IHL field in the IP header, since the TCP header length is not fixed.
Flag fields (URG, ACK, EOM, RST, SYN, FIN): The URG flag is set if the urgent pointer is in use; SYN indicates connection establishment; ACK indicates whether the piggyback acknowledgment field is in use (ACK = 1) or not (ACK = 0); FIN indicates connection termination; RST is used for resetting a connection that has become ambiguous due to delayed duplicate SYNs or host crashes; and EOM indicates end of message.
Window: Flow control is handled using a variable-size sliding window. The window field contains the number of bytes the receiver is able to accept. If the receiving TCP does not have enough buffers, it sets the window field to zero to stop transmission until the sender receives a non-zero window value.
Checksum: A 16-bit field used to verify the contents of the segment header and data. TCP uses 16-bit arithmetic and takes the one's complement of the one's complement sum.
Urgent pointer: It allows an application process to direct the receiving application process to accept some urgent data. If the urgent pointer marks data, TCP tells the receiving application to go into urgent mode and deliver the urgent data.

4.3.3 User Datagram Protocol (UDP)
Not every application needs the reliable connection-oriented service provided by TCP. Some applications only require the IP datagram delivery service and thus exchange messages over the network with minimal protocol overhead. UDP is an unreliable connectionless transport protocol that uses IP to carry messages between source and destination hosts. Unreliable merely means that there are no techniques (acknowledgment, retransmission of lost packets) in the protocol for verifying whether or not the data reached the destination correctly. There are a number of reasons for choosing UDP as a data transport protocol. If the size of the data to be transmitted is small, the overhead of creating connections and ensuring reliability may be greater than the cost of simply retransmitting the entire data set if necessary. Like TCP, UDP defines source and destination ports and a checksum. The length field gives the length in bytes of the packet's header and data. Figure 4.8 shows the UDP packet format.

Figure 4.8. UDP packet format (source port, destination port, length, checksum, data).

4.4 Problems with Current Standard Transport Protocols
The current standard protocols were designed in the 1970s, when transmission rates were slow (in the Kbps range) and links were unreliable. There is an extensive debate in the research community to determine whether or not these standard protocols are suitable for high speed networks and can provide distributed applications with the required bandwidth and quality of service. Many studies have shown that protocol processing, data copying and operating system overheads are the main bottlenecks that prevent applications from exploiting the full bandwidth of high speed networks. In this subsection, we highlight the problems of the current implementations of standard transport protocol functions when used in high speed networks.

4.4.1 Connection Management
Existing standard protocols may not be flexible enough to meet the requirements of a wide range of distributed applications. For example, the TCP protocol does not supply a mechanism for fast call setup or multicast transmission, two important features for distributed computing applications. In distributed computing applications, short-lived connections are needed and the overhead associated with connection setup is prohibitively expensive. Therefore, the option of implicit connection setup is a desirable feature.
This method is used in some newer transport protocols such as the Versatile Message Transaction Protocol (VMTP) and the Xpress Transfer Protocol (XTP); it allows data to be transmitted along with the connection setup packet. Furthermore, the response or acknowledgment packets can also be used to transfer response data; in applications such as database queries and requests for file downloads, it is desirable that the response to any of these requests contain data and at the same time acknowledge the connection setup packet or the receipt of the request.

Packet Format
A characteristic of older transport protocols is that packet formats were designed to minimize the number of transmitted bits. This is because previous transport protocols were designed when the network transmission rate was in the Kbps range, several orders of magnitude slower than the transmission rate of emerging high speed networks. This resulted in packet formats with bit-packed architectures that require extensive decoding. Packet fields often have variable sizes to reduce the number of unnecessary bits transmitted, and may change location within different packet types. This design, while conserving bits, leads to unacceptable delays in acquiring, processing and storing packets in high speed networks. In ATM networks operating at the OC-3 transmission rate (155 Mbps), the computer has roughly a few microseconds to receive, process and store each incoming ATM cell. Consequently, the packet structure for high-speed protocols is critical. All fields within a packet must be of fixed length and should fall on byte or word (usually multiple-byte) boundaries, even if this requires padding fields with null data. This leads to simpler decoding and faster software implementation. In addition, by placing header information in the proper order, parallel processing can significantly reduce the time needed to process incoming packets. Furthermore, efficient transmission requires the packet size to be large. The packet size in current transport protocols is believed to be too small for high speed networks. It is desirable to make the packet size large enough to exploit the high speed transmission rate and accommodate applications with bulk data transfer requirements.

Buffer Management
Buffer management processing is critical to achieving a high performance transport protocol implementation. The main responsibilities of buffer management are writing data to a buffer as it is received, forwarding data to buffers for retransmission, and reordering out-of-sequence packets. One study has shown that 50% of TCP processing time is used for network-to-memory copying [Port, 1991]. In layered network architectures, data is copied between the buffers of the different layers. This introduces significant overhead. A solution proposed by [Wood, 1989] is called buffer cut-through: one buffer is shared between the different layers and only the address of the packet is moved between the layers. This reduces the number of data copies. For bulk data transfer with large packet sizes, this approach provides a significant performance improvement. Another important issue is determining the buffer size that must be maintained at both the sender and receiver ends. For the sender to transfer data reliably, it needs to keep in its buffer all the packets that have been sent but not yet acknowledged. Furthermore, to avoid the stop-and-wait problem in high speed networks, the buffer size should be large enough to allow the sender to keep packets in flight rather than stopping and waiting for acknowledgments. At the receiver side, it is important to have enough buffer space to store packets received out of sequence. One can estimate the maximum amount of buffering required as enough to store all the packets that can be transmitted during one round-trip delay over the longest possible path [Part, 1990].
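That estimate is simply the bandwidth-delay product of the path. The fragment below computes it for two illustrative combinations of link rate and round-trip time; the specific numbers are assumptions chosen for the example, not figures taken from the text.

```python
# Maximum buffering ~ bandwidth-delay product: link rate x round-trip time.
def bandwidth_delay_product(rate_bps, rtt_seconds):
    return rate_bps * rtt_seconds / 8      # bytes in flight during one RTT

# Illustrative (assumed) link rates and round-trip times:
for rate_bps, rtt in [(155e6, 0.05), (1e9, 0.1)]:
    bdp_bytes = bandwidth_delay_product(rate_bps, rtt)
    print(f"{rate_bps / 1e6:6.0f} Mbps, RTT {rtt * 1e3:4.0f} ms -> "
          f"{bdp_bytes / 1e6:5.2f} MB per round trip")
```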
Furthermore, to avoid the stop-and-wait problem in high speed networks, the buffer size should be large enough to allow the sender to keep packets in flight rather than stopping and waiting for acknowledgments. At the receiver side, it is important to have enough buffer space to store packets received out of sequence. The maximum amount of buffering required can be estimated as enough space to store all packets that can be transmitted during one round-trip delay over the longest possible path [Part, 1990]; for example, on a 1 Gbps path with a 100 ms round-trip delay this is roughly 10^9 bits/s x 0.1 s, or about 12.5 Mbytes.
4.4.2 Flow Control
The goal of flow control in a transport protocol is to match the data transmission rate with the receiver's data consumption rate. The flow control algorithms used in the current transport protocols might not be suitable for high-speed networks [Port, 1991]. For example, window-based flow control adjusts the flow of data by bounding the number of packets that can be sent without any acknowledgement from the receiver. If the window size is chosen too small or inappropriately, the transmitter will send short bursts of data separated by pauses while it waits for permission to transmit again. This is referred to as lock-step transmission and should be avoided, especially in high speed networks. In high speed networks, one needs to open a large window to achieve high throughput over long-delay channels. But opening a large window has little effect on flow control, because a window only conveys how much data can be buffered, not how fast the transmission should proceed. Moreover, the window mechanism ties flow control to error control and therefore becomes vulnerable in the face of data loss [Clar, 1989].
4.4.3 Acknowledgment
Most current transport protocol implementations use cumulative acknowledgment. In this scheme, acknowledging the successful receipt of a packet with sequence number N indicates that all packets up to N have been successfully received. In high speed networks with long propagation delays, thousands of packets may have been transmitted before the transmitter receives an acknowledgement. If the first packet is received in error, all the subsequent thousands of packets must be retransmitted even though they were received without error. This severely degrades the effective transmission rate that can be achieved in high speed networks. Selective acknowledgment has been suggested to make the acknowledgment mechanism more efficient; however, it has more overhead, as discussed in the next subsection. The concept of blocking has also been suggested as a means to reduce the overhead associated with acknowledgment [Netr, 1990].
4.4.4 Error Handling
Error recovery in most existing protocols relies heavily on timers, which are inefficient and expensive to maintain. When packets are lost, most reliable transport protocols use timers to trigger resynchronization of the transmission state. Variation in Round Trip Delays (RTD) requires that timers be set to longer than one RTD, and finding a good balance between a timer value that is too short and one that is too long can be very difficult. The performance loss due to timers is particularly high when the round trip time is long. The retransmission of undelivered or erroneous data, which can include an entire window of data or only the erroneous data, can be triggered either by transmitter timeouts while waiting for acknowledgment or by the receiver initiating a negative acknowledgment. The receiver can use a go-back-N or a selective acknowledgment scheme, as contrasted in the sketch below.
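To make the trade-off between the two schemes concrete, the following Python sketch (our own illustration, not code from any of the protocols cited here) contrasts a go-back-N receiver, which discards out-of-sequence packets and returns a cumulative acknowledgment, with a selective-repeat receiver, which buffers them; the extra dictionary in the second class is precisely the bookkeeping overhead referred to above.

# Packets are simplified to (seq, data) pairs and sequence numbers do not wrap.

class GoBackNReceiver:
    """Accepts only the next in-order packet; everything else is discarded."""
    def __init__(self):
        self.expected = 0
        self.delivered = []

    def on_packet(self, seq, data):
        if seq == self.expected:
            self.delivered.append(data)
            self.expected += 1
        # Out-of-order packets are dropped; the cumulative ACK forces the
        # sender to retransmit everything from self.expected onward.
        return self.expected - 1              # cumulative acknowledgment

class SelectiveRepeatReceiver:
    """Buffers out-of-order packets and acknowledges exactly what arrived."""
    def __init__(self):
        self.expected = 0
        self.buffer = {}                      # out-of-order packets kept here
        self.delivered = []

    def on_packet(self, seq, data):
        if seq >= self.expected:
            self.buffer[seq] = data
        while self.expected in self.buffer:   # deliver any in-order run
            self.delivered.append(self.buffer.pop(self.expected))
            self.expected += 1
        return sorted(self.buffer)            # selective ack: sequence numbers still held out of order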
The go-back-N algorithm is easy to implement because once a packet is received in error or out of sequence, all successive packets are simply discarded. This approach reduces the amount of information that must be maintained and buffered. A selective acknowledgment scheme has more overhead, since the receiver needs to store all packets received out of order. In high speed networks the transmission error rate is low (on the order of 10^-9), and most packet loss is due to network and receiver overruns. While go-back-N is simple to implement, it does not solve the problem of receiver overruns [Port, 1991]. Selective acknowledgment schemes seem more suitable, and their bookkeeping overhead can be reduced by acknowledging blocks instead of packets: if any packet in a block is delivered incorrectly, the entire block is retransmitted. As the window size (the maximum number of blocks that can be sent) increases, the throughput increases.
One efficient error detection/correction method is to exchange complete state information frequently and periodically between the sender and receiver, regardless of the state of the protocol. This simplifies protocol processing by removing some of the elaborate error recovery procedures and makes it easy to parallelize the protocol processing, leading to higher performance [Netr, 1990]. Furthermore, periodic exchange of state information combined with blocking makes the throughput independent of the round trip delay while reducing processing time. In this case only one timer is needed, at the receiver side, and it must be adjusted only after each block time, not for each packet. However, this might not hold for interactive applications, which transmit small amounts of data and for which latency is the key performance parameter [Port, 1991]. Although selective acknowledgement is more difficult to implement, the reduction in retransmitted packets can offer higher retained throughput under errored or congested conditions and can help alleviate network congestion. One study has shown that selective retransmission can provide approximately twice the bandwidth of go-back-N algorithms under errored conditions. Results also showed that selective retransmission combined with rate control provides a high-performance method of communication between two devices of disparate capacities [Port, 1991].
4.5 High Speed Transport Protocols
The applications of high performance distributed systems (distributed computing, real-time interactive multimedia, digital imaging repositories, and compute-intensive visualizations) have bandwidth and latency requirements that are difficult to meet using the current standard transport protocols. Researchers have extensively studied the limitations of these protocols and proposed a wide range of solutions for high speed networks. These techniques can be broadly grouped into software-based and hardware-based techniques (see Figure 4.9).
Figure 4.9. Classification of high speed transport protocols: software techniques (improving existing protocols, or new protocols with static, adaptive, or programmable structure) and hardware techniques (VLSI, parallel processing, host interface).
The software-based approach can be further divided into approaches that aim at improving the implementation of existing protocols, using upcalls and better buffer management techniques, and approaches that hold that the current protocols cannot cope with high speed networks and their applications and therefore propose new protocols with either static or dynamic structures (see Figure 4.9). The hardware-based approach aims at using hardware techniques to improve protocol performance by implementing the whole protocol in VLSI chips (e.g., the XTP protocol), by applying parallel processing techniques to reduce the processing time of protocol functions, or by off-loading all or some of the protocol processing functions to the host network interface hardware. In this section, we briefly review the approaches used to implement high speed protocols and discuss a few representative protocols.
4.5.1 Software-based Approach
4.5.1.1 Improve Existing Protocols
Careful implementation of a communication protocol is critical to improving performance. It has been argued that the implementation of the current transport protocols, and not their design, is the major source of processing overhead [Clar, 1989]. The layering approach of standard protocols suffers from redundant execution of protocol functions at more than one layer and from excessive data copying overhead. Some researchers have suggested improving the performance of standard protocols by changing their layered implementation approach or by reducing the overhead associated with redundant functions and data copying. Clark proposed the upcalls approach to describe the structure and processing of a communication protocol; upcalls reduce the number of context switches by making the sequence of protocol execution more logical. Other researchers attempt to reduce data copying by providing better buffer management schemes. Woodside et al. [Wood, 1989] proposed Buffer Cut-Through (BCT), in which a buffer is shared among the network layers. In the conventional method, each layer has input and output buffers shared only with the layers immediately above and below it, as shown in Figure 4.10(a). In BCT, one data buffer is shared among all layers, as shown in Figure 4.10(b). When layer i receives a signal from layer i+1 indicating a new packet, layer i accesses the shared buffer to read the packet, processes it, and passes the address of the packet along with a signal to layer i-1.
Figure 4.10. (a) Separate buffers between each pair of layers; (b) the Buffer Cut-Through approach with a single shared buffer.
Another example is Integrated Layer Processing (ILP) [Clar, 1990; Brau, 1995; Abbo, 1993]. In ILP, the data manipulations of the different layers are combined instead of having data loaded and stored for every manipulation required at each layer. This significantly reduces the overhead associated with data copying. The X-Kernel [Hutc, 1991; Mall, 1992] is a communication-oriented operating system that provides support for efficient implementation of standard protocols. The X-Kernel uses upcalls and an improved buffer management scheme to drastically reduce the amount of data copying required to process protocol functions. In this approach, a protocol is defined as a specification of an abstraction through which a collection of entities exchanges messages. The X-Kernel provides three primitive communication objects: protocols, sessions, and messages.
A protocol object corresponds to a network protocol (e.g., TCP/IP, UDP/IP, TP4), and the relationships between protocols are defined at kernel configuration time. A session object is a dynamically created instance of a protocol object that contains the data structures representing the state of a network connection. Messages are active objects that move through the session and protocol objects in the kernel. To reduce context switching overhead, a process executing in user mode is allowed to change to kernel mode (this corresponds to making a system call), and a process executing in kernel mode is allowed to invoke a user-level function (this corresponds to an upcall). In addition, the X-Kernel provides a set of efficient routines that include buffer management routines, mapping and binding routines, and event management routines.
4.5.1.2 New Transport Protocols
The layered approach to implementing communication protocols has inherent disadvantages for high speed communication networks: functions are replicated in different layers (e.g., error control is performed by both the data link and transport layers), control messages carry high overhead, and parallel processing techniques cannot be applied efficiently to reduce protocol processing time. Furthermore, the layered approach optimizes the implementation of each layer in isolation rather than producing an overall optimal implementation across layers. Many new protocols have been introduced to address the limitations of standard protocols; they tend to be optimized for certain classes of applications. In what follows, we discuss three representative research protocols: the Network Block Transfer (NETBLT) protocol, the Versatile Message Transaction Protocol (VMTP), and the Xpress Transfer Protocol (XTP).
Network Block Transfer (NETBLT)
NETBLT is a transport protocol [Clar, 1987] intended for the efficient transfer of bulk data. Its main features are as follows.
Connection Management: NETBLT is a connection-oriented protocol. The sending and receiving ends are synchronized at the buffer level. Per-buffer interaction is more efficient than per-packet interaction, especially in bulk data transfer applications over long-delay channels. During data transfer, the transmitting unit breaks the buffer into a sequence of packets. When the whole buffer has been received, the receiving station acknowledges it so that the transmitting node can move on to the next buffer, and so on until all the data has been transmitted. To reduce the overhead of synchronizing the transmitting and receiving states, NETBLT maintains a control timer at the receiving side, initialized with the estimated time to transmit a buffer; the timer is started when the first packet of a buffer is received and cleared when the whole buffer has been received.
Acknowledgement: Timer-based acknowledgement techniques are costly to maintain, and appropriate timeout intervals are difficult to determine. NETBLT minimizes the use of timers and the overhead associated with them by keeping the timer at the receiver end. Furthermore, NETBLT uses selective acknowledgement to synchronize the states of the transmitter and receiver. In this scheme, NETBLT acknowledges the receipt of whole buffers (large data blocks) rather than individual small packets, which reduces acknowledgement overhead; the sketch below illustrates this buffer-level interaction.
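The following Python sketch is a simplified illustration of buffer-level transfer with selective retransmission; it is our own construction and does not reproduce NETBLT's actual packet formats, but it mirrors the GO/RESEND/OK style of interaction described in the next paragraphs.

PACKET_SIZE = 1024

def split_buffer(buf):
    """The transmitter breaks a buffer into a sequence of fixed-size packets."""
    return [buf[i:i + PACKET_SIZE] for i in range(0, len(buf), PACKET_SIZE)]

class BufferReceiver:
    def __init__(self, num_packets):
        self.expected = num_packets
        self.received = {}                     # packet index -> data

    def on_packet(self, index, data):
        self.received[index] = data

    def on_timer(self):
        """Called when the per-buffer control timer expires."""
        missing = [i for i in range(self.expected) if i not in self.received]
        if not missing:
            return ("OK",)                     # the whole buffer arrived
        return ("RESEND", missing)             # selective retransmission request

# Example: lose packet 2 of a 5-packet buffer.
packets = split_buffer(b"a" * 5000)
rx = BufferReceiver(len(packets))
for i, p in enumerate(packets):
    if i != 2:
        rx.on_packet(i, p)
print(rx.on_timer())                           # ('RESEND', [2])

Because the receiver knows the transmission rate and the number of packets expected for the buffer, it can set its control timer accurately and request only the packets that are actually missing.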
Flow Control: NETBLT adopts rate-based flow control. Unlike window-based flow control, rate control works independently of the network round-trip delay and of error recovery. The sender transmits data at the currently acceptable transmission rate, as determined by the current capacity of the network and of the receiver to handle incoming traffic; if the network is congested, the rate must be decreased. There is also no error recovery working outside of the standard flow-control mechanism, since NETBLT places retransmitted data in the same queue as new data, so all data leave the queue at the current transmission rate. Moreover, rate control reduces the reliance on timers and allows them to be set more accurately, since the retransmission timer is based on the current transmission rate rather than on the round trip delay.
Error Handling: NETBLT reduces recovery time by placing the retransmission timer at the receiver side. The receiver can easily determine the appropriate timeout and which packets need to be retransmitted; it can accurately estimate the timeout period because it knows the transmission rate and the expected number of packets, based on the size of the buffer allocated at the receiver side [Clar, 1987]. NETBLT synchronizes the connection states at the two ends and recovers from errors via three control messages (GO, RESEND, and OK) combined with a selective repeat mechanism. Although control messages are transmitted with greater reliability, a control message will occasionally be lost; consequently, NETBLT maintains a control-message retransmit timer based on the network round trip delay.
Versatile Message Transaction Protocol (VMTP)
VMTP is a transport protocol designed to support remote procedure calls, group communications, multicast and real-time communications [Cher, 1989; Cher, 1986]. Its main features can be outlined as follows.
Connection Management: VMTP is a request-response protocol that supports transaction-based communication. Since most of its data units, or transactions, are small, VMTP uses implicit connection setup: the first packet both establishes the connection and carries data. VMTP provides a streaming mode that can be used for file transfer, conversation support for higher-level modules, stable addressing, and message transactions. The advantages of using transactions are higher-level conversation support, minimal packet exchange, ease of use, and shared communication code.
Acknowledgment: VMTP uses selective acknowledgement to reduce the overhead of retransmitting correctly received packets. It reduces the overhead of processing acknowledgement packets by using a bit mask, which provides a simple, fixed-length way of specifying which packets were received and of indicating the position of a packet within the packet stream.
Error Handling: VMTP employs different techniques to deal with duplicates (duplicate request packets, duplicate response packets, multiple responses to a group message transaction, and idempotent transactions), allowing the most efficient technique to be used in each situation. In contrast, TCP requires a 3-way handshake for each circuit setup and tear-down to deal with delayed duplicates.
Flow Control: VMTP uses rate-based flow control applied to a group of packets. The transmitter sends a group of packets as one operation, and the receiver accepts and acknowledges this group as one unit before further data is exchanged; a sketch of this style of rate-based, group-at-a-time pacing is given below.
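As a rough sketch of rate-based flow control applied to packet groups (our illustration; parameter names such as rate_bps and group_size are not taken from VMTP or NETBLT), the sender below bursts one group of packets and then pauses just long enough that its long-term rate never exceeds the value negotiated with the receiver.

import time

def send_at_rate(packets, send_one, rate_bps, group_size=16):
    """send_one(pkt) transmits a single packet; rate_bps caps the send rate."""
    for start in range(0, len(packets), group_size):
        group = packets[start:start + group_size]
        group_bits = sum(len(p) for p in group) * 8
        t0 = time.monotonic()
        for pkt in group:
            send_one(pkt)                     # burst the whole group as one operation
        # The inter-group gap enforces the rate independently of round-trip delay.
        gap = group_bits / rate_bps - (time.monotonic() - t0)
        if gap > 0:
            time.sleep(gap)

# Example: cap an in-memory transfer at roughly 10 Mbit/s.
send_at_rate([b"x" * 1024] * 64, send_one=lambda p: None, rate_bps=10_000_000)

Because the pacing depends only on rate_bps and not on acknowledgments, the sending pattern is unaffected by long round-trip delays, which is the property the text above attributes to rate control.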
This packet group approach simplifies the protocol and provides an efficient flow control mechanism.
Xpress Transfer Protocol (XTP)
XTP is a communication protocol designed to support a wide range of applications, from distributed computing to real-time multimedia [Prot, 1989; Ches, 1987a; Ches, 1987b]. XTP provides all the functionality supported by the standard transport protocols (TCP, UDP and ISO TP4) plus new services such as multicasting, multicast group management, priority, quality of service support, rate and burst control, and selectable error and flow control mechanisms. The XTP architecture supports efficient VLSI implementation and parallel processing of the protocol functions.
Connection Management: XTP supports three types of connection management: 1) a connection-oriented service, 2) a connectionless service, and 3) a fast negative acknowledgement mechanism. The XTP connection-oriented service provides efficient reliable packet transmission. For example, reliable packet transmission in TCP and ISO TP4 requires exchanging six packets (two to set up and acknowledge the connection, two to send and acknowledge the data, and two to close and acknowledge the release of the connection); XTP uses three packets instead of six because it establishes and releases connections implicitly. The XTP connectionless service is similar to the UDP best-effort delivery service: the receiver does not acknowledge receiving the packet, and the transmitter never knows whether the packet was delivered properly to its destination. The third option, fast negative acknowledgement, can lead to fast error recovery in special scenarios: a receiver that recognizes packets arriving out of sequence can notify the transmitter immediately rather than waiting for a timeout to trigger the report.
Flow Control: XTP flow control allows the receiver to inform the sender about the state of its receiving buffers. XTP supports three flow control mechanisms: 1) window-based flow control, used for normal data transmission; 2) a reservation mode that guarantees data will not be lost due to buffer starvation at the receiver side; and 3) no flow control, which disables the flow control mechanism and may be useful for multimedia applications. XTP supports rate-based as well as window-based flow control. Rate control restricts the size and time spacing of bursts of data from the sender: within any small time period, the number of bytes that the sender transmits must not exceed the ability of the receiver (or intermediate routers) to process (decipher and queue) the data. Two parameters (RATE and BURST) allow the receiver to tune the transmission rate to an acceptable level; before the first control packet arrives, the transmitter sends packets at a default rate. The RATE parameter limits the amount of data that can be transmitted per unit of time, while the BURST parameter limits the size of an individual burst.
Error Handling: XTP supports several mechanisms for error control, such as go-back-n and selective retransmission. TCP responds to errors with a go-back-n algorithm, which may be adequate in a local area network but degrades performance significantly in high speed and/or high latency networks (e.g., satellite links).
XTP supports selective retransmission, in addition to go-back-n, in which the receiver acknowledges spans of correctly received data and the sender retransmits the packets in the gaps. To speed up checksum processing, the checksum parameters are placed in a common trailer in XTP (unlike TCP, whose checksum field is carried in the header, ahead of the data).
4.5.1.3 Adaptable Transport Protocols
The existing standard transport protocols tend to be statically configured; that is, they define one algorithm to implement each protocol mechanism (flow control, error control, acknowledgement, and connection management). However, the next generation of network applications will have widely varying Quality of Service (QOS) requirements. For example, data applications require error-free transmission and have no time constraints, while real-time multimedia applications require high bandwidth coupled with moderate delay and jitter and some tolerance of transmission errors. Furthermore, the characteristics of networks change dynamically during the execution of applications. These factors make it increasingly important for next generation high speed transport protocols to be flexible and adaptive to changes in network characteristics as well as to application requirements. In the adaptive transport protocol approach, the configuration of the protocol can be modified to meet the requirements of the application. For example, bulk data transfer applications run efficiently if the protocol implements an explicit connection establishment mechanism, whereas transaction-based applications call for implicit connection management, because explicit connection establishment imposes an intolerable overhead. Many techniques have been proposed to implement adaptive communication protocols [Dosh, 1992; Hosc, 1993; Hutc, 1989]. There are two ways to make a communication protocol adaptive: 1) the various protocol mechanisms can be optionally configured and incorporated at connection establishment time, for example by using genuine dynamic linking [Wils, 1991]; or 2) the protocol mechanisms can be changed during operation, so that the protocol can optimize itself for instantaneous changes in network conditions and application QOS requirements. Many researchers have proposed protocols that are functionally decomposed into a set of modules (or building blocks) that can be configured dynamically to meet the QOS requirements of a given application. In [Eyas, 1996; Eyas, 2002] (E. Al-Hajery and S. Hariri, "Application-Oriented Communication Protocols for High-Speed Networks," International Journal of Computers and Applications, Vol. 9, No. 2, 2002, pp. 90-101), a framework is presented in which communication protocol configurations can be changed dynamically to meet application requirements and network characteristics. In this framework, a communication protocol is represented as a set of protocol functions, where each protocol function can be implemented using one (or more) of its corresponding protocol mechanisms; for example, the flow control function can be implemented using a window-based or a rate-based mechanism. Figure 4.11 describes the process of constructing protocol configurations using this framework. The user either selects one of the pre-defined protocol configurations (e.g., TCP/IP) or customizes a protocol configuration. All the protocol mechanisms are stored in the Protocol Function DataBase (PFDB). Using the provided set of user-interface primitives, the user can program the required protocol configuration by specifying the appropriate set of protocol mechanisms, as the hypothetical sketch below illustrates.
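As a purely hypothetical sketch of this style of composition (the dictionary names, mechanism labels, and build_protocol function below are our own and are not part of the framework of [Eyas, 1996; Eyas, 2002]), a protocol configuration can be assembled by selecting one mechanism for each protocol function from a mechanism database:

# A toy "PFDB": each protocol function maps mechanism names to implementations.
PFDB = {
    "connection":     {"explicit": "explicit_setup", "implicit": "implicit_setup"},
    "flow_control":   {"window": "window_flow",      "rate": "rate_flow"},
    "error_control":  {"go_back_n": "gbn",           "selective": "selective_repeat"},
    "acknowledgment": {"cumulative": "cum_ack",      "periodic": "periodic_state"},
}

def build_protocol(requirements):
    """Map a per-function choice of mechanisms onto entries of the database."""
    config = {}
    for function, mechanism in requirements.items():
        try:
            config[function] = PFDB[function][mechanism]
        except KeyError:
            raise ValueError(f"no mechanism '{mechanism}' for function '{function}'")
    return config

# A transaction-oriented application might request:
transaction_profile = build_protocol({
    "connection": "implicit",
    "flow_control": "rate",
    "error_control": "selective",
    "acknowledgment": "periodic",
})

A bulk-transfer profile would instead select explicit connection management and window-based flow control, which is exactly the kind of per-application tailoring the framework is intended to support.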
Figure 4.11. Adaptable communication protocol framework: user service parameters and the Protocol Function DataBase (PFDB) feed a Protocol Generation Unit (for predefined or tailored protocols), which passes protocol configuration and hardware specifications to a Protocol Implementation Unit and the hardware platform, guided by a network monitor.
A flexible hardware implementation is then used to realize efficiently the protocols configured with this framework. Figure 4.12 illustrates an example of three adaptable protocols, each configured to meet the requirements of one class of applications. Protocol 1 is configured with explicit connection establishment and release, Negative Acknowledgment (NACK) error reporting, rate-based flow control, and periodic synchronization. In a similar manner, Protocols 2 and 3 are configured with different combinations of protocol mechanisms.
Figure 4.12. A network-based implementation of the adaptable protocol framework: three protocol configurations built from different connection management (explicit or implicit), acknowledgment (NACK or PACK), error control (go-back-n), flow control (transaction-based, rate-based, or window-based) and synchronization (periodic) mechanisms.
4.5.2 Hardware-based Approach
The main advantage of this approach is that it yields fast protocol implementations; its main limitation is cost. The approach can be further classified into three types: programmable networks, in which the protocol functions are implemented by the network devices themselves; the parallel processing approach, in which multiple processors are used to implement the protocol functions; and the high-speed network interface approach, in which some or all of the protocol functions are off-loaded from the host to run on the interface hardware.
4.5.2.1 Programmable Network Methodologies
Two main techniques have been aggressively pursued to achieve network programmability: 1) Open Interface, an approach spearheaded by the Opensig community, and 2) Active Networks (AN), an approach established by DARPA, which funded several projects that address programmability at the network, middleware, and application levels [CAMP, 2001; CLAV, 1998].
Open Interface: This approach models the communication hardware as a set of open programmable network interfaces that provide open access to switches and routers through well-defined Application Programming Interfaces (APIs). These open interfaces provide access to the node resources and allow third-party software to manipulate or reprogram them to support the required services. Consequently, new services can be created on top of what is provided by the underlying hardware, as shown in Figure 4.13.
Figure 4.13. Open interface approach: the functionality of a Network Element (NE) is exposed to the outside world through an open interface between algorithms and resources.
Recently, IEEE Project 1520 [DENA, 2001] was launched to standardize programmable interfaces for network control and signaling for a wide range of network nodes, from ATM switches and IP routers to mobile telecommunications network nodes. The P1520 Reference Model (RM) is shown in Figure 4.14 along with the functions of each of its four layers.
The interfaces support a layered design approach in which each layer offers services to the layer above it while using the components below it to build those services. Each level comprises a number of entities, in the form of algorithms or objects, representing logical or physical resources depending on the level's scope and functionality. The P1520 model defines only the interfaces, leaving the actual implementation and protocol-specific details to the vendor or manufacturer.
Figure 4.14. The IEEE P1520 Reference Model: a value-added services level (V interface, algorithms for value-added communication services created by network operators, users, and third parties), a network generic services level (U interface, algorithms for routing, connection management, directory services, and so on), a virtual network device level (L interface, the software representation of the device), and a physical element level (CCM interface to the hardware and name space).
The reference model defines the following interfaces: 1) the V interface provides access to the value-added services level; 2) the U interface deals with generic network services; 3) the L interface defines the API for directly accessing and manipulating the local device's network resource states; and 4) the Connection Control and Management (CCM) interface is a collection of protocols (e.g., GSMP) that enable the exchange of state and control information between a device and an external agent.
Active Networks (AN)
This approach adopts the dynamic deployment of new services at runtime. Code mobility, in the form of active packets, is the main mechanism for program delivery, control and service construction, as shown in Figure 4.15. Packets are the means for delivering code to remote nodes and form the basis for network control. Packets may contain executable code or instructions for executing a particular program at the destination node and, depending on the application, may carry data along with the instructions; the destination node processes the attached data according to those instructions. Usually the first packet(s) of a flow carry only the instructions and the subsequent packets carry the data; the node at the other end processes the data packets based on the instructions from the first packet(s). Packets that carry instructions are popularly referred to as mobile agents. Active networks provide great flexibility in deploying new services in a network, but they also require stronger security and authentication: before executing the code, the remote node needs to authenticate the user and its privileges and check that the new operation is feasible under the current node conditions and will not affect the processes already running on the node.
Figure 4.15. Active network approach: active packets (active signalling, etc.) modify the behavior of a network element by executing on the active device's processing resources.
The active networks approach allows customization of network services at the packet transport level, rather than through a programmable control interface. It provides maximum flexibility, but it also adds considerable complexity to the programming model, as the toy sketch below suggests.
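The following toy sketch (ours; it does not correspond to any real active network platform) shows the basic mechanism: the first packet of a flow installs a small program at the node, and subsequent data packets of that flow are processed by it. A real system would authenticate and resource-limit the code before accepting it, as noted above.

class ActiveNode:
    def __init__(self):
        self.programs = {}                      # flow id -> compiled code

    def on_packet(self, flow_id, kind, payload):
        if kind == "code":
            # In a real system the code would be authenticated and resource-
            # limited before being accepted; here we simply compile it.
            self.programs[flow_id] = compile(payload, "<active>", "eval")
            return None
        handler = self.programs.get(flow_id)
        if handler is None:
            return payload                      # no program installed: forward unchanged
        return eval(handler, {"__builtins__": {}}, {"data": payload})

node = ActiveNode()
node.on_packet(7, "code", "data.upper()")       # first packet installs the program
print(node.on_packet(7, "data", "telemetry"))   # prints 'TELEMETRY'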
4.5.2.2 Host-Network Interface
Every packet being transmitted or received goes through the host-network interface, so its design plays an important role in exploiting the high bandwidth offered by high-speed networks. Recently, there has been an intensified research effort to investigate the limitations of current host-network interface designs and to propose new architectures for high performance network interfaces [Davi, 1994; Stee, 1992; Rama, 1992]. In a conventional network interface, shown in Figure 4.16, the system bus is used extensively during data transfers: the host processor reads data from the user buffer and writes it to the kernel buffer; protocol processing is performed, and the checksum is then calculated byte by byte by reading from the kernel buffer; finally, the data is moved from the kernel buffer to the network interface buffer. With this interface, every word being transmitted or received crosses the system bus six times.
Figure 4.16. Bus-based host network interface: (1) write data to the user buffer; (2, 3) move data from the user buffer to the kernel buffer; (4) read data to calculate the checksum; (5, 6) move data to the network buffer.
Figure 4.17. DMA-based host network interface.
It is apparent that the traditional network interface involves a large number of data transfers, and this throttles the speed of data transmission. This was not a problem when the approach was introduced in the early 1970s, because the processor was three orders of magnitude faster than the network; the situation is now reversed, and the network is much faster than the processor at handling packets. To alleviate this overhead, data copying should be minimized. Figure 4.17 shows an improved host-network interface in which Direct Memory Access (DMA) is used to move data between the kernel buffer and the network buffer, and the checksum is calculated on the fly by additional hardware circuitry while the data is transferred. This reduces the total number of data transfers to four, compared with six for the previous interface, and performs the checksum calculation in parallel with the data transfer. It introduces some additional per-packet overhead, such as setting up the DMA controller, but for large data transfers this overhead is insignificant compared with the performance gained by reducing the number of data transfers. The performance of the host-network interface can be improved further by sharing the network buffer with the host computer, as shown in Figure 4.18 [Bank, 1993]. In this design, data is copied directly from the user buffer to the network buffer (we refer to this as the zero-copying host network interface), and checksumming is done on the fly while the data is copied (see the sketch below). This reduces the number of data transfers to only two. The network buffer should be large enough to allow protocol processing and to keep a copy of the data until it has been properly transmitted and acknowledged.
Figure 4.18. Zero-copying host network interface.
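The "checksum on the fly" idea can be made concrete with a small sketch (ours, not driver code): an Internet-style 16-bit one's complement checksum is accumulated in the same pass that copies the data, so the payload is touched only once.

def copy_with_checksum(src: bytes, dst: bytearray) -> int:
    """Copy src into dst and return the 16-bit one's complement checksum,
    computed in the same pass so the data is touched only once."""
    if len(src) % 2:                             # pad odd-length data with a zero byte
        src = src + b"\x00"
    total = 0
    for i in range(0, len(src), 2):
        word = (src[i] << 8) | src[i + 1]
        dst[i:i + 2] = src[i:i + 2]              # the "copy" half of the combined loop
        total += word
        total = (total & 0xFFFF) + (total >> 16) # fold in the end-around carry
    return ~total & 0xFFFF                       # one's complement of the one's complement sum

buf = bytearray(12)
csum = copy_with_checksum(b"hello world!", buf)

In the DMA-based and zero-copy designs described above, this accumulation is performed by hardware as the words stream past, so no separate checksum pass over memory is needed.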
To reduce the processing burden on the host, the transport and network protocols can be off-loaded from the host to a host-network interface processor, as shown in Figure 4.19 [Macl, 1991; Jain, 1990b; Kana, 1988]. Offloading protocol processing to the host-network interface has several advantages: it reduces the load on the host computer, it makes application processing on the host more deterministic since the protocol functions execute outside the host, and it eliminates the per-packet interrupt overhead that can overwhelm packet transmission and reception (e.g., the host can be interrupted only when a whole block of packets has been received).
Figure 4.19. Off-loading protocol processing to the host-network interface (user buffer, host processor, DMA, general-purpose protocol processor with protocol buffer, network buffer).
4.5.2.3 Parallel Processing of Protocol Functions
The use of parallelism in protocol implementation is a viable approach to enhancing the performance of protocol processing [Ito, 1993; Brau, 1993; Ruts, 1992]. Since protocol processing and memory bandwidth are major bottlenecks in high performance communication subsystems, using multiprocessor platforms is a desirable way to increase performance. However, the protocol functions must be adequately partitioned in order to utilize multiprocessor systems efficiently. Different types and levels of parallelism can be used to implement protocol functions; they are typically classified according to the granularity of the unit of parallelism (coarse grain, medium grain, and fine grain) [Jain, 1990a]. A unit of parallelism can comprise a complete stack, a protocol entity, or a protocol function. Three types of parallelism can be employed in processing protocol functions [Zitt, 1994]: 1) spatial parallelism, which is further divided into Single Instruction Multiple Data (SIMD)-like parallelism and Multiple Instruction Single Data (MISD)-like parallelism; 2) temporal parallelism; and 3) hybrid parallelism. Next, we describe the mechanisms used to implement each of these types.
SIMD-Like Parallelism
In this type of parallelism, identical operations are applied concurrently to different data units (e.g., packets). Scheduling mechanisms (e.g., round-robin) may be employed to allocate packets to the different processing units. An SIMD-like organization requires only minimal synchronization among the processing units; however, it does not decrease the processing time for a single packet, it only increases the number of packets processed during a given time interval. Packets can be scheduled on a per-connection basis, in which case parallelism takes place among different connections, or on a per-packet basis, independent of their connection association. The synchronization overhead of per-packet scheduling is higher than that of per-connection scheduling (a small per-connection dispatch sketch is given after Figure 4.20). Jain et al. [Jain, 1990b; Jain, 1992] have proposed a per-packet parallel implementation of protocols in which the ISO TP4 transport protocol is implemented on the multiprocessor architecture shown in Figure 4.20. The main objective of this architecture is to handle data transfer rates in excess of a gigabit per second. The Input and Output Low Level Processors (ILLP and OLLP) handle I/O, CRC, framing, packet transfer into and out of memory, and low level protocol processing; the multiprocessor pool (MPP) handles the transport protocol processing; and the HIP (see Figure 4.20) is a high speed DMA controller that transfers packets to and from the host.
Figure 4.20. Multiprocessor implementation of the TP4 transport protocol (ILLP: Input Low Level Processor; OLLP: Output Low Level Processor; HIP: Host Interface Processor; P1 ... Pn: processor pool attached to the host bus).
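The per-connection variant of this scheduling can be sketched in a few lines of Python (our illustration, not the architecture of [Jain, 1990b]): packets are hashed on a connection identifier so that all packets of one connection are handled by the same worker, keeping per-connection state private and synchronization minimal.

import queue, threading

NUM_WORKERS = 4
work_queues = [queue.Queue() for _ in range(NUM_WORKERS)]

def dispatch(packet):
    """Per-connection scheduling: one worker owns all packets of a connection."""
    worker_id = hash(packet["conn_id"]) % NUM_WORKERS
    work_queues[worker_id].put(packet)

def worker(my_queue):
    state = {}                                   # per-connection protocol state
    while True:
        pkt = my_queue.get()
        if pkt is None:                          # shutdown sentinel
            break
        conn = state.setdefault(pkt["conn_id"], {"next_seq": 0})
        # Identical protocol processing is applied to every packet (SIMD-like);
        # only connections owned by this worker are ever touched here.
        if pkt["seq"] == conn["next_seq"]:
            conn["next_seq"] += 1

threads = [threading.Thread(target=worker, args=(q,), daemon=True)
           for q in work_queues]
for t in threads:
    t.start()
for seq in range(3):
    dispatch({"conn_id": "A", "seq": seq})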
MISD-Like Parallelism
In this type of parallelism, different tasks are applied concurrently to the same data unit. This reduces the processing time required for a single packet, since multiple tasks are processed concurrently; however, a high degree of synchronization among the processing units may be required. A protocol can be decomposed into different protocol functions that are applied concurrently to the same data unit; this is called per-function parallelism, and it incurs the highest synchronization overhead. A parallel implementation of ISO TP4 has been realized on a transputer network [Zitt, 1989]. A transputer is a 32-bit single-chip microprocessor with four bidirectional serial links that are used to build transputer networks; communication between processes is synchronized via the exchange of messages. The protocol is decomposed into a set of functions, each further decomposed into a send path and a receive path, and the protocol functions are then mapped onto different transputers.
Temporal Parallelism
This type of parallelism operates on the layered model of the network architecture using a pipeline approach, as shown in Figure 4.21. To achieve pipelining, the processing task is subdivided into a sequence of subtasks, each mapped onto a different pipeline stage, so that each stage processes different data at the same point in time. Pipelining does not decrease the processing time needed for a single packet, and since the performance of a pipeline is limited by its slowest stage, stage balancing is an important issue for increasing system throughput. The protocol stack can also be divided into send and receive pipelines, with control information that describes the connection parameters (e.g., the next expected acknowledgment and the available window) shared between the two pipelines.
Figure 4.21. Temporal parallelism based on the pipeline approach: separate send and receive pipelines (layer 1 through layer n) sharing connection data.
Hybrid Parallelism
This type of parallelism mixes more than one parallelism type on the same architecture. For example, multiple processing units can process different packets concurrently (connection-level parallelism), while each processing unit is itself organized as a pipeline.
Summary
Advances in network technology are pushing high speed networks toward terabit per second (Tbps) rates. This enables distributed interactive multimedia applications that transfer video, voice, and data within the same application, with each media type having different quality of service requirements. In this chapter, we highlighted the techniques proposed to improve the implementation of standard transport protocols, to introduce new adaptive protocols, and to implement protocols in specialized hardware. We discussed the main functions that must be performed by any transport protocol, identified the main problems that prevent the standard transport protocols (e.g., TCP, UDP) from exploiting high speed networks and serving the emerging parallel and distributed applications, and classified the techniques used to develop high speed communication protocols into two broad types: hardware-based and software-based approaches.
In the software approach, one can either improve existing protocols or introduce new protocols with static or dynamic structures. In the hardware approach, protocol processing can be off-loaded to specialized hardware or to the host-network interface. High performance host-network interface designs reduce the overhead associated with timers and data copying.
Problems
1) Briefly describe the transport protocol functions, giving specific details of their tasks.
2) What are the advantages and disadvantages of the TCP/IP protocols?
3) Briefly describe the problems of the current standard protocols and propose your own solutions.
4) What solutions are proposed in this chapter to solve the problems of the current protocols? Discuss them and compare them with your answers to Problem 3.
References
[Jain90a] N. Jain, M. Schwartz and T. R. Bashkow, "Transport Protocol Processing at GBPS," Proceedings of the SIGCOMM Symposium on Communications Architectures and Protocols, pp. 188-198, August 1990.
[Tane88] A. S. Tanenbaum, Computer Networks, 2nd Edition, Prentice-Hall, 1988.
[Port91] T. La Porta and M. Schwartz, "Architectures, Features, and Implementation of High-Speed Transport Protocols," IEEE Network Magazine, May 1991.
[Netr90] N. Netravali, W. D. Roome, and K. Sabnani, "Design and Implementation of a High-Speed Transport Protocol," IEEE Transactions on Communications, Nov. 1990.
[Clar87] D. D. Clark, M. L. Lambert, and L. Zhang, "NETBLT: A High Throughput Transport Protocol," Proceedings of the ACM SIGCOMM '87 Symposium, Vol. 17, No. 5, 1987.
[Clar89] D. Clark, V. Jacobson, J. Romkey, and H. Salwen, "An Analysis of TCP Processing Overhead," IEEE Communications Magazine, pp. 23-29, 1989.
[Part90] C. Partridge, "How Slow is One Gigabit Per Second," Symposium on Applied Computing, 1990.
[Coul88] G. F. Coulouris and J. Dollimore, Distributed Systems: Concepts and Design, Addison-Wesley Publishing Company Inc., 1988.
[Doer90] A. Doeringer, D. Dykeman, M. Kaiserswerth, B. Meister, H. Rudin, and R. Williamson, "A Survey of Light-Weight Transport Protocols for High-Speed Networks," IEEE Transactions on Communications, Vol. 11, No. 11, November 1990, pp. 2025-2038.
[Bux88] W. Bux, P. Kermani, and W. Kleinoeder, "Performance of an Improved Data Link Control Protocol," Proceedings of ICCC '88, October 1988.
[Bank93] D. Banks and M. Prudence, "A High-Performance Network Architecture for a PA-RISC Workstation," IEEE Journal on Selected Areas in Communications, Vol. 11, No. 2, pp. 191-202, 1993.
[Ito93] M. Ito et al., "A Multiprocessor Approach for Meeting the Processing Requirements for OSI," IEEE Journal on Selected Areas in Communications, Vol. 11, No. 2, pp. 220-227, 1993.
[Macl91] R. Maclean and S. Barvick, "An Outboard Processor for High Performance Implementation of Transport Layer Protocols," Proceedings of GLOBECOM '91, pp. 1728-1731.
[Hutc91] N. Hutchinson and L. Peterson, "The x-Kernel: An Architecture for Implementing Network Protocols," IEEE Transactions on Software Engineering, Vol. 17, No. 1, pp. 64-75, 1991.
[Cher89] D. Cheriton and C. Williamson, "VMTP as the Transport Layer for High-Performance Distributed Systems," IEEE Communications Magazine, June 1989.
[Cher86] D. Cheriton, "VMTP: A Protocol for the Next Generation of Communication Systems," Proceedings of ACM SIGCOMM, 1986.
[Wood89] C. M. Woodside and J. R. Montealegre, "The Effect of Buffering Strategies on Protocol Execution Performance," IEEE Transactions on Communications, Vol. 37, No. 6, pp. 545-553, 1989.
[Davi94] B. Davie et al., "Host Interfaces for ATM Networks," in High Performance Networks: Frontiers and Experience, edited by A.
Tantawy, Kluwer Academic Publishers, 1994.
[Stee92] P. Steenkiste et al., "A Host Interface Architecture for High-Speed Networks," Proceedings of the IFIP Conference on High-Performance Networks, 1992.
[Rama92] K. K. Ramakrishnan, "Performance Issues in the Design of Network Interfaces for High-Speed Networks," Proceedings of the IEEE Workshop on the Architecture and Implementation of High Performance Communication Subsystems, 1992.
[Jain90b] N. Jain et al., "Transport Protocol Processing at GBPS Rates," Proceedings of ACM SIGCOMM, August 1990.
[Kana88] H. Kanakia and D. Cheriton, "The VMP Network Adapter Board (NAB): High-Performance Network Communication for Multiprocessors," Proceedings of ACM SIGCOMM, August 1988.
[Brau93] T. Braun and C. Schmidt, "Implementation of a Parallel Transport Subsystem on a Multiprocessor Architecture," Proceedings of the High-Performance Distributed Computing Symposium, July 1993.
[Ruts92] E. Rutsche and M. Kaiserswerth, "TCP/IP on the Parallel Protocol Engine," Proceedings of the IFIP Fourth International Conference on High Performance Networking, Dec. 1992.
[Zitt94] M. Zitterbart, "Parallelism in Communication Subsystems," in High Performance Networks: Frontiers and Experience, edited by A. Tantawy, Kluwer Academic Publishers, 1994.
[Jain92] N. Jain et al., "A Parallel Processing Architecture for GBPS Throughput in Transport Protocols," Proceedings of the International Conference on Communications, 1992.
[Zitt89] M. Zitterbart, "High-Speed Protocol Implementations Based on a Multiprocessor Architecture," in H. Rudin and R. Williamson (editors), Protocols for High-Speed Networks, Elsevier Science Publishers, 1989.
[Clar90] D. Clark and D. Tennenhouse, "Architectural Considerations for a New Generation of Protocols," Proceedings of the ACM SIGCOMM Symposium, 1990.
[Brau95] T. Braun and C. Diot, "Protocol Implementation Using Integrated Layer Processing," Proceedings of ACM SIGCOMM, 1995.
[Abbo93] M. Abbott and L. Peterson, "Increasing Network Throughput by Integrating Protocol Layers," IEEE/ACM Transactions on Networking, Vol. 1, No. 5, pp. 600-610, Oct. 1993.
[Mall92] S. O'Malley and L. Peterson, "A Dynamic Network Architecture," ACM Transactions on Computer Systems, Vol. 10, No. 2, pp. 110-143, May 1992.
[Eyas96] Eyas Al-Hajery, "Application-Oriented Communication Protocols for High-Speed Networks," Doctoral Dissertation, Syracuse University, 1996.
[Dosh92] B. T. Doshi and P. K. Johri, "Communication Protocols for High Speed Packet Networks," Computer Networks and ISDN Systems, No. 24, pp. 243-273, 1992.
[Hosc93] P. Hoschka, "Towards Tailoring Protocols to Application Specific Requirements," INFOCOM '93, pp. 647-653, 1993.
[Hutc89] N. C. Hutchinson et al., "Tools for Implementing Network Protocols," Software: Practice and Experience, 19(9), pp. 895-916, 1989.
[Wils91] W. Wilson Ho, "Dld: A Dynamic Link/Unlink Editor," Version 3.2.3, 1991. Available by FTP as /pub/gnu/dld-3.2.3.tar.Z from metro.ucc.su.oz.au.
[Prot89] "XTP Protocol Definition Revision 3.4," Protocol Engines, Incorporated, 1900 State Street, Suite D, Santa Barbara, California, 1989.
[Ches87a] G. Chesson, "The Protocol Engine Project," UNIX Review, Vol. 5, No. 9, Sept. 1987.
[Ches87b] G. Chesson, "Protocol Engine Design," USENIX Conference Proceedings, Phoenix, Arizona, June 1987.
[DENA01] S. Denazis et al., Designing IP Router L-Interfaces, IP Sub-working Group, IEEE P1520, Document No. P1520/TS/IP-005.
[CAMP01] A. Campbell et al., A Survey of Programmable Networks, COMET Group, Columbia University.
[CLAV98] K. Calvert et al., Architectural Framework for Active Networks, Active Networks Working Group Draft, July 1998.
High Performance Distributed Computing
Chapter 5
Distributed Operating Systems
Objective of this Chapter
The main objective of this chapter is to review the main design issues of distributed operating systems and to discuss representative distributed operating systems as case studies. Further detailed description and analysis of distributed operating system designs and issues can be found in textbooks that focus on distributed operating systems [Tane95, Sing94, Chow97].
5.1 Introduction
An operating system is a system program that acts as an intermediary between the user of a computing system and the computer hardware. It manages and protects the resources of the computing system while presenting the user with a friendly interface. Many operating systems have been introduced in the past few decades; they can be grouped into single-user, multi-user, network, and distributed operating systems. In a single-user environment, the operating system runs on a single computer with one user, and all the computer resources are allocated to that user (e.g., MS-DOS/Windows). The main functions include handling interrupts to and from the hardware, I/O management, memory management, and file and naming services. In a multi-user environment, the users are most likely connected through terminals, and the operating system is considerably more complex: in addition to all the tasks of a single-user operating system, it is responsible for scheduling processes from different users in a fair manner, Inter-Process Communication (IPC), and resource sharing. A Network Operating System (NOS) resides on a number of interconnected computers and allows users to easily communicate with any other computer or user in the system. The important factor here is that the user is aware of which machine he or she is working on. The network operating system can be viewed as additional system software that supports processing user requests and running applications on remote machines with different operating systems. A Distributed Operating System (DOS) runs on a collection of computers that are interconnected by a computer network, and it looks to its users like an ordinary centralized operating system [Tane85]. The main focus here is on providing as many types of transparency (access, location, name, control, data, execution, migration, performance, and fault tolerance) as possible. In a distributed operating system environment, the operating system gives the illusion of a single time-shared computing system. Achieving this single system image is extremely difficult because of many factors, such as the lack of a global clock and of a system-wide view of resource state, the geographic distribution of resources, and the asynchronous and autonomous interactions among resources, to name a few.
What is a Network Operating System?
In a network operating system environment, as defined previously, the users are aware of the existence of multiple computers, and can log in to remote machines and copy files from one machine to another. Each computer runs its own local operating system and has its own users [Tane87]. Basically, the NOS is a traditional operating system with some kind of network interface and some utility programs for remote login and remote file access.
The main functions are provided by the local operating systems, and the NOS is called by the local operating system to access the network and its resources. The major problem with the NOS approach is that it is "force fit" over the local environment and must provide access transparency to many varied local services and resources [Fort85]. The main functions to be provided by a NOS are [Fort85]:
• Access to network resources
• Proper security for the system resources
• Some degree of network transparency
• Control of the costs of network usage
• A reliable service to all users
To implement these functions, the NOS provides a set of low-level primitives that can be classified into four types:
• User communication primitives, commonly referred to as mail facilities, which enable communication between users and/or processes
• Job migration primitives, which allow processes or load to be distributed in order to balance the system load and improve performance
• Data migration primitives, which provide the capability to move data and programs reliably from one machine to another
• System control primitives, which provide the top level of control for the entire network system; they are responsible for knowing the configuration of the network as well as for reconfiguring and re-initializing the entire system in the event of a failure
What is a Distributed Operating System (DOS)?
The basic goal of a distributed operating system is to provide full transparency by hiding the fact that there are multiple computers, and thus to unify the different computers into a single integrated computing and storage environment [Dasg91]. This means that a DOS must hide the distributed nature of the system from users and programmers. Designing a DOS that provides all forms of transparency (access, location, concurrency, replication, failure, migration, performance, and scaling transparency) is a challenging research problem, and most current systems support only a subset of these forms. The main property that distinguishes a distributed system from a traditional centralized system is the lack of an up-to-date global state and clock. It is not possible to have instantaneous information about the global state of the system because there is no physical shared memory in which that state can be represented. This adds a new dimension to the design of operating system algorithms, which must obtain global state information using only local information and approximate information about the states of remote computers. Because of these difficulties, most of the operating systems that control current distributed systems are just extensions of conventional operating systems that allow transparent file access and sharing (e.g., the SUN Network File System (NFS)). Because these solutions are only short term, a more elegant design is preferable in order to fully exploit the potential benefits of distributed computing systems. A distributed operating system built from scratch does not encounter the problem of force-fitting itself onto an existing system, as is usually the case with a NOS. The general structure of a DOS is shown in Figure 5.1.
A DOS distributes its software evenly over all the computers in the system, and its responsibilities can be outlined as follows [Fort85]:
• Global Inter-Process Communication (IPC)
• Global resource monitoring and allocation
• Global job or process scheduling
• Global command language
• Global error detection and recovery
• Global deadlock detection
• Global and local memory management
• Global and local I/O management
• Global access control and authentication
• Global debugging tools
• Global synchronization primitives
• Global exception handling
Figure 5.1. A general structure of a DOS: global managers for user commands, memory, CPU, files, and I/O, each mapped one-to-many onto the corresponding local managers.
DOS vs. NOS: A network operating system can be viewed as an extension of an existing local operating system, whereas a distributed operating system appears to its users as a traditional uniprocessor time-shared operating system even though it is actually composed of multiple computers [Tane92]. Transparency is the main feature of a DOS that creates the illusion of a centralized time-shared computing system. Furthermore, a DOS is more complex than a NOS; it requires more than just adding a little code to a regular operating system. One of the factors that increases the complexity of a DOS is the parallel execution of programs on multiple processors, which greatly increases the complexity of the operating system's process scheduling algorithms.
5.2 Distributed Operating System Models
There are two main models for developing a DOS: the object-based model and the message-based model (which some authors refer to as the process-based model). The object-based model views the entire system and its resources in terms of objects. These objects consist of a type representation and a set of methods, or operations, that can be performed on them. An active object is a process, and a passive object is referred to as data. In this environment, for processes to perform operations in the system, they must hold capabilities for the objects to be invoked, and the management of these capabilities is the responsibility of the DOS. For example, a file can be viewed as an object with methods, such as read and write operations, operating on that file. For a process to use this object, it must own or have the capability to access the object and run some or all of its methods. Object-based distributed operating systems have been built either on top of an existing operating system or from scratch.
The message-based model relies, as its name implies, on Inter-Process Communication (IPC) protocols to implement all the functions provided by the DOS. Within each node of the system, a message passing kernel is placed to support both local and remote communication, and messages between processes are used to synchronize and control all their activities [Fort85]. System services are requested via message passing, whereas in traditional non-distributed systems like UNIX this is done via procedure calls; for example, messages can be used to invoke semaphores that implement mutual exclusion. Message-based operating systems are attractive because the policies used to implement the main distributed operating system services (e.g., memory management, file service, CPU management, I/O management) are independent of the underlying Inter-Process Communication mechanism [Dasg91]. The sketch below illustrates how a service is requested through message passing.
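As a concrete and entirely hypothetical illustration of the message-based model, the Python sketch below requests a file-read service by sending a message to a server process and blocking for the reply; the message fields and queue names are our own, not those of any particular distributed operating system.

import queue, threading

# One request queue per server process; in a real kernel these would be
# message ports managed by the message-passing kernel on each node.
file_server_q, reply_q = queue.Queue(), queue.Queue()

def file_server():
    """Server loop: every service is invoked by a message, never by a procedure call."""
    while True:
        msg = file_server_q.get()
        if msg["op"] == "read":
            data = f"<contents of {msg['path']}>"      # placeholder service work
            reply_q.put({"status": "ok", "data": data})

def read_file(path):
    """Client stub: send a request message and block for the reply."""
    file_server_q.put({"op": "read", "path": path})
    return reply_q.get()                               # blocking receive

threading.Thread(target=file_server, daemon=True).start()
print(read_file("/etc/motd")["status"])                # prints 'ok'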
The message-based model relies, as its name implies, on Inter-Process Communication (IPC) protocols to implement all the functions provided by the DOS. Within each node in the system, a message-passing kernel is placed to support both local and remote communications. Messages between processes are used to synchronize and control all of their activities [fort85]. System services are requested via message passing, whereas in traditional non-DOS systems such as UNIX this is done via procedure calls. For example, these messages can be used to invoke semaphores that implement mutual exclusion synchronization. Message-based operating systems are attractive because the policies used to implement the main distributed operating system services (e.g., memory management, file service, CPU management, I/O management) are independent of the Inter-Process Communication mechanism [dasg91].
5.3 Distributed Operating System Implementation Structure
The implementation structure of an operating system indicates how the various components of a DOS are organized. The structures used to implement a DOS include the monolithic kernel, the micro-kernel, and the object-oriented approach.
5.3.1 Monolithic Structure
In this structure, the kernel is one large module that contains all the services offered by the operating system, as shown in Figure 5.2. Services are requested by storing specific parameters in well-defined areas, e.g., in registers or on the stack, and then executing a special trap instruction. This transfers control from user space to operating system space. The service is then executed in operating system mode until completion, after which control is returned to the user process. This kind of structuring is not well suited to the design of distributed operating systems because there is a wide range of computer types and resources (e.g., diskless workstations, compute servers, multi-processor systems, file servers, name servers, database servers); in such an environment, each computer or resource is suited to some special task, so it is not efficient to load the same operating system on each of them. A print server, for example, will heavily use the DOS services related to files and I/O printing functions. Consequently, loading the entire DOS software onto each computer or resource in the system wastes a critical resource (e.g., memory) unnecessarily and thus degrades the overall performance of the system.
Figure 5.2. A Monolithic Operating System Structure. (The figure shows dynamically loaded server programs S1 through S4 layered over a single monolithic kernel containing the kernel code and data.)
5.3.2 Microkernel Structure
In this approach, the system is structured as a collection of processes that are largely independent of each other, as shown in Figure 5.3. The heart of the operating system, the micro-kernel, resides on each node and provides the essential functions or services required by all other services and applications. The main micro-kernel services include communication, process management, and memory management. Thus the micro-kernel software can be made small and efficient, with fewer errors than a monolithic kernel. The micro-kernel structure supports the design technique of separating operating system policy from mechanism. The operating system policy can be changed independently, without affecting the underlying mechanism (e.g., the policy for granting rights to a user might change depending on needs, whereas the underlying mechanism implementing access control remains unaffected).
Figure 5.3. A Micro-kernel Operating System Structure. (The figure shows the micro-kernel with its kernel code and data on each node, with server programs S1 through S4 dynamically loaded on top of it.)
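The policy/mechanism separation just described can be illustrated with a short Python sketch (illustrative only; the policy functions and resource names are invented for the example): the access-check mechanism stays fixed while the policy that decides who gets which rights is swapped out at run time.

    # Mechanism: a fixed access-control check that consults whatever policy is installed.
    class AccessControl:
        def __init__(self, policy):
            self.policy = policy            # policy is a callable: (user, resource) -> set of rights

        def set_policy(self, policy):
            # The policy can be replaced; the checking mechanism below never changes.
            self.policy = policy

        def check(self, user, resource, right):
            return right in self.policy(user, resource)

    # Policy 1: owners get read/write, everyone else is read-only (assumed example policy).
    def owner_policy(user, resource):
        return {"read", "write"} if resource.startswith(user + "/") else {"read"}

    # Policy 2: a stricter policy that denies everything outside the user's own directory.
    def strict_policy(user, resource):
        return {"read", "write"} if resource.startswith(user + "/") else set()

    ac = AccessControl(owner_policy)
    print(ac.check("alice", "bob/notes.txt", "read"))    # True under the relaxed policy
    ac.set_policy(strict_policy)
    print(ac.check("alice", "bob/notes.txt", "read"))    # False after the policy change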
5.3.3 Object Oriented Structure
In this structure, the services and functions are implemented as objects rather than as independent processes, as they are in the micro-kernel approach. Each object encapsulates a data structure and defines a set of operations that can be carried out on that data structure. Each object is given a type that defines the properties of the object. The encapsulated data structure can be accessed and modified only by performing the defined operations, referred to as methods, on that object. As in the micro-kernel approach, the services can be built and maintained independently of other services, and this structure also supports the separation of policies from implementation mechanisms.
5.4 Distributed Operating System Design Issues
A distributed operating system designer must address additional issues such as inter-process communication, resource management, naming service, file service, protection and security, and fault tolerance. In this section, we discuss the issues and options available in designing these services.
5.4.1 Inter-Process Communications (IPC)
In a distributed system environment, processes need to communicate without the use of shared memory. Message passing and Remote Procedure Call (RPC) are widely used to provide Inter-Process Communication (IPC) services in distributed systems.
Message Passing
The typical implementation of the message passing model in distributed operating systems is the client-server model [tane85]. In this model, the requesting client specifies the desired server or service and initiates the transmission of the request across the network. The server receives the message, responds with the status of the request, and provides an appropriate buffer to accept the request message. The server then performs the requested service and returns the results in a reply message. Three fundamental design issues must be addressed in designing an IPC scheme:
1) Are requests blocking or non-blocking?
2) Is the communication reliable or unreliable?
3) Are messages buffered or unbuffered?
These issues are handled by system primitives. The selection of primitives depends mainly on the types of transparency to be supported by the distributed system and on the targeted applications. With non-blocking send and receive primitives, the send primitive returns control to the initiating process once the request has been queued in an outgoing message queue. When the message has actually been transmitted, the sending process is interrupted, indicating that the buffer used to store the message is available again. At the receive side, the receiving process indicates that it is ready to receive the message and then goes to sleep; it suspends processing until awakened by the arriving message. This method is more flexible than the blocking method: message traffic is asynchronous, and messages may be queued for a long time. However, it complicates programming and debugging. If a problem occurs, the asynchronous behavior makes locating and fixing the problem very difficult and time consuming. For this reason, many designers use blocking send and receive primitives to implement IPC in distributed operating systems. A blocking send returns control to the sending process only after the message is sent. An unreliable blocking send returns control to the sending process immediately after the message is sent, providing no guarantee that the request message has reached its destination successfully. A reliable blocking send returns control to the sending process only after an acknowledgement message has been received from the receiving process. Similar to the blocking send, the blocking receive does not return control until the message has been received and placed in the receive buffer.
If a reliable receive is implemented, the receiving process will automatically acknowledge the receipt of the message. Most distributed operating systems support both blocking and non-blocking send and receive primitives. The common method is to use a non-blocking send with a blocking receive. In this case, you assume the transmission of the message is reliable and there is no need to block the sending process once the message is copied into the transmitter buffer. The receiver, on the other hand, needs to get the message to perform its work [chow97]. However, there are situations where the receiver is handling requests from several senders. It is therefore desirable to use non-blocking receive in such cases. Another important design issue is whether or not to provide message buffering. If no buffering of messages is used, both the sender and the receiver must be synchronized before the message is sent. This approach requires establishing the connection explicitly in order to eliminate the need to buffer the message and thus the message size can be large. When a buffer is used, the operating system allows the sender to store the message in the kernel buffer. Using non-blocking sends, it is possible to have multiple outstanding send messages. In general it is a good practice to minimize the number of communication primitives allowed in the distributed operating system design. Most distributed operating systems adopt a request/response communication model that requires only three communication primitives (e.g., Ameoba). The first primitive is used to send a request to a server and to wait on the reply from the server. The second primitive is to receive the requests that are sent by the clients. The third primitive is used by the server to send the reply message to the client after the request has been processed. Remote Procedure Call The use of remote procedure calls for inter-process communication is widely accepted. The use of procedures to transfer control from one procedure to another is simple and well understood in high-level language programming. This mechanism has been extended to be used in distributed operating systems by providing that the called procedure will, in 130 High Performance Distributed Computing most cases, activate and execute on another computer. The transition from a local process to a remote procedure is performed transparently to the user and is the responsibility of the operating system. Although this method appears simple, a few important issues must be addressed. The first one is the passing of parameters. Parameters could be passed either by value or reference. Passing parameters by value is easy and involves including the parameters in the RPC message. Passing parameters by reference is more complicated as we need unique global pointers to locate the RPC parameters. Passing parameters is even more complicated if heterogeneous computers are used as each system may use different data representations. One solution is to convert the parameters and data to a standard data representation (XDR) before the message is sent. Generally speaking, the main design issues of an RPC facility can be summarized in the following points [Pata91]: • • • • • • RPC Semantics: This defines precisely the semantics of a call in the presence of computer and communication failures Parameter Passing: This describes the techniques used to pass parameters between caller and callee procedures. 
This defines the semantics of address-containing arguments in the absence of shared address space (if passing parameters by reference is used). Language Integration: This defines how the remote procedure calls will be integrated into existing or future programming systems. Binding: This defines how a caller procedure determines the location and the identity of the callee. Transfer Protocol: This defines the type of communication protocol to be used to transfer control and data between the caller and the callee procedures. Security and Integrity: This defines how the RPC facility will provide data integrity and security in an open communication network. 5.4.2 Naming Service Distributed system components include a wide range of objects or resources (logical and physical) such as processes, files, directories, mailboxes, processors, and I/O devices. Each of these objects or resources must have a unique name that can be presented to the operating system when that object or resource is needed. [Tane85]. The naming service can be viewed as a mapping function between two domains that can be completed in one or more steps. This mapping function does not have to be unique. For example, if a user requests a service and several servers can perform the given service, then each server might possess the same name. Naming is handled in a centralized operating system by maintaining a table or a database that provides the name to object conversion or mapping. Distributed operating systems may implement the naming service using either a centralized, hierarchical or distributed approach. Centralized Name Server: In this approach, a single name server accepts names in one domain and maps them to another name understood by the system. In a UNIX environment, this represents the mapping of an ASCII file name into its I-node number. A server or a process must first register its name in the database to publicly advertise the availability of the service or process. This approach is simple and effective for small systems. If the system is very large either in the number of available objects or in the 131 High Performance Distributed Computing physical distance covered then the single system database is not sufficient. This approach is not practical because it allows single-point failure, making the whole system unavailable when the name server crashes. Hierarchical Name Server: Another method of implementing the name server is to divide the system into several logical domains. Logical domains maintain their own mapping table. The name servers can be organized using a global hierarchical naming tree. This approach is similar to the mapping technique used in a telephone network in which a country code is followed by an area code followed by an exchange code all preceding the users phone number. An object can be located or found by locating which domain or sub-domain it resides in. The domain or sub-domain will perform the local mapping and locate the requested object. Another method is to locate objects by using a set of pairs that will point to either the physical location of the object or the next name that might contain the physical location of the object. In this manner, several mapping tables might be searched to locate an object. Distributed Name Server: A final method that may be used to implement the name server is to allow each computer or resource in the distributed system to implement the name service function. Specifically, each machine on the network will be responsible for managing its own names. 
Whenever a given object is to be found, the machine requesting a name will broadcast that name on the network if it is not in its mapping table. Each machine on the network will then search its local mapping table. If a match exists, a reply is sent to the requesting machine. If no match is found, no reply is sent by that machine. 5.4.3 Resource Management Resource management concerns making both local and remote resources available to a user in an efficient and transparent manner. The location of the resources is hidden from the user and remote and local resources are accessed in the same manner. The major difference in resource management in a distributed system as compared to that of a centralized system is that global state information is not readily available and is difficult to obtain and maintain [tane85]. In a centralized computing system, the scheduler has full knowledge of the processes running in the system and global system status is stored in a large central database. The resource manager can optimize process scheduling to meet the system objective. In a distributed system, there is no centralized table or global state. If it did exist, it would be difficult or impossible to maintain, because of the lack of accurate system status and load. Process scheduling in distributed systems is a challenging task compared to that of a centralized computing system. In spite of the resource management difficulties in distributed systems, a resource manager of some type must exist to allocate processors to users, schedule processes to processors, balance the load of the system processors, and detect deadlock situations whenever they occur in the system. The main tasks of resource management include processor allocation, process scheduling, load balancing, and scheduling. Processor Allocation 132 High Performance Distributed Computing The processing units are the most important resource to be managed. In a processor pool model a large number of processors are available for use by any process. When processes are submitted to the pool, the following information is required: process type, CPU and I/O requirements, and memory requirement. The resource manager then determines the number of processors to be allocated to the process. One can organize the system processors into a logical hierarchy that consists of multiple masters, each master controlling a cluster of worker processors [tane85]. In this hierarchy, the master keeps track of how many of its processors are busy. When a job is submitted, the masters may collectively decide how the process may be split up among the workers. When the number of masters becomes large, a node or a committee of nodes is designated to oversee the processor allocation for the entire system. This eliminates the possibility of a single point failure. When a master fails, one of the worker processors is designated or promoted to perform the responsibilities of the failed master. Using a logical hierarchy, the next step is to schedule the processes among the processors. Scheduling Process scheduling is important in multiprogramming environment to maximize the utilization of system resources and improve performance. Process scheduling is extremely important in distributed systems because the possibility of idle processors. Distributed applications that consist of several concurrent tasks need to interact and synchronize their activities during execution. 
In such cases, the process or task scheduler needs to allocate the concurrent tasks to run on different processors at the same time. For example, consider tasks T1 and T2 (Figure 5.1) that communicate with one another and are loaded on separate processors.
Figure 5.1 An Example of Process Scheduling. (The figure shows two communicating tasks, T1 and T2, placed on separate processors and scheduled over time slices T through 5T, with an X marking the slices in which each task is dispatched.)
In this example, T1 and T2 begin their execution at time slots T and 2T, respectively. In this case, task T2 will not receive the message sent by task T1 during time slice T until the next time slice, 2T, when it begins execution. The best timing for the entire request-reply scenario in this case would be 2T. If, however, these two tasks are scheduled to run during time slice 4T on both processors, the best timing for the request-reply scenario is one T instead of 2T. This type of process scheduling across the distributed system processors, which takes into consideration the dependencies among processes, is a challenging research problem. Another approach to process scheduling is to group processes based on their communication requirements. The scheduling algorithm then ensures that processes belonging to one group are context switched and executed simultaneously. This requires efficient synchronization mechanisms to notify the processors when tasks need to be context switched.
Load Balancing and Load Scheduling
Load balancing improves the performance of a distributed system by moving data, computation, or processes so that heavily loaded processors can offload some of their work to lightly loaded ones. Load balancing has a more stringent requirement than load scheduling because it strives to keep the loads on all computers roughly the same (balance the load). We will not distinguish between these two terms in our analysis of this issue. Load balancing seems intuitive and is almost taken for granted when first designing a distributed system. However, load balancing and process migration incur high overhead and should be performed only if the performance gain is larger than the required communication overhead. Load balancing can be achieved by migrating data, computation, or entire processes.
Data Migration: In data migration, the distributed operating system brings the required data to the computation site. The data in question may be either the contents of a file or a portion of physical memory. If file access is requested, the distributed file system is called to transfer the contents of the file. If the request is for the contents of physical memory, the data is moved using either message passing or distributed shared memory primitives.
Computation Migration: In computation migration, the computation migrates to another location. A good example of computation migration is the RPC mechanism, where the computations of the requested procedure are performed on a remote computer. There are several cases in which computation migration is more efficient or provides more security. For example, a routine to find the size of a file should be executed at the computer where the file is stored rather than transferring the file to the executing site. Similarly, it is safer to execute routines that manipulate critical system data structures at the site where these data structures reside rather than transmitting them over a network where an intruder may tap the message.
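The file-size example above can be sketched in a few lines of Python (an illustration only; the request format and the simulated storage node are invented for the example): instead of transferring the whole file to the caller, the caller ships a small request to the node that stores the file and receives only the answer.

    class StorageNode:
        """Simulates the node that physically stores the files."""
        def __init__(self, files):
            self.files = files                      # name -> contents (bytes)

        def handle(self, request):
            # Computation migration: the size is computed here, where the data lives,
            # and only a small reply crosses the "network".
            if request["op"] == "size":
                return {"size": len(self.files[request["name"]])}
            if request["op"] == "fetch":
                return {"data": self.files[request["name"]]}

    node = StorageNode({"big.log": b"x" * 10_000_000})

    # Data migration: move about 10 MB to the requesting site just to measure it.
    reply = node.handle({"op": "fetch", "name": "big.log"})
    print(len(reply["data"]))                       # bytes shipped: 10,000,000

    # Computation migration: ship a tiny request, receive a tiny reply.
    reply = node.handle({"op": "size", "name": "big.log"})
    print(reply["size"])                            # same answer, a few bytes on the wire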
Process Migration: In process migration, entire processes are transferred from one computer and executed on another computer. Process migration leads to better utilization of resources especially when the process is moved from a heavily loaded computer to a lightly loaded one. A process may also be relocated due to the unavailability of a component critical for its execution (e.g. a math coprocessor). 5.4.4 File Service The file system is the part of the operating system that is responsible for storing and retrieving stored data. The file system is decomposed into three important functions: disk service, file service, and directory service [tane85]. The file system characteristics depend heavily on how these functions are implemented. One extreme approach is to implement all the functions as one program running on one computer. In this case, the 134 High Performance Distributed Computing file system is efficient but inflexible. The other extreme is to implement each function independently so we can support different disk and file types. However, this approach is inefficient because these modules communicate with each other using inter-process communication services. A common file system for a distributed system is typically provided by file servers. File servers are high performance machines with high storage capacity that offer file system service to other machines (e.g., diskless workstations). The advantages of using common file servers include lower system costs (one server can serve many computers), data sharing, and simplified administration of the file system. The file service can be distributed across several servers to provide a distributed file service (DFS). Ideally, DFS should look to users as a conventional unified file system. The multiplicity of data and the geographic dispersion of data should be transparent to the users. However, DFS introduces new problems that distributed operating systems must address such as concurrent access, transparency, and file service availability. Concurrency control algorithms aim at making parallel access to a shared file system equivalent to a sequential access of that file system. Most of concurrency control algorithms used in database research have also been proposed to solve the concurrency problem in file systems. Various degrees of transparency such as location and access transparency are desirable in a DFS. By providing location transparency we hide the physical location of the file from the user. The file name should be location independent. Additionally, users should be able to access remote files using the same set of operations that are used to access local files (access transparency). Other important issues that a distributed file system should address include performance, availability, fault tolerance, and security. 5.4.5 Fault Tolerance Distributed systems are potentially more fault tolerant than a non-distributed system because of inherent redundancy in resources (e.g., processors, I/O systems, network devices). Fault tolerance enables a computing system to successfully continue operations in-spite of system component failures (hardware or software resources). Fault intolerant systems crash if any system failure occurs. There are two approaches of making a distributed system fault tolerance. One is based on redundancy and the second is based on atomic transaction [tane85]. Redundancy Techniques: Redundancy is in general the most widely used technique to implement fault-tolerance. 
The inherent redundancy in a distributed system makes it potentially more fault tolerant than a non-distributed system. Fault tolerance techniques involve detecting faults once they occur, locating and isolating the faulty components, and then recovering from the faults. In general, fault detection and recovery are the most important and the most difficult to achieve [anan91]. System failures may be either hardware or software errors. Hardware errors are usually Boolean in nature and cause software using the faulty hardware to stop working. Software errors consist of programming errors and specification errors. Specification errors occur when a software system successfully meets its specification but fails because the program was incorrectly specified. Programming errors occur when the program fails to meet its specification because of human errors in programming or in loading the program. Most research that addresses software errors has focused on programming errors because specification errors are difficult to characterize and detect. One method to tolerate hardware failures is to replicate process execution on more than one processor. Consequently, the hardware failure of one processor does not eliminate the functionality of the process, because its mirror process continues its execution on a fault-free redundant processor. If the replicated processes signal the failure of one process, the system then allocates the failed process to another processor in order to maintain the same level of fault tolerance. However, pure process redundancy cannot tolerate programming errors: if a process fails due to a programming error, the other replicated processes will produce the same error. Version redundancy has been proposed to eliminate programming errors. Typically, version redundancy is used in applications in which the operation of the system is critical and debugging and maintenance are not possible. For example, in deep-space exploration craft the system is supposed to work without error for long periods of time. Formal specifications are given to more than one programmer, each of whom independently develops code to satisfy the formal specification of the application. During application execution the redundant versions of the application run concurrently and voting is used to mask any errors. If one version experiences a programming error, it is very unlikely that the other versions will have the same error; therefore, the programming error is detected and tolerated.
Atomic Transactions: An atomic transaction is one that runs to completion or not at all. If an atomic transaction fails, the system is restored to its initial state. To achieve atomic operations the system must rely on reliable services such as careful read and write operations, stable storage, and a stable processor [tane85].
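A minimal Python sketch of this all-or-nothing property follows (illustrative only; the account data and the failure flag are invented for the example, and the building blocks that make atomicity practical, careful operations and stable storage, are described next). Changes are applied to a private copy and installed in a single step, so a failure part-way through leaves the original state untouched.

    import copy

    class TransactionAborted(Exception):
        pass

    def atomic_transfer(accounts, src, dst, amount, fail_midway=False):
        """Either both the debit and the credit take effect, or neither does."""
        working = copy.deepcopy(accounts)      # all updates go to a private copy
        working[src] -= amount
        if fail_midway:                        # simulate a crash between the two updates
            raise TransactionAborted("failure before commit; no changes are visible")
        working[dst] += amount
        accounts.clear()                       # "commit": install the new state in one step
        accounts.update(working)

    accounts = {"A": 100, "B": 50}
    try:
        atomic_transfer(accounts, "A", "B", 30, fail_midway=True)
    except TransactionAborted:
        pass
    print(accounts)                            # {'A': 100, 'B': 50}: initial state preserved

    atomic_transfer(accounts, "A", "B", 30)
    print(accounts)                            # {'A': 70, 'B': 80}: transaction ran to completion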
Careful Disk Operations: At the disk level, the common operations WRITE and READ simply store a new block of data and retrieve a previously stored block of data, respectively. Built on these common primitives, the data abstractions CAREFUL_WRITE and CAREFUL_READ can be implemented. When a CAREFUL_WRITE operation calls the WRITE service of the disk, a block of data is written to the disk and then a READ service is called to immediately read back the written block to ensure that it was not written to a bad sector on the disk. If, after a predetermined number of attempts, the READ continues to fail, the disk block is declared bad. Even after being written correctly to the disk, a block can go bad; this can be detected by checking the parity check field of any block read during the CAREFUL_READ operation.
Stable Storage: On top of the CAREFUL_WRITE and CAREFUL_READ abstractions, the idea of stable storage can be implemented. The stable storage operations mirror all data on more than one disk so as to minimize the amount of data lost in the event of a disk failure. When an application attempts a write, the stable storage abstraction first attempts a CAREFUL_WRITE to what is known as the primary disk. If the operation completes without error, a CAREFUL_WRITE is attempted on the secondary disk. In the event of a crash, the blocks on both the primary and secondary disks are compared. If the corresponding disk blocks are the same and GOOD, nothing further needs to be done with these blocks. On the other hand, if one is BAD and the other is GOOD, the BAD block is replaced by the data from the GOOD block on the other disk. If the disk blocks are both GOOD but the data is not identical, the data from the primary disk is written over the data on the secondary disk. The reason for the latter is that the crash must have occurred between the CAREFUL_WRITE to the primary disk and the CAREFUL_WRITE to the secondary disk.
Stable Processor: Using stable storage, processes may checkpoint themselves periodically. In the event of a processor failure, the process running on the faulty processor can restore its last checkpointed state from stable storage. Given the existence of stable storage and fault-tolerant processors, atomic transactions can be implemented. When a process wishes to make changes to a shared database, the changes are recorded in stable storage on an intention list. When all of the changes have been made, the process issues a commit request, and the intention list is then written to memory. Using this method, all the intentions of a process are stored in stable storage and thus may be recovered by simply examining the contents of the outstanding intention lists in stable storage when a processor is brought back on line.
5.4.6 Security and Protection Service
The distributed operating system is responsible for the security and protection of the overall system and its resources. Security and protection are necessary to avoid unintentional or malicious attempts to harm the integrity of the system. A wild pointer in a program may unintentionally overwrite part of a critical data structure, and any system may face a threat from a misguided user attempting to break into the system. Two issues that must be dealt with in the design of security measures for a system are authentication and authorization.
Authentication: Authentication is making sure that an entity is what it claims to be.
For example, a person knowing another one's password on the system can log in as that person. The operating system has no way of knowing whether the right person or an imposter has logged in. Password protection is sufficient in general, but high security systems might resort to physical identification, voice identification, cross examination, or user profiling to authenticate a user. Authentication is especially important in distributed systems as a person may tap the network and pose as a client or a server. Encryption of transmitted data is a technique used to deter such attempts. 137 High Performance Distributed Computing Authorization: On the other hand, authorization is granting a process the right to perform an action or access a resource based on the privileges it has. Privileges of a user may be expressed either as an Access Control List (ACLs) or as a Capabilities List (Clist). Only a process having an access right or a capability for an object may access that object. The ACLs and C-lists are protected against illegal tampering using a combination of hardware and software techniques. In object based distributed operating systems, the object name can be in the form of a capability. The capability is a data structure that uniquely identifies an object in the system as well as the object manager, the set of operations that can be performed on that object, and provide the required information to control and protect the access to that object. 5.4.7 Other Services Various other services have been implemented by distributed operating systems such as time service, gateway service, print service, mail service, and boot service just to name a few. New services can also be considered depending on the user requirements and system configuration. 5.5 Distributed Operating System Case Studies In general, distributed operating systems can be classified into two categories: message passing and object-oriented distributed operating systems. Examples of message passingdistributed systems include LOCUS, MACH, V, and Sprite. Examples of object-oriented distributed systems include Amoeba, Clouds, Alpha, Eden, X-kernel, WebOS, 2K and the WOS. The operating systems to be discussed in this section include LOCUS, Amoeba, V, MACH, X-kernel, WebOS, 2K and WOS. These distributed operating systems were selected as a cross-section of current distributed operating systems with certain interesting implementation aspects. Each distributed operating system is briefly overviewed by highlighting its goals, advantages, system description, and implementation. Additionally, we discuss some of the design issues that characterize these distributed operating systems. 5.5.1 LOCUS Distributed Operating System Goals and Advantages The Locus distributed operating system was developed at UCLA in the early 1980’s [walk83]. Locus was developed to provide a distributed and highly reliable version of a Unix compatible operating system [borg92]. This system supports a high degree of network transparency, namely location, concurrency, transparent file replication system, and exhaustive fault tolerance system. The main advantages of Locus are its reliability and the support of automatic replication of stored data. This degree of replication is determined dynamically by the user. Even when the network is partitioned, the Locus system remains operational because of the use of a majority consensus system. 
Each 138 High Performance Distributed Computing partitioned sub-network maintains connectivity and consistency among its members using special protocols (partition, merge, and synchronization protocols). System Description Overview Locus is a message-based distributed operating system implemented as a monolithic operating system. Locus is a modification and extension of the Unix operating system and load modules of Unix systems are executable without recompiling [borg92]. The communication between Locus kernels is based on point-to-point connections that are implemented using special communication protocols. Locus supports a single global virtual address space. File replication is achieved using three physical containers for each logical file group and replica access is controlled using a synchronization protocol. Locus supports remote execution and allows pipes and signal primitives to work across a network. Locus provides robust fault tolerance capabilities that enable the system to continue operation during network or node failures and has algorithms to make the system consistent by merging the states from different partitions. Resource Management LOCUS provides transparent support for remote processes by providing facilities to create a process on a remote computer, initialize the process appropriately, support execution of processes on a cluster of computers, support the inter-process communication as if it were on a single computer, and support error handling. The site selection mechanism allows for executing software as either a local or remote site with no software change required. The user of a process determines its execution location through information associated with the calling process or shell commands. The Unix commands fork and exec are used to implement local and remote processes. For increased performance a run call has been added which has the same effect as a fork followed by an exec. Run avoids the need to copy the parent process image and includes parameterization to allow the user to setup the environment for the new process, regardless whether it is local or remote. In Unix, pipes and signal mechanisms rely on shared memory for their implementations. In order for Locus to maintain the semantics of these primitives when they are implemented and run on networked computers, it is required to support shared memory. Locus implements the shared memory mechanism by using tokens. File System The Locus file system presents a fully transparent single naming hierarchy to both users and applications. The Locus file system is viewed as a superset of the Unix file system. It has extended the Unix system in three areas. First, the single tree structure covers all the system objects on all the computers in the system. This means that you can not determine the location of a file from its name. Location transparency allows data and programs to be accessed or executed from any location in the system with the same set of commands regardless whether they are local or remote. Second, Locus supports file replication in a transparent manner, The Locus system is responsible for keeping all 139 High Performance Distributed Computing copies up to date, assuring all requests are served by the most recent available version, and supporting operations on partitioned networks. Third, the Locus file system must support different forms of error and failure management. 
One unique feature in Locus file system is its support of file replication and mechanisms provided to achieve transparent access, read, and write operations on the replicated files. The need for file replication stems from several sources. First, it improves the user access performance. This is most evident when reading files. However, updates also show improvement even with the associated cost of maintaining consistency across the network. Second, the existence of a local copy significantly improves file access performance as remote file access is expensive even on high speed networks. Furthermore, replication is essential in supporting various characteristics of a distributed system such as fault tolerance and system data structures. However, file replication does come with a cost. When a file is updated and many copies of that file exist in the system, the system must make ensure that all local copies are current. Additionally, the system must determine which copy to use when that file needs to be accessed. If the replicated files are not consistent due to network partitions and hardware failures, a version number is used to make ensure a process accesses the most up-to-date copy of a file. A good data structure to be replicated is the local file directory since it has a high ratio of read to write accesses. Locus defines three types that a site may take during file access ( Figure 5.2): • Using Site (US). This is the site from which the file access originates and to which the file pages are sent. • Storage Site (SS). This is the site at which a copy of the requested file is stored. • Current Synchronization Site (CSS). This site forces a global synchronization policy for the file and selects SS’s for each open request. The CSS must know which sites store the file and what the most current version of the file is. The CSS is determined by examining the logical mount table. Any individual site can operate in any combination of these logical sites. Open(1) US CSS Response(4) Response(3) Be SS?(2) SS Figure 5.2 LOCUS Logical File Sites. File access is achieved by several primary calls: open, create, read, write, commit, close, and unlink. The following sequence of operations is required to access any file: 1. US makes an OPEN request to the CSS site 2. CSS makes a request for a storage site SS 140 High Performance Distributed Computing 3. SS site responds to the CSS request message 4. CSS selects a storage site (SS) and then informs the using site (US) After a file has been opened, the user can read the file by issuing a read call. These read requests are serviced using kernel buffers and are distinguished as either local or remote. In the case of a local file access, the requested data from external storage is stored in the operating system buffer and then copied into the address space of the user buffer. If the request was for a remote file, the operating system at the using site (US) allocates a buffer and queues a remote request to be sent over the network to the SS. A network read has the following sequence: 1. US requests a page in a file 2. SS responds by providing the requested page The close system call uses an opposite sequence of operations. The write request follows the same steps taken for open request once we replace the read requests with write requests to the selected storage site. It is important to note that file modifications are done atomically using a shadowing technique. 
That is, the operating system always possesses a complete copy of the original file or a completely changed file, never a partially modified one. Fault Tolerance Locus supports fault tolerance operations based on redundancy. Error detection and handling are integrated with a process description data structure. When one of the interacting processes experiences a failure (e.g., child and parent processes), an error signal is generated to notify the other fault-free process to modify its process data structure. The second important issue that must be addressed is recovery. The basic recovery strategy in Locus is to keep all the copies of a file within a single partition current and consistent. Later, these partitions are merged using merge protocols. Conflicts among partitions are resolved through automatic or manual means. The Locus system supports reconfiguration. This enables the system to reconfigure the network and system resources as necessary in response to faults. The reconfiguration protocols used in Locus assume a fully connected network. Thus for reconfiguration Locus utilizes a partition protocol and a merge protocol. Partitions within Locus are an equivalence class requiring all members to agree on the state of the network. Locus uses an iterative intersection process to find the maximum partition. As partitions are established a merge is executed to join several partitions. Other Services The Locus directory system is used to implement the name service, interprocess communication, and remote device access. Locus has no added security and protection features other than those supported in Unix operating system. 5.5.2 Amoeba Distributed Operating System Goals and Avantages 141 High Performance Distributed Computing The Amoeba Distributed Operating System was developed at the Free University and Center for Mathematics and Computer Science in Amsterdam [mull90]. Amoeba’s main goal was to build a capability based, object-oriented distributed operating system to support efficient and transparent distributed computing across a computer network. The main advantages of Amoeba are its transparency, scalability, and the support of high performance parallel and distributed computing over a computer clusters. System Description Overview The Amoeba system is an object-oriented capability-based distributed operating system. The Amoeba micro-kernel runs on each machine in the system and handles communication, I/O, and low-level memory and process management. Other operating system functions are provided by servers running as user programs. The Amoeba hardware consists of four principle components (Figure 5.3): workstations, the processor pool, specialized servers, and the gateway. The workstations run computation intensive and interactive tasks such as window management software and CAD/CAM applications. The processor pool represents the main computing power in the Amoeba system and is formed from a group of processors and CPUs. This pool can be dynamically allocated as needed to the user and system processes and returned to the pool once their assigned tasks are completed. For instance, this pool can run computation intensive tasks that do not require frequent interaction. The third component consists of specialized servers that run special programs or services such as file servers, database servers, and boot servers. The fourth component consists of gateways that are used to provide communication services to interconnect geographically dispersed Amoeba systems. 
Figure 5.3. Amoeba Distributed Operating System Architecture. (The figure shows workstations and terminals on a LAN together with the processor pool, the specialized servers, and a gateway that connects, over a WAN, to remote resources such as supercomputer servers, multicomputers, and processor arrays.)
The Amoeba software architecture is based on the client/server model. Clients submit their requests to a corresponding server process to perform operations on objects [mull90]. Each object is identified and protected by a 128-bit capability composed of four fields, as shown in Figure 5.4.

    Field:        Service Port | Object Number | Right Field | Check Field
    Size (bits):       48      |      24       |      8      |     48

Figure 5.4 Capability in Amoeba
The service port field is a 48-bit sparse address identifying the server process that manages the object. The object number field is used by the server to identify the object requested by the given capability. The 8-bit right field defines the types of operations that can be performed on the object. The 48-bit check field protects the capability against forging and tampering. The rights in the capability are protected by encrypting them with a random number and storing the result in the check field. A server can check a capability by performing the encryption operation, using the random number stored in the server's tables, and comparing the result with the check field in the provided capability. The Amoeba kernel handles memory and process management, supports processes with multiple threads, and provides inter-process communication. All other functions are implemented as user programs. For example, the directory service is implemented in user space and provides a higher-level naming hierarchy that maintains a mapping of ASCII names onto capabilities.
Inter-Process Communication
All communication in Amoeba follows the request/response model. In this scheme the client makes a request to the server; the server performs the requested operation and then sends a reply to the client. This communication model is built on an efficient remote procedure call system that consists of three basic system calls:
1. get_request(req-header, req-buffer, req-size)
2. do_operation(req-header, req-buffer, req-size, rep-header, rep-buffer, rep-size)
3. send_reply(rep-header, rep-buffer, rep-size)
When a server is ready to accept requests from clients, it executes a get_request system call that forces it to block. In this call, it specifies the port on which it is willing to receive requests. When a request arrives, the server unblocks, performs the work using the request parameters, and sends back the reply using the send_reply system call. The client makes a request by issuing a do_operation. The caller is blocked until a reply is received, at which time the reply parameters are populated and a status is returned. The returned status of a do_operation can be one of the following:
1. The request was delivered and has been executed.
2. The request was not delivered or executed.
3. The status is unknown.
To enable system programmers and users to use a richer set of primitives in their applications, a user-oriented interface has been defined. This has led to the development of the Amoeba Interface Language (AIL), which can be used to describe operations that manipulate objects and supports a multiple-inheritance mechanism. Stub routines are used in Amoeba to hide the marshaling and message passing from the users. The AIL compiler produces stub routines automatically in the C language.
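The request/response interaction built on these three calls can be sketched in Python (a simulation only; the queue-based transport, the message format, and the status values are assumptions made for the illustration, not Amoeba's actual implementation). A server thread blocks in get_request, while the client blocks in do_operation until the reply arrives.

    import queue
    import threading

    request_q = queue.Queue()     # stands in for the port the server listens on
    reply_q = queue.Queue()       # carries replies back to the client

    def get_request():
        return request_q.get()                     # server blocks until a request arrives

    def send_reply(reply):
        reply_q.put(reply)

    def do_operation(request):
        request_q.put(request)                     # client blocks until the reply is back
        return reply_q.get()

    def server():
        while True:
            req = get_request()
            if req["op"] == "shutdown":
                send_reply({"status": "executed"})
                break
            result = req["a"] + req["b"]           # the "requested operation"
            send_reply({"status": "executed", "result": result})

    threading.Thread(target=server, daemon=True).start()

    print(do_operation({"op": "add", "a": 2, "b": 3}))   # {'status': 'executed', 'result': 5}
    print(do_operation({"op": "shutdown"}))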
Resource Management In many applications, processes need a method to create child processes. In Unix a child process is created using the fork primitive. An exact copy of the original process is created. This process runs housekeeping activities and then issues an exec primitive to overwrite its core image with a new program. In a distributed system, this model is not attractive. The idea of first building an exact copy of the process, possibly remote, and then throwing it away again shortly thereafter is inefficient. The Amoeba uses a different strategy. The key concepts are segments and process descriptors. A segment is a contiguous chunk of memory that can contain code or data. Each segment has a capability that permits its holder to perform operations on it, such as reading or writing. A process descriptor is a data structure that provides information about a process. It provides the process state that is either running or stunned. A stunned process is a process being debugged or migrated; that is the process exists but does not execute any instructions. When the process descriptor arrives at the machine where the process will run, the memory server extracts the capabilities for the remote segments and fetches the code and the data segments from where they reside. This is done by using the capabilities to perform READ operations in the usual way; the contents of the segment are copied to it. In this manner, the physical location of the machines involved becomes irrelevant. Once all the segments have been filled in, the process can be constructed and initiated. A capability for the process is then returned to the initiator. This capability can be used to kill the process or it can be passed to a debugger to stun it, read and write its memory, and so on. To migrate a process, it must first be stunned. Once stunned, the kernel sends its state to a handler. The handler is identified using a capability present in the process's state. The handler then passes the process descriptor on to the new host. The new host fetches its memory from the old host through a series of file reads. Then the process is started and the capability returned to the handler. Finally, the handler sends a kill in its reply to the 144 High Performance Distributed Computing old host. Processes trying to communicate with a migrating process get a ``Process Stunned'' reply when it is stunned, and a ``Process Not Here'' reply when the migration is complete. It is the responsibility of the requesting process to find the location of the process it is attempting to contact. File System The Amoeba file system consists of two parts: the file service and the directory service. The file service provides a mechanism for users to store and retrieve files, whereas the directory service provides users with facilities for managing and manipulating capabilities. The file system is implemented using the bullet service. The bullet service does not store files as collections of fixed-size disk blocks. It stores all files contiguously both on disk and in the bullet server's memory. When the bullet server is booted, the entire i-node table is read into memory in a single disk operation and kept there while the server is running. When a file operation is requested, the object number field in the capability is extracted, which is an index into the table. The file entry gives the disk address as well as the cache address of the contiguous file as shown in Figure 5.5. 
No disk access is needed to fetch the i-node and at most one disk access is needed to fetch the file itself, if it is not in the cache. The simplicity of this design trades high performance for space. File table File 1 Data File 2 Data File 3 Data Figure 5.5 Amoeba File System The Bullet service supports only three basic operations on files: read file, create file and delete file. There is no write file that makes the files immutable. Once a file is created it cannot be modified unless explicitly deleted. When a file is created, the user provides all the required data and a capability is returned. Keeping files immutable and storing them contiguously has several advantages: 145 High Performance Distributed Computing • • • File retrieval is carried out in one disk read Simplified file management and administration Simplified file replication and caching because of the elimination of inconsistency problems The bullet service does not provide a high level naming service. To access a file, a process must provide the relevant capability. Since working with 128-bit binary numbers is not convenient for users, a directory service has been designed and implemented to manage names and capabilities. The directory in Amoeba has a hierarchical structure which facilitates the implementation of partially shared name spaces. Directories in Amoeba are also treated as objects and users need capabilities to access them. Thus a directory capability is a capability for many other capabilities. Essentially, a directory is a map from ASCII strings onto capabilities. A process can present a string, such as a file name, to the directory server, and the server returns the capability for that file. Using this capability, the process can access the file. Security and Protection In Amoeba, objects are protected using capabilities and server access is controlled by ports. Access to objects such as files, directories, or I/O devices can be granted or denied to users by specifying these access rights in the capabilities. The capabilities themselves must be protected against illegal tampering by users. To hide the contents of capabilities from unauthorized users they are encrypted using a key chosen randomly from a large address space. This key is stored in the check field of the capability itself. The repositories of the capabilities, the directories, can also be encrypted so that bugs in the server or the operating system do not reveal confidential information. The Amoeba system also provides a method for secure communications. A function box, or F-box, is put between each computer and the network. This F-box may be implemented in hardware as a VLSI chip on the network interface board, or as software built into the operating system. The second approach can be used in trusted computers. The F-box performs a simple one way function, P = F (G), where given F and G, P can be easily found by applying the function F. However, given P and F is computationally intractable to determine G (see Figure 5.6). Thus, to protect the port on which the servers listen to, Amoeba makes the server port P known publicly, whereas G is kept secret. When a server performs the operation get_request (G), the F-box computes P = F(G) and waits for messages to arrive on P. On the other hand, when a client issues a do_operation(P), the F-box does not carry out any transformation. 
P=F(G) Put_addr (P) known Get_addr (G) secret 146 High Performance Distributed Computing Figure 5.6 Amoeba security based on a one-way function If a user tries to impersonate a server by issuing a get_request (P) (G is a secret), the user will be listening to port F(P), which is not the server's port and is useless. Similarly, when a server sends a reply to a client, the client listens on port G' = F(P'), where P' is contained in the message sent by the server. Thus both servers and clients can be protected against impersonation using this simple scheme. Further, the F-box may be used for authenticating digital signatures. The original signature S is known only to the sender of the message, whereas S' = F(S) could be known publicly. Other Services The Amoeba system provides limited fault tolerance. The file system can maintain multiple copies of files to improve reliability and fault tolerance. The directory service is crucial for the functioning of the system since it is the place where all processes look for the capabilities they need. The directory service replicates all its internal tables at multiple sources to improve the service reliability and fault tolerance. 5.5.3 The V Distributed Operating System Goals and Advantages The V distributed operating system was developed at Stanford University to research issues of designing a distributed system using a cluster of workstations connected by a high performance network. [cher88]. The V system uses high performance inter-process communications that are comparable to local inter-process communications. The V system uses a special communication protocol, Versatile Message Transaction Protocol (VMTP), which is optimized to support RPCs and client-server interactions. The V system provides efficient mechanisms to implement shared memory and group communications. System Description Overview The V distributed operating system is a microkernel based system with remote execution and migration facilities. The V system provides resource and information services of conventional time-shared single computers to a cluster of workstations and servers. The V -kernel is based on three principles: 1) high-performance communication is the most critical component in a distributed system 2) uniform general purpose protocols are needed to support flexibility in building general purpose distributed systems 3) a small micro-kernel is needed to implement the basic protocols and services (process management, memory and communications). The V system has an efficient communication protocol to deliver efficient data transport, naming, I/O, atomic transactions, remote execution, migration and synchronization. The 147 High Performance Distributed Computing V microkernel handles inter-process communication, process management, and I/O management. All other services are implemented as service modules running in the user mode. The other distributed system services are implemented at the process level in a machine and network independent manner. System services are accessed by application programs using procedural interfaces. When a procedure is invoked, it attempts to execute the function in its own address space if possible. If unsuccessful, it uses the uniform inter-process communication protocol (VIPC) to contact the appropriate service module and implements the requested procedure. To support a global system state, distributed shared memory has been implemented by caching the shared state of the system at different nodes as virtual memory pages. 
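The procedural interface described above, in which a call is satisfied in the caller's own address space when possible and otherwise forwarded to the appropriate service module over the uniform IPC protocol, can be sketched as follows in Python (purely illustrative; the service names, the local cache, and the forwarding function are assumptions for the example, not the V system's code).

    def ipc_call(service, operation, *args):
        # Stand-in for the uniform IPC protocol used to reach a remote service module.
        print(f"IPC -> {service}.{operation}{args}")
        return f"result of {operation}{args} from {service}"

    class ServiceStub:
        """Procedural interface: execute locally when possible, otherwise
        forward the call to the appropriate service module over IPC."""
        def __init__(self, service_name):
            self.service_name = service_name
            self.local_state = {}                  # whatever can be answered locally

        def lookup(self, key):
            if key in self.local_state:            # satisfied without leaving the caller
                return self.local_state[key]
            value = ipc_call(self.service_name, "lookup", key)
            self.local_state[key] = value          # cache so later calls stay local
            return value

    names = ServiceStub("name-service")
    print(names.lookup("printer"))    # first call is forwarded over IPC
    print(names.lookup("printer"))    # second call is handled locally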
Inter-Process Communication: In the V system, the kernel and VMTP are optimized to achieve efficient inter-process communications as outlined below: 1. Supporting Short Messages: Most of the RPC traffic and 50% of V’s messages can fit in 32 byte messages. Consequently, the V system has been optimized at the kernel interface and network transmission level to transfer fixed-size (32 bytes) short messages very efficiently. 2. Efficient Communication Protocol: The V system uses a transport protocol called the Versatile Message Transaction Protocol (VMTP) that has been optimized for request-response behavior. There is no explicit connection setup and tear down in order to improve the performance of request/response communication services. VMTP optimizes error handling and flow control by using the reply message as an acknowledgment to the client request. VMTP also supports datagrams, multicast, priority, and security. 3. Efficient Message Processing: The cost of communication processing at the kernel level has been minimized by having process descriptors contain a template VMTP header with some of its fields initialized at process creation. This leads to reducing the time required to prepare the packets for network transmission. The inter-process communication mechanisms provided are based on request/response message and shared memory models. In the request/response model, a process may communicate with another by sending and receiving fixed-size messages. A send by a process blocks itself until a reply has been received. This implementation utilizes message passing with blocking semantics and corresponds to the RPC model. In the message model the server is implemented as a dedicated server that receives and acts on incoming messages. The client request is queued if the server process is found to be busy serving other requests. The message model is preferred when the requests to a server need to be serialized. On the other hand, the RPC mechanism is preferred when the requests need to be handled concurrently by the server. In both cases, the receiver or the server performs a receive to process the next message and then invokes the appropriate procedures to handle the message and send back a reply. 148 High Performance Distributed Computing The V system also supports distributed shared memory. In this form of inter-process communication a process can pass a segment in its team space (address space on the computer) to another process. After sending the message, the sender is blocked, while the receiver can read or write its segment using the primitives, CopyFrom and CopyTo, provided by the kernel. The sender is unblocked only when it receives a reply from the other process, once its task is completed. This model utilizes the data segment provided by VMTP. The V inter-process communication also supports group communication. This is necessary due to the number of processes working together in distributed systems. V supports the notion of a process group - a set of processes having a common group identifier. Any process can send and receive messages from process groups. To send a message to a group requires a group identifier instead of the processor identifier as the parameter. The multicast feature is exploited by different V services. For instance, it is used by the file service for replicated file updates and by the scheduler for collecting and dispersing load information. Resource Management: The key resources that the kernel manages are processes, memory, and devices. 
Other shared devices are managed by user-level servers. For example, printers and files are managed by the print server and the file server, respectively.
Process Management: The V kernel simplifies process management by reducing the tasks required to create and terminate a process and by migrating some of the kernel's process management tasks to user-level programs. The V system makes process initiation independent of address space creation and initialization, so process creation is simply a matter of allocating and initializing a new process descriptor. Process termination is also simplified because there are few resources at the kernel level to reclaim. Client kernels do not inform server modules when processes are killed or terminate. Instead, each server module in the V system is responsible for periodically checking its client processes to see whether they still exist; if a client process no longer exists, the module reclaims its resources. In the file server module, for example, there is a "garbage collector" process that closes files associated with dead processes.
The V kernel scheduling policy has been designed to be simple and small. In an N-node distributed system, the scheduling policy is to have the N processors run the N highest-priority processes at any given time. To implement this distributed scheduling policy, each kernel maintains statistics about the load on its host and exchanges this information with the other computers in the system. A user can run programs on the least loaded machine, thus utilizing CPU cycles on idle workstations; this is carried out transparently by the scheduler. Users can effectively utilize most of the CPU time of the workstations in the cluster, eliminating the need for a dedicated processor pool. Remote processes, called guest processes, are run at a guest priority to minimize their effect on the workstation user's own processes. A user can even offload guest processes by migrating them to other nodes. A process can be suspended by setting its priority to a low value. A facility is also provided for freezing and unfreezing a process, which is used to control modifications to its address space during migration.
Memory Management: The V kernel memory manager supports demand paging by associating address space regions with an open file. The kernel provides caching and consistency mechanisms for these regions. A reference to an address in a region is interpreted as a reference to the corresponding file data. A page fault is generated when a reference is made to an address in a region that is not yet bound to its file data. On a page fault, the kernel either binds the corresponding block of the file to the region or makes the process issue a READ request to the server that is managing the file. An address space is created simply by allocating a descriptor and binding an open file to that address space.
Device Management: Conventionally, the I/O facility is considered the primary means of communication between programs and their environment. In the V system, inter-process communication services are used for this purpose: I/O is just a higher-level protocol used with IPC to communicate a larger grain of information. All I/O operations follow a uniform protocol called the UIO protocol. The UIO protocol is block-oriented, allowing blocks of data to be read from and written to a device. The kernel provides device support for disks, mice, and network interfaces through the device server, an independent module that implements the UIO interface.
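As an illustration of this block-oriented style, the sketch below defines a tiny read/write-block interface over a generic object. The structure and function names are invented for this example and are not the actual UIO definitions; the point is only that the same block operations serve a disk, a file, or a network interface.

```c
/* Invented sketch of a UIO-like block interface: every device or file is
 * accessed by reading and writing fixed-size blocks, never byte streams. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define BLOCK_SIZE 1024                              /* V files use a 1 Kbyte unit */

struct uio_object {
    uint32_t id;                                     /* open file or device        */
    uint8_t  blocks[8][BLOCK_SIZE];                  /* toy backing store          */
    uint32_t nblocks;
};

static int uio_read_block(struct uio_object *obj, uint32_t blockno, uint8_t *buf) {
    if (blockno >= obj->nblocks) return -1;          /* block out of range         */
    memcpy(buf, obj->blocks[blockno], BLOCK_SIZE);
    return 0;
}

static int uio_write_block(struct uio_object *obj, uint32_t blockno, const uint8_t *buf) {
    if (blockno >= obj->nblocks) return -1;
    memcpy(obj->blocks[blockno], buf, BLOCK_SIZE);
    return 0;
}

int main(void) {
    struct uio_object disk = { .id = 1, .nblocks = 8 };
    uint8_t buf[BLOCK_SIZE] = "hello block";
    uio_write_block(&disk, 3, buf);                  /* same calls whether the     */
    uio_read_block(&disk, 3, buf);                   /* object is a disk, a file,  */
    printf("block 3: %s\n", (char *)buf);            /* or a network interface     */
    return 0;
}
```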
Process level servers for other devices are built upon the basic interfaces provided by the kernel. Naming Service: The V naming scheme is implemented as a three level model that consists of character string names, object identifiers, and entity identifiers. In this model, there is no specific name server. Each object manager implements the naming service for the set of objects it manages. This requires the maintenance of a unique global naming scheme. This is achieved by picking a unique global name prefix for each manager and adding it to the name handling process group. A client can locate the object manager for a given string by doing a multicast QueryName operation. To avoid the multicast operation each time an operation has to be performed on an object, each program maintains a cache of name prefix to manager bindings. This cache is initialized at the time of process initialization to avoid delays during execution. As both the directory for the objects and the objects themselves are implemented by the same server, consistency between the two increases greatly. The directory is replicated to the same extent as the object itself. This eliminates the situation in which objects are inaccessible because of the failure of the name server. Also, new services implementing their own directories can be easily incorporated. To avoid the overhead of character string name lookup, an object identifier is used to refer to the object in subsequent operations. An object identifier consists of two fields with the first one specifying the identification (ID) of the manager and the second one specifying the ID of the object. The transport level identifier of the server can be used as its ID to speed up its location. This model also allows the efficient allocation of local identifiers, as the manager does not have to communicate with a name server to implement it. The identifier just has to be unique on that server and it becomes globally unique when it is prefixed with the manager ID. Finally, entity identifiers are fixed-length binary values that identify transport level endpoints. The entity identifiers identify processes and serve as process and group 150 High Performance Distributed Computing identifiers. Their most important property is that they are host-address independent. Thus processes can migrate without affecting their entity identifiers. To achieve this independence they must be implemented globally. This gives rise to allocation problems as it requires cooperation among all instantiations of the kernel. Entity identifiers are mapped to host addresses using a mechanism similar to that used for mapping character string names. A cache of these mappings and the multicast facility are used by the kernel to improve performance. File Service: The file system in V is implemented as a user-level program running on a central file system server outside the operating system. The V system implements the mapped-file I/O mechanism to provide fast and efficient access to files. A file OPEN request is mapped onto a locally cached open file. Files are accessed using the standard UIO interface. In the UIO model, I/O operations are carried out by creating UIO objects. These objects are equivalent to open files in other systems. Operations like read, write, query, or modify are performed on these objects. The UIO interface implements blockoriented access instead of conventional byte streams to facilitate efficient handling of file blocks, network packets, and database records. 
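The three-level naming scheme described above for V (character-string prefixes resolved to object managers, a per-program prefix cache, and a multicast QueryName fallback) can be sketched as follows. The structures, names, and the bracketed prefix syntax are invented for illustration and are not the V kernel's actual definitions.

```c
/* Illustrative sketch of V-style name resolution: look up the name prefix in a
 * per-program cache first, and only fall back to a multicast QueryName when
 * the prefix is unknown. Object identifiers pair a manager ID with a local ID. */
#include <stdio.h>
#include <string.h>

struct object_id { unsigned manager; unsigned local; };   /* globally unique pair */

struct prefix_entry { char prefix[32]; unsigned manager; };
static struct prefix_entry cache[16];
static int cache_len;

/* Stand-in for the multicast QueryName sent to the name-handling process group. */
static unsigned query_name_multicast(const char *prefix) {
    printf("multicast QueryName(%s)\n", prefix);
    return 7;                                   /* pretend manager 7 answered */
}

static unsigned resolve_manager(const char *prefix) {
    for (int i = 0; i < cache_len; i++)
        if (strcmp(cache[i].prefix, prefix) == 0)
            return cache[i].manager;            /* cache hit: no multicast    */
    unsigned m = query_name_multicast(prefix);  /* cache miss                 */
    if (cache_len < 16) {
        strcpy(cache[cache_len].prefix, prefix);
        cache[cache_len++].manager = m;
    }
    return m;
}

int main(void) {
    struct object_id oid = { resolve_manager("[storage]"), 42 };
    printf("object = (manager %u, local %u)\n", oid.manager, oid.local);
    resolve_manager("[storage]");               /* second lookup hits the cache */
    return 0;
}
```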
Returning to the file service, read and write requests on files are carried out locally if the data is present in the cache. Otherwise, a READ request is sent to the server managing the original file, and the addressed block is brought in and cached. Using the UIO operations, the kernel can take advantage of the workstations' large physical memories. Caching files increases performance, since a local block access is faster than fetching a block from the backing store. This technique is especially beneficial for diskless workstations, facilitating their use and reducing the overall cost of the system. The V file server uses large buffers of 8 Kbytes to transfer large files over the network efficiently with minimal overhead. Files in V use a 1 Kbyte allocation unit; even though they are divided into 1 Kbyte blocks, a contiguous allocation scheme is used so that most files are stored contiguously on disk.
Other Services: The V system provides limited security and protection in that it assumes that the kernel, the servers, and the messages across the network cannot be tampered with. Thus network security against intruders is not supported in the V system. The protection scheme used is similar to the one provided by UNIX. Each user has an account name and an associated password, and is also given a user number. This user number is associated with the messages sent by processes created by the user. Security and protection are achieved using an authentication server that matches the name and password against an encrypted copy stored on the server. On a match, the server returns success, and the kernel then associates the user number with that process; any message sent or received by the process from this point on carries the user number. The V system does not provide any specific reliability or fault tolerance techniques. Software reliability is achieved by keeping the kernel small and the software modular.
5.5.4 MACH Distributed Operating System
Goals and Advantages
The MACH operating system is designed to integrate the functionality of both distributed and parallel computing in an operating system that is binary compatible with Unix BSD. This enables Unix users to migrate to a distributed operating system environment without giving up convenient services. The main objectives of MACH are the ability to support (emulate) other operating systems, to support all types of communication services (shared memory and message passing), to provide transparent access to network resources, to exploit parallelism in systems and applications, and to provide portability to a large collection of machines [tane95].
System Description
Overview
Mach is a microkernel-based, message-passing operating system. It has been designed to provide a base for building new operating systems and emulating existing ones (e.g., UNIX, MS-Windows) as shown in Figure 5.7. The emulated operating systems run on top of the kernel as servers or applications in a transparent manner. MACH is capable of supporting applications under more than one operating system environment simultaneously (e.g., Unix and MS-DOS).
Figure 5.7 MACH distributed operating system: operating system emulators (4.3 BSD, System V, HP/UX, and others) run as a software emulation layer in user space on top of the microkernel, which runs in kernel space.

The Mach kernel provides basic services to manage processes, processors, memory, inter-process communication, and I/O. The Mach kernel supports five main abstractions [tane95]: processes, threads, memory objects, ports, and messages. A process in Mach consists primarily of an address space and a collection of threads that execute in that address space, as shown in Figure 5.8.

Figure 5.8 Mach process management: a process groups an address space, its threads, per-process kernel state (suspend counter, scheduling parameters, emulation address, statistics), and its process, bootstrap, exception, and registered ports.

In Mach, processes are passive and are used for collecting all the resources related to a group of cooperating threads into convenient containers. The active entities in Mach are the threads, which execute instructions and manipulate their registers and address spaces. Each thread belongs to exactly one process, and a process cannot do anything unless it has one or more threads. A thread contains the processor state and the contents of the machine's registers. All threads within a process share the virtual memory address space and the communication privileges associated with their process. The UNIX abstraction of a process is simulated in Mach by combining a process with a single thread. However, Mach goes beyond this abstraction by allowing multiple threads to execute in parallel on separate processors. Mach threads are relatively heavyweight because they are managed by the kernel. Mach adopts the memory object concept to implement its virtual memory system: a memory object is a data structure that can be mapped into a process' address space. Inter-process communication is based on message passing implemented using ports, which are kernel mailboxes that support unidirectional communication.
Resource Management
The main resources to be managed include processes, processors, and memory. The Mach management of these resources is presented next.
Process Management: A process in Mach consists of an address space and a set of threads running in that address space. The threads are the active components, while processes are passive and act as containers holding all the resources required for their threads' execution. Processes use ports for communication. Mach provides several ports for communicating with the kernel, such as the process port, bootstrap port, and exception port. The kernel services available to processes are requested through the process port rather than by making a system call. The bootstrap port is used for initialization when a process starts up, in order to learn the names of the kernel ports that provide basic services. The exception port is used by the system to report errors to the process. Mach provides a small number of primitives to manage processes, such as create, terminate, suspend, resume, priority, assign, info, and threads.
Thread Management: Threads are managed by the kernel. All the threads belonging to one process share the same address space and all the resources associated with that process. On a uniprocessor, threads are time-shared; on a multiprocessor, they run concurrently. The primary data structure used by Mach to schedule threads for execution is the run queue, sketched below and described in detail next.
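The following sketch is a simplified, single-threaded illustration of the self-scheduling idea, with invented names and no locking: a processor consults its local run queue first and falls back to the shared global queue, dequeuing the highest-priority thread it finds.

```c
/* Simplified sketch of Mach-style self-scheduling: an array of priority
 * queues with a hint to the highest occupied priority; each processor
 * checks its local run queue before the shared global one. (Locking and
 * the doubly linked queue nodes are omitted for brevity.) */
#include <stdio.h>

#define NPRI 32                       /* 0 = highest priority */

struct thread { int id; struct thread *next; };

struct run_queue {
    struct thread *queue[NPRI];
    int hint;                         /* probable highest occupied priority */
    int count;
};

static void enqueue(struct run_queue *rq, struct thread *t, int pri) {
    t->next = rq->queue[pri];
    rq->queue[pri] = t;
    if (pri < rq->hint) rq->hint = pri;
    rq->count++;
}

static struct thread *dequeue(struct run_queue *rq) {
    for (int p = rq->hint; p < NPRI; p++) {
        if (rq->queue[p]) {
            struct thread *t = rq->queue[p];
            rq->queue[p] = t->next;
            rq->hint = p;
            rq->count--;
            return t;
        }
    }
    return NULL;                      /* queue is empty */
}

/* A processor picks its next thread: local queue first, then global. */
static struct thread *next_thread(struct run_queue *local, struct run_queue *global) {
    struct thread *t = dequeue(local);
    if (!t) t = dequeue(global);
    return t;                         /* NULL means the processor idles */
}

int main(void) {
    struct run_queue local = { .hint = NPRI - 1 }, global = { .hint = NPRI - 1 };
    struct thread a = { 1 }, b = { 2 };
    enqueue(&global, &a, 5);          /* remote thread, higher priority     */
    enqueue(&local,  &b, 9);          /* local thread is preferred anyway   */
    struct thread *t = next_thread(&local, &global);
    printf("running thread %d\n", t ? t->id : -1);   /* thread 2 (local) */
    return 0;
}
```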
The run queue is a priority queue of threads implemented by an array of doubly linked queues. A hint is maintained to indicate the probable location of the highest priority thread. Each run queue also contains a mutual exclusion lock and a count of threads currently queued. When a new thread is needed for execution, each processor consults the appropriate run queue. The kernel maintains a local run queue for each processor and a shared global run queue. Mach is self-scheduling in that instead of having threads assigned by a centralized dispatcher, individual processors consult the run queues when they need a new thread to run. A processor examines the local run queue first to give local threads absolute preference over remote threads. If the local queue is empty, the processor examines the global run queue. In either case, it dequeues and runs the highest priority thread. If both queues are empty, the processor becomes idle. Processor allocation: Mach aims at supporting a multitude of applications, languages, and programming models on a wide range of computer architectures. Consequently, the processor allocation approach must be flexible and portable to many different platforms. The processor allocation approach adds two new objects to the Mach kernel interface, the processor and the processor set. Processor objects correspond to and manipulate physical processors. Processor objects are independent entities to which threads and processes can be assigned. Processors only execute threads assigned to the same processor set and vice versa, and every processor and thread is always assigned to the same processor set. If a processor set has no assigned processors then threads assigned to it are suspended. Assignments are initialized by an inheritance mechanism. Each process is also assigned to a processor set but this assignment is used only to initialize the assignment of threads created in that process. In turn, each process inherits its initial assignment from its parent upon creation and the first process in the system is initially assigned to the default processor set. In the absence of explicit assignments, every thread and process in the system inherits the first process's assignment to the default processor set. All processors are initially assigned to the default processor set and at least one processor must always be assigned to it so that internal kernel threads and important daemons can remain active. Mach processor allocation approach is implemented by dividing the responsibility for processor allocation among the three components: application, server, and kernel as shown in Figure 5.9. 154 High Performance Distributed Computing Applications control the assignment of processes and threads to processor sets. The server controls the assignment of processors to processor sets. The kernel does whatever the application and server requests. In this scheme, the physical processors allocated to the processor sets of an application can be chosen to match the application requirements. Assigning threads to processor sets gives the application complete control over which threads run on which processors. Furthermore, isolating scheduling policy in a server, simplifies changes for different hardware architectures and site-specific usage policies. Figure 5.9 Processor allocation components Memory management: Each Mach process can use up to 4 gigabytes of virtual memory for the execution of its threads. This space is not only used for the memory objects but also for messages and memory-mapped files. 
When a process allocates regions of virtual memory, the regions must be aligned on page boundaries. The process can create memory objects for use by its threads and these can actually be mapped to the space of another process. Spawning new processes is more efficient because memory does not need to be copied to the child. The child needs only to touch the necessary portions of its parent's address space. When spawning a child process it is possible to mark the pages to be copied or protected. Each memory object that is mapped in a process' address space must have an external memory manager that controls it. Each class of memory objects is handled by a different memory manager. Each memory manager can implement its own semantics, can determine where to store pages that are not in memory, and can provide its own rules about what happens to objects once they have been mapped out. To map an object into a process' address space the process sends a message to a memory manager asking it to perform the mapping. Three ports are needed to achieve the requested mapping: object port, control port, and name port. The object port is used by the kernel to inform the memory manager when page faults occur and other events relating to the object. The control port is created to enable memory managers to interact and respond to kernel requests. The name port is used to identify the object. The Mach external memory manager concept lends itself well to implementing a page based distributed shared memory. When a thread references a page that it does not possess, it creates a page fault. Eventually the page is located and shipped to the faulting machine where it is loaded so the thread can continue execution. Since Mach already has memory managers for different classes of objects, it becomes natural to introduce a new memory object, the shared page. Shared pages are explicitly managed by one or more 155 High Performance Distributed Computing memory managers. One possibility is to have a single memory manager that handles all shared pages. Another is to have a different memory manager for each shared page or collection of shared pages. This allows the load to be distributed. The shared page is always either readable or writeable. If it is readable it may be replicated on multiple machines. If it is writeable, only one copy exists. The distributed shared memory server always knows the state of the shared page as well as which machine or machines it resides on. Inter-Process Communication The basis of all communication in Mach is a kernel data structure called a port. A port is essentially a protected mailbox. When a thread in one process wants to communicate with a thread in another process the sending thread writes the message to the port and the receiving thread takes it out. Each port is protected to ensure that only authorized processes can send to and receive from it. Ports support unidirectional communication and provide reliable, sequenced message streams. If a thread sends a message to a port, the system guarantees that it will be delivered. Ports may be grouped into port sets for convenience. A port may belong to only one port set. A message is a string of data prefixed by a header. The header describes the message and its destination. The body of the message may be as large as the entire address space of a process. There are simple messages that don't contain any references to ports and nonsimple messages that can reference other ports (conceptually similar to indirect addressing). 
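The port abstraction just described can be pictured as a small bounded mailbox. The sketch below is a toy illustration with invented names and sizes; the real Mach interface is considerably richer, and real message bodies can be as large as an address space rather than a fixed buffer.

```c
/* Toy sketch of a Mach-style port: a protected, unidirectional mailbox with a
 * bounded queue of messages, each carrying a small header and a body. */
#include <stdio.h>
#include <string.h>

#define PORT_LIMIT 8                     /* maximum messages queued on the port */

struct message {
    int  dest_port;                      /* header: destination port            */
    char body[64];                       /* body (real bodies may be very large)*/
};

struct port {
    struct message queue[PORT_LIMIT];
    int head;
    int count;                           /* messages currently queued           */
};

static struct port server_port;          /* one receiver, many potential senders */

static int port_send(struct port *p, const struct message *m) {
    if (p->count == PORT_LIMIT) return -1;            /* queue full: sender waits */
    p->queue[(p->head + p->count) % PORT_LIMIT] = *m;
    p->count++;
    return 0;
}

static int port_receive(struct port *p, struct message *out) {
    if (p->count == 0) return -1;                     /* nothing queued           */
    *out = p->queue[p->head];
    p->head = (p->head + 1) % PORT_LIMIT;
    p->count--;
    return 0;
}

int main(void) {
    struct message req = { .dest_port = 1 }, got;
    strcpy(req.body, "open /etc/motd");
    port_send(&server_port, &req);                    /* sending thread           */
    if (port_receive(&server_port, &got) == 0)        /* receiving thread         */
        printf("server got: %s\n", got.body);
    return 0;
}
```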
Messages are the primary way that processes communicate with each other and with the kernel, and they can even be sent between processes on different computers. Messages are not actually stored in the port itself but in another kernel data structure, the message queue. The port contains a count of the number of messages currently present in the message queue and the maximum number of messages permitted. Since messages are mapped to the virtual memory resources of processes, inter-process communication is far more efficient than in UNIX implementations, where messages are copied from one process into the limited memory space of the kernel and then into the receiving process. In Mach, the message actually resides in memory shared by the communicating processes. Memory-mapped files facilitate program development by simplifying memory and file operations into a single set of operations for both; however, Mach still supports the standard UNIX file read, write, and seek system calls.
Mach also supports communication between processes running on different machines by using Network Message Servers (NMS). The NMS is a multithreaded process that performs a variety of functions, including interfacing with local threads, forwarding messages over the network, translating data types, providing a network-wide lookup service, and providing authentication services.
5.5.5 X-Kernel Distributed Operating System
Goals and Advantages
The X-Kernel is an experimental operating system kernel developed at the University of Arizona, Tucson [hutc89]. X-Kernel can be viewed as a toolkit that provides the tools and building blocks needed to develop and experiment with distributed operating systems configured to meet the needs of a certain class of applications. The main advantages of X-Kernel are its configurability and its ability to provide an efficient environment for experimenting with new operating systems and different communication protocols.
System Description
Overview
X-Kernel is a microkernel-based operating system that may be configured to support experimentation in inter-process communication and distributed programming. The motivation for this approach is twofold: 1) no single communication paradigm is appropriate for all applications; and 2) the X-Kernel framework may be used to obtain realistic performance measurements. The X-Kernel consists of a kernel that supports memory management, lightweight processes, and the development of different communication protocols. The exact configuration of the kernel services and the communication protocols used determines an instance of the X-Kernel operating system. Therefore, we do not consider X-Kernel to be a single operating system, but rather a toolkit for constructing and experimenting with different operating systems and protocols [BORG92].
Resource Management
An important aspect of any distributed system is its ability to pass control and data efficiently between the kernel and user programs. In X-Kernel, the transfer between user applications and kernel space has been made efficient by having the kernel execute in the same address space as the user program (see Figure 5.10).
Figure 5.10 X-Kernel and user address space: the user code/data area and user stack (with user stack pointer USP) are private to each process, while the kernel code/data area and kernel stack (with kernel stack pointer KSP) are shared within the same virtual address space.

The user process can access the kernel efficiently because they run in the same address space; such an access requires approximately 20 microseconds on a SUN 3/75 [Hutc89]. The kernel can also access user data after properly setting the user stack and its arguments; a kernel-to-user data access takes around 245 microseconds on a SUN 3/75.
Inter-Process Communication: Three communication objects are used by the kernel to handle communication: protocols, sessions, and messages. A different protocol object is used for each protocol type being implemented. A session object is the interpreter of a protocol object and contains the data representing the protocol state. Message objects are transmitted by the protocol and session objects. There are multiple address spaces in the X-Kernel, each consisting of a kernel area, a user area, and a stack. If more than one process exists in a given address space, the kernel area is shared but each process has its own private stack. To communicate between processes in different address spaces or on different machines, the protocol objects are used to send messages. Processes in the same address space are synchronized using kernel semaphores. A process can execute in either user mode or kernel mode: in kernel mode the process has access to both the user and kernel areas, while in user mode it only has access to user information.
X-Kernel provides several support routines to construct and configure a wide variety of protocols. These routines include the buffer manager, the map manager, and the event manager. The buffer manager uses the heap to allocate buffers for moving messages. The map manager maps an identifier from a message header to the capabilities used by the kernel objects. The event manager allows a protocol to invoke procedures as timed events. A protocol object can create session objects and demultiplex incoming messages to them. Three operations that the protocol object uses are open, open_enable, and open_done (Figure 5.11).

Figure 5.11 Protocol (a) and Session (b) objects

The open operation creates a session object on behalf of a user process, whereas open_enable and open_done are invoked when a message arrives from the network. The protocol object also provides a demux operation, which delivers a message arriving from the network to one of the sessions created by that protocol object. Sessions have two operations: push and pop. Push is used by a higher session to send a message to a lower session; the pop operation is used by the demux operation of a protocol object to pass a message up to its session. As a message is passed between sessions, information is added to the header and the message may be split into several messages. If a message travels from a device up to the user level, it can be combined into a larger message; from the user level down to the device, a header is added.
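The push/pop pattern can be sketched in a few lines of C. The code below is an invented simplification, not the real x-kernel interface: push adds this layer's header on the way down toward the device, and pop strips it on the way up toward the user level.

```c
/* Invented sketch of the x-kernel session pattern: each layer's session adds
 * its header in push() (outgoing) and removes it in pop() (incoming). */
#include <stdio.h>
#include <string.h>

struct message { char buf[128]; };

struct session {
    const char *header;          /* header this layer prepends        */
    struct session *lower;       /* next session toward the device    */
};

/* Outgoing: prepend this layer's header, then hand to the lower session. */
static void push(struct session *s, struct message *m) {
    char tmp[128];
    snprintf(tmp, sizeof tmp, "%s|%s", s->header, m->buf);
    strcpy(m->buf, tmp);
    if (s->lower) push(s->lower, m);
    else printf("on the wire: %s\n", m->buf);
}

/* Incoming: strip this layer's header (everything up to the first '|'). */
static void pop(struct session *s, struct message *m) {
    char *body = strchr(m->buf, '|');
    if (body) memmove(m->buf, body + 1, strlen(body));   /* copy includes '\0' */
    printf("%s delivered: %s\n", s->header, m->buf);
}

int main(void) {
    struct session ip  = { "IP",  NULL };
    struct session udp = { "UDP", &ip };
    struct message out = { "payload" };
    push(&udp, &out);            /* headers added on the way down: IP|UDP|payload */
    struct message in = { "IP|UDP|payload" };
    pop(&ip, &in);               /* headers stripped on the way up */
    pop(&udp, &in);
    return 0;
}
```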
The X-Kernel has been designed so that different protocols can be implemented successfully, and the kernel can be used to research new protocols. Figure 5.12 shows a collection of protocol objects supported by the X-Kernel. As the figure indicates, the TCP protocol can be accessed directly by user programs. The use of the object model to represent any protocol makes the interface clean: any protocol can access any other protocol. For a user program to access a protocol, it presents itself as a protocol object and then invokes the target protocol.

Figure 5.12 X-Kernel protocol suite

Naming and File Service: The X-Kernel file system allows its users to access any X-Kernel resource regardless of the user's location. It provides a uniform interface by implementing a logical file system that can contain several different physical file systems. Furthermore, the logical file system concept allows the file system to be tailored to a user's requirements instead of being tied to a machine architecture. Figure 5.13 shows a private file system tree structure that contains several partitions (proj1, proj2, twmrc, journal, conf, and original); each partition can be implemented using a different physical file system.

Figure 5.13 A private file system tree containing the partitions proj1, proj2, twmrc, journal, conf, and original

The file system is logical in that it provides only the directory service and relies on existing physical file systems for the storage protocols. The Logical File System (LFS) maps a file name to the location where it can be found. The file system has two unique features. First, each user application defines its own private file system created from the existing physical file systems. Second, the separation of directory functions from storage functions is achieved through the use of two protocols: the Private Name Space (PNS) protocol, which implements the directory function, and the Uniform File Access (UFA) protocol, which implements storage access to a given file system.

Figure 5.14 Private file hierarchy: the Logical File System (LFS) is composed of the PNS and UFA protocols, layered above storage and access protocols such as NFS, AFS, and FTP

In a similar manner to the services provided by the X-Kernel logical file system, the X-Kernel command interpreter provides uniform access to a heterogeneous collection of network services. This interpreter differs from the services offered by other distributed operating systems in that it provides access not only to the resources available in a local network, but also to resources available throughout a wide area network.
WebOS (Operating System for Wide Area Applications)
Goals and Advantages
WebOS was developed at the University of California, Berkeley, and aims to provide a common set of operating system services to wide-area applications. The system offers basic services such as global naming, a cache-coherent file system, resource management, and security, and it simplifies the development of dynamically reconfiguring distributed applications.
System Description
Overview
The WebOS components together provide the wide-area analogue to local-area operating system services, simplifying the use of geographically remote resources. Since most of the services are geographically distributed, client applications should be able to identify the server that can give the best performance. In WebOS, global naming includes mapping a single service identity to multiple servers, a mechanism for load balancing among the available servers, and maintaining enough state to fail over if a server becomes unavailable. These functions are accomplished with the use of Smart Clients, which extend service-specific functionality to client machines.
Wide scale sharing and replication are implemented through a cache coherent wide area file system. WebFS is an integral part of this system. The performance, interface and caching are comparable to the existing distributed file systems. WebOS defines a model of trust providing both security guarantees and an interface for authenticating the identity of principals. Fine-grained control of capabilities is provided for remote process execution on behalf of principals. The system is responsible for authenticating the identity of the user, who requested the remote process execution and the execution should be as natural and productive as local operation. Resource Management A resource manager on each WebOS machine is responsible for job requests from remote sites. Prior to execution, resource manager authenticates the remote principal identity and determines if access rights are available. The resource manager creates a virtual machine for process execution so that running processes do not interfere with one another. Processes will be granted variable access to local resources through the virtual machine depending on the privileges of the user responsible for creating the process. The local administrator on a per-principal basis sets configuration scripts and these configuration scripts determine the access rights to the local file system, network and devices. WebOS also uses virtual machine abstraction as the basis for local resource allocation. A process runtime priority is set using the system V priocnt 1 system call and setrlimit is used to set the maximum amount of memory and CPU usage allowed. 161 High Performance Distributed Computing Naming WebOS provides a useful abstraction for location independent dynamic naming. Client applications can identify representatives of geographically distributed and dynamically reconfiguring services using this abstraction and choose the appropriate server based on load conditions and end-to-end availability. In order to provide the functionalities mentioned above, first each service name is mapped to a list of replicated representatives providing the service. Then, the server capable of best performance is selected and the choice is dynamic and non-binding. Enough state is maintained to recover from failure due to unavailability of a service provider. Naming in WebOS is in the context of HTTP service accessed through URL’s. Ideally, users refer to a particular service with a single name and the system translates the name to the IP address of the replica that will provide the best service. The selection decision is based on factors such as load on each server, traffic conditions of the network and client location. Loading application and server specific code into end clients performs the name translation. Extensions of server functionality can be dynamically loaded onto the client machine. These extensions are distributed as java applets Two cooperating threads make up the architecture of a Smart Client. A customizable graphical interface thread implements the user view of the service and a director thread performs load balancing among the representative servers and maintains state to handle failures. The interface and director threads are extensible according to the service. The load balancing algorithm uses static state information such as available servers, server capacity, server network connectivity, server location and client location as well as load information piggy-backed with some percentage of server responses. 
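A director thread's selection policy of this kind can be sketched as follows. The weighting scheme, the freshness window, and all names below are invented for illustration; this is not WebOS's actual Smart Client code.

```c
/* Invented sketch of a Smart-Client-style director: rank replicas by static
 * capacity and distance, bias the ranking by the most recently piggy-backed
 * load report, and fall back to static information when the report is stale. */
#include <stdio.h>
#include <time.h>

struct replica {
    const char *host;
    double capacity;        /* static: relative server capacity          */
    double distance;        /* static: network distance to this client   */
    double load;            /* dynamic: last piggy-backed load (0..1)    */
    time_t load_time;       /* when that load report was received        */
};

static double score(const struct replica *r, time_t now) {
    double s = r->capacity / (1.0 + r->distance);        /* static part   */
    if (now - r->load_time < 30)                         /* fresh report? */
        s *= (1.0 - r->load);                            /* bias by load  */
    return s;
}

static const struct replica *choose(struct replica *reps, int n) {
    time_t now = time(NULL);
    const struct replica *best = &reps[0];
    for (int i = 1; i < n; i++)
        if (score(&reps[i], now) > score(best, now))
            best = &reps[i];
    return best;
}

int main(void) {
    time_t now = time(NULL);
    struct replica reps[] = {
        { "svc1.example.org", 4.0, 2.0, 0.9, now },       /* busy           */
        { "svc2.example.org", 2.0, 1.0, 0.1, now },       /* lightly loaded */
        { "svc3.example.org", 3.0, 5.0, 0.5, now - 600 }, /* stale report   */
    };
    printf("chosen replica: %s\n", choose(reps, 3)->host);
    return 0;
}
```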
The client then chooses a server based on static information biased by server load. Inactive clients must initially use only static information, as the load information would have become stale. File Service WebOS provides a global cache coherent, consistent and secure file system abstraction that greatly simplifies the task of application programmers. The applications in a wide area network are diverse and may require different variations in the abstraction. Some applications may require strong cache consistency, while some others are more concerned on reduction in overhead and delay. WebFS allows applications to influence the implementation of certain key abstractions depending on their demands. A list of user-extensible properties is associated with each file to extend basic properties such as owner and permissions, cache consistency policy, prefetching and cache replacement policy and encryption policy. WebFS uses a URL-based namespace, and the WebFS daemon uses HTTP for access to standard web sites. This provides backward compatibility with existing distributed applications 162 High Performance Distributed Computing Figure 5.15 Graphical Illustration of the WebFS Architecture WebFS is built at the UNIX vnode layer with tight integration to the virtual memory system for fast cached accesses. The WebFS system architecture consists of two parts: a user-level daemon and a loadable vnode module. The various steps in the read to a file in the WebFS namespace are: - Initially, the user-level daemon spawns a thread, which makes a system call intercepted by the WebFS vnode layer. The layer then puts that process to sleep until work becomes available for it. When an application makes the read system call requesting on a WebFS file, the operating system translates the call into a vnode read operation. The vnode operation checks to see if the required page is cached in virtual memory. If the page was not found in virtual memory, one of the sleeping threads is woken up and the user level daemon is then responsible for retrieving the page, by contacting a remote HTTP or WebFS daemon (See Figure 5.15). Once the required page is found, the WebFS daemon makes a WebFS system call. The retrieved page is cached for fast access in the future. Presence of multiple threads ensures concurrent file access. The advantages offered by this system include improved performance due to caching, global access to HTTP namespace due to presence of vnode layer and easy modification of the user level daemon. The Cache consistency protocol for traditional file access in WebFS is “Last Writer Wins” (Figure 5.16). IP multicast-based update/invalidate protocol is used for widely shared, frequently updated data files. Multicast support can also be useful in the context of Web browsing and wide-scale shared access in general. Use of multicast to deliver updates can improve client latency while simultaneously reducing server load. Figure 5.16 Implementation of “Last Writer Wins” 163 High Performance Distributed Computing Security A wide area security system provides fine-grained transfer of rights between principals in different administrative domains. The security abstraction of WebOS transparently enables such rights transfer. The security system is called CRISIS and it implements the transfer of rights with the help of lightweight and revocable capabilities called transfer certificates. They are signed statements granting a subset of the signed principals' privileges to a target principal. 
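A transfer certificate can be pictured as a small signed record. The structure and the validity test below are illustrative assumptions only; real CRISIS certificates carry cryptographic signatures and are countersigned, as described next, but the sketch shows the shape of the data a reference monitor would examine.

```c
/* Illustrative sketch of a CRISIS-style transfer certificate: a signed grant
 * of a privilege subset from one principal to another, valid only until it
 * expires or is revoked. Real certificates are cryptographically signed. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define PRIV_READ  0x1
#define PRIV_WRITE 0x2
#define PRIV_EXEC  0x4

struct transfer_cert {
    const char *grantor;     /* principal whose privileges are delegated */
    const char *grantee;     /* principal receiving the subset           */
    unsigned    privileges;  /* subset of the grantor's privileges       */
    time_t      expires;
    bool        revoked;     /* may be revoked before the timeout        */
};

static bool cert_permits(const struct transfer_cert *c, const char *who,
                         unsigned want, time_t now) {
    return !c->revoked && now < c->expires &&
           (want & c->privileges) == want &&
           strcmp(who, c->grantee) == 0;         /* toy identity check */
}

int main(void) {
    struct transfer_cert c = { "alice@domainA", "batch@domainB",
                               PRIV_READ | PRIV_EXEC, time(NULL) + 3600, false };
    printf("read+exec: %d\n",
           cert_permits(&c, "batch@domainB", PRIV_READ | PRIV_EXEC, time(NULL)));
    printf("write:     %d\n",
           cert_permits(&c, "batch@domainB", PRIV_WRITE, time(NULL)));
    c.revoked = true;                            /* e.g., the key was stolen */
    printf("after revocation: %d\n",
           cert_permits(&c, "batch@domainB", PRIV_READ, time(NULL)));
    return 0;
}
```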
All CRISIS certificates must be signed and counter signed by authorities trusted by both the service provider and the consumer. Stealing keys is extremely difficult as it involves subverting two separate authorities. Transfer certificates can be revoked before timeout in the case of stolen keys (see Figure 5.17). Each CRISIS node runs a security manager, which controls the access to local resources and maps privileges to security domains. A security domain is created for each login session containing the privileges of the principal who successfully logged in. CRISIS associates names with a specific subset of principals' privileges and these are called roles. A user creates a role by generating an identity certificate containing a new public/private key pair and a transfer certificate that describes the subset of principals privileges transferred to the role. For authorization purposes, CRISIS maintains access control lists (ACL) to associate principals and groups with the privileges granted to them. File ACLs contain permissions given to principals for read, write or execute. Process execution ACLs contain the list of principals permitted to run jobs on the given node. A reference monitor verifies all the certificates for expiry and signatures and reduces all certificates to the identity of single principals. The reference monitor checks the reduced list of principals against the contents of the object ACL, granting authorization if a match is found. Figure 5.17 Interaction of CRISIS with Different Components 164 High Performance Distributed Computing 2K Operating System Goals and Advantages 2K aims to offer distributed operating system services addressing the problem of heterogeneity and dynamic adaptability. One of the important goals of this system is to provide dynamic resource management for high-performance distributed applications. It has a flexible architecture, enabling creation of dynamic execution environments and management of dependencies among the various components. System Description Overview 2K Distributed Operating System is being researched at the University of Illinois, Urbana-Champaign. 2K provides an integrated environment to support: dynamic instantiation, heterogeneous CPUs, and distributed resource management. 2K operates as configurable middleware and does not rely solely on the TCP/IP communications protocol. Instead, it provides a dynamically configurable reflective ORB providing CORBA compatibility. The heart of this project is an adaptable microkernel. This microkernel provides “What you need is what you get” (WYNIWYG) support. Only those objects needed by application are loaded. Resource management is also object based. All elements are represented as CORBA objects and each object has a networkwide identity (similar to Sombrero). Upon dynamic configuration, objects that constitute a service are assembled by the microkernel. Applications have access to system’s dynamic state once they have negotiated a connection. Each system node executes a Local Resource Manager (LRM). Naming is CORBA compliant and object access is restricted to controlled CORBA interfaces. Figure 5.18 2K OVERALL ARCHITECTURE 165 High Performance Distributed Computing Resource Management Service The Resource Management Service is composed of a collection of CORBA servers. 
The services offered by this system are • Maintaining information about the dynamic resource utilization in the distributed system • Locating the best candidate machine to execute a certain application or component based on its QoS prerequisites • Allocating local resources for particular applications or components Local Resource Managers (LRMs) present in each node of the distributed system are responsible for exporting the hardware resources of a particular node to the whole network. The distributed system is divided in clusters and a Global Resource Manager (GRM) manages each cluster. LRMs send periodic updates to the GRM on the availability of their resources (see Figure 5.19). The GRM performs QoS-aware load distribution in its cluster based on the information obtained from the LRMs. Efforts are underway to combine GRMs across clusters, which will provide hardware resource sharing through the Internet. Although LRMs check the state of their local resources frequently (e.g., every ten seconds), they only send this information to the GRM when there were significant changes in resource utilization since the last update or a certain time has passed since the last update was sent. In addition, when a machine leaves the network, the LRM deregisters itself from the GRM database. If the GRM does not receive an update from an LRM for a certain time interval, it assumes that the machine with that LRM is inaccessible. Figure 5.19 Resource Management Service The LRMs are also responsible for tasks such as QoS-aware admission control, resource negotiation, reservation and scheduling of tasks in the individual nodes. These tasks are accomplished with the help of Dynamic Soft Real-Time Scheduler, which runs as a user level process in conventional operating systems. The system’s low-level real-time API is utilized to provide QoS guarantees to applications with soft real-time requirements. 2K uses a CORBA trader to supply resource discovery services. Both the LRM and the GRM export an interface that let clients execute applications (or components) in the 166 High Performance Distributed Computing distributed system. When a client wishes to execute a new application, it sends a request with the QoS specifications to the local LRM. The LRM checks whether the local machine has enough resources to execute the application. If not, it forwards the request to the GRM, which uses its information about the resource utilization in the distributed system to select a machine, which is capable of executing that application. The request is forwarded to the LRM of the machine selected. The LRM of that machine tries to allocate the resources locally, if it is successful, it sends a one-way ACK message to the client LRM. If it is not possible to allocate the resources on that machine, it sends a NACK back to the GRM, which then looks for another candidate machine. If the GRM exhausts all the possibilities, it returns an empty offer to the client LRM. When the system finally locates a machine with the proper resources, it creates a new process to host the application. The Automatic Configuration Service fetches all the necessary components from the Component Repository and dynamically loads them into that process. Automatic Configuration Service Automatic Configuration Service aims to automate the process of software maintenance (Figure 5.20). The objective of this Automatic Configuration Service is to provide Network-Centrism and implement ``What You Need Is What You Get'' (WYNIWYG) model. 
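The WYNIWYG idea just named can be illustrated with a small sketch: starting from an application's prerequisite, load only the components it actually depends on, following each component's own dependency list. The component names and the in-memory "repository" are invented for this example; the real service pulls components from the Component Repository over the network.

```c
/* Invented sketch of WYNIWYG-style configuration: load an application's
 * required components (and their dependencies) and nothing else. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct component {
    const char *name;
    const char *deps[4];     /* names of components this one depends on */
    bool loaded;
};

static struct component repo[] = {        /* stand-in component repository */
    { "video_player", { "mpeg_codec", "audio_out" } },
    { "mpeg_codec",   { NULL } },
    { "audio_out",    { "mixer" } },
    { "mixer",        { NULL } },
    { "print_spooler",{ NULL } },          /* never needed, never loaded   */
};

static struct component *find(const char *name) {
    for (size_t i = 0; i < sizeof repo / sizeof repo[0]; i++)
        if (strcmp(repo[i].name, name) == 0) return &repo[i];
    return NULL;
}

static void load(const char *name) {       /* depth-first dependency load  */
    struct component *c = find(name);
    if (!c || c->loaded) return;
    c->loaded = true;                      /* mark first: tolerates cycles */
    for (int i = 0; i < 4 && c->deps[i]; i++) load(c->deps[i]);
    printf("loaded %s\n", c->name);
}

int main(void) {
    load("video_player");                  /* only what this application needs */
    return 0;
}
```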
All the network resources, users, software components, and devices in the network are represented as distributed objects. Each entity has a network-wide identity, a network-wide profile, and dependencies on other network entities. When a particular service is configured, the entities that constitute that service are assembled dynamically. Presence of a single network-wide account and a single network-wide profile for a particular user is the highlight of the network centric model. The access to a users profile is available throughout the distributed system. The middleware is responsible for instantiating user environments dynamically according to the user's profile, role, and the underlying platform. In the What You Need Is What You Get (WYNIWYG) model, the system configures itself automatically and only the essential components required for the efficient execution of the users application are loaded. The components are downloaded from the network, so only a small subset of system services are needed to bootstrap a node. This model offers an advantage over the existing operating systems and middleware, as it does not carry along unnecessary modules, which are not needed for the execution of the specified user application. Each application, system, or component has specific hardware and software requirements and the collection of these requirements is called Prerequisite Specifications or, simply, Prerequisites. The Automatic Configuration Service will also have to take care of the Dynamic Dependencies between the various components during runtime. CORBA objects called Component Configurators store these dependencies as lists of CORBA Interoperable Object References (IOR) linking to other Component Configurators, forming a dependence graph of distributed components. 167 High Performance Distributed Computing Figure 5.20 Automatic Configuration Framework Dynamic Security CORBA interfaces are used for gaining access to the services offered by 2K. The OMG Standard Security Service incorporates authentication, access control, auditing, object communication encryption, non-repudiation and administration of security information. CORBA Security Service is implemented in 2K utilizing the cherubin security framework to support dynamic security policies. Security policies vary depending on the prevailing system conditions and this dynamic reconfiguration is introduced with the help of reflective ORBs. The implementation supports various access control models. The flexibility provided by the security system in 2K is helpful to different kinds of applications. Reflective ORB A CORBA-compliant reflective ORB, dynamicTAO offers on-the-fly reconfiguration of the ORB internal engine and applications running on it. The Dynamic Dependencies between ORB components and application components are represented using ComponentConfigurators in dynamicTAO. The various subsystems that control security, concurrency and monitoring can be dynamically configured with the support of dynamicTAO. An interface is present for loading and unloading modules into the system 168 High Performance Distributed Computing runtime and for changing the ORB configuration state. The fact that dynamicTAO is heavyweight and uses up substantial amount of resources, makes them inappropriate for environments with resource constraints. A new architecture was developed for the 2K system enabling adaptability to resource constraints and wide range of applications, known as LegORB. 
LegORB occupies less memory space and is suited for small devices as well as high-performance workstations. Web Operating System (WOS) Goals and Advantages WOS was designed to tackle the problem of heterogeneity and dynamism present in web and the Internet. WOS uses different versions of a generic service protocol rather than fixed set of operating system services. There are specialized versions of the generic protocol, developed for parallel/distributed applications and for satisfying high performance constraints of an application. System Description Overview To cope with the dynamic nature of the Internet, different versions of Web Operating System are built based on demand-driven configuration techniques. Service classes are implemented, suited to specific user needs. Communication with these service classes is established with the help of specific instances of the generic service protocol (WOSP), called versions of WOSP. The specific versions of WOSP represent the service class they support. Addition of a service class to a WOS node is independent of the presence of other service classes and does not require any re-installation. The different protocol versions occupying a node constitute the local resources for that node. Service classes are added or removed dynamically based on user demands. Resource information is stored in distributed databases called warehouses. Each WOS node contains a local warehouse, housing information about local as well as remote resources. The entire set of WOS nodes constitute the WOSNet or WOSspace. Resource Management This system manages resources by adopting a decentralized approach for resource discovery and allocation. Allocation also involves software resources required for a service. Users should be registered in WOSNet and Hardware platform, operating system, programs and other resources should be declared for public access by a machine, for other constituents to use them. In order to reduce the overhead involved in conducting searches and resource trading, statistic methods are used to define a standard user space. This information includes the typical processes started by a user and the hosts preferred by the user to execute the services. The local system examines the requirements of a user job and determines if it can be executed with local resources. In the case of inadequate local resources, these requests 169 High Performance Distributed Computing are sent to the user resource control unit. This unit looks up the standard user space for fulfilling the request with due considerations for load sharing. If the standard user space proves to be insufficient, a search is launched by the Search Evaluation Unit, which also evaluates the results of search Architecture The services offered by WOS are accessed through a user interface, which also displays the result of execution. The Host Machine Manager performs the task of handling service requests, answering queries regarding resource availability and taking care of service execution. The User Manager with the help of knowledge obtained from local warehouse, allocates and requests resources needed by a service. Communication Protocols The discovery/location protocol (WOSRP) is responsible for the discovery of the available versions of WOSP. It locates the nodes supporting a specific version and connects to WOS nodes, which implement a specific version. The versions of WOSP differ only in the semantics they convey and hence a common syntax can be used for the purpose of transmission. 
The WOSP parser is responsible for syntax conversion. The WOSP Analyzer module is configured to support various versions of WOSP and the Figure 5.21 Architecture of a WOS Node information from the local warehouse is used to access a particular instance of the Analyzer. Figure 5.22 shown below represents the various functional layers. 170 High Performance Distributed Computing Figure 5.22 Services for Parallel/Distributed (PD) Applications and High Performance Computing A specialized class services PD applications and this specific version of WOSP is known as the Resource Discovery WOS Protocol (RD-WOSP). The PD application is assumed to be split into modules with separate performance and resource requirements for each module. The execution of each module is assigned to a WOS node capable of providing the necessary resources for that module. The PD application is thus mapped to a virtual machine consisting of many WOS nodes executing the various modules of the application. The services are invoked through the user interface or by calling the appropriate routines. The arguments to the routines are the service requested and the identity of the application requiring the service. The various service routines available for RD-WOSP are Discovery Service Routine: This routine locates the set of WOS nodes with the ability to satisfy the resource requirements of the application. Reservation Service Routine: This routine returns a value based on whether the reservation was granted or not. Setup Service Routine: The value returned by this routine is true if the setup of a module was successful and false if unsuccessful. The high performance constraints of an application are met with the help of a service class called High Performance WOSP (HP-WOSP). Like the RD-WOSP, it also offers services for discovery, reservation and setup. HP-WOSP is essentially an extension of RD-WOSP. An HP application is decomposed into a granularity tree with its root representing the entire application and the leaves representing atomic sequential processes. Each vertex of the tree can be treated as a module and has its own high performance constraints in terms of bandwidth, latency and CPU requirements. The HP171 High Performance Distributed Computing WOSP discovery service identifies a set of WOS nodes possessing the required resources to execute a subset of the granularity tree vertices. Factors such as load balancing and network congestion are considered while selecting the appropriate nodes. Thus, the Web Operating System acts as a meta-computing tool for supporting PD and High Performance applications. Trends in Distributed Operating Systems In this section, we explore the future of distributed operating systems. We do this by examining distributed operating systems currently being researched. SOMBRERO The first system we look at is Sombrero being researched at Arizona State University. The underlying concept is a Very Large Single Address Space distributed operating system. In the past, address spaces were limited to those that could be addressed by 32 bit registers (4 GB). With new computers, we now have the ability to manipulate 64 bit addresses. In a 64 bit address space, we could generate a 32 bit address space object once a second for 136 years and still not run out of unique addresses. With this much available address space, a system-wide set of virtual addresses is now possible. A virtual address is permanently and uniquely bound to every object. 
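The arithmetic behind the single-address-space claim, and the idea of handing out permanently unique addresses, can be shown with a toy allocator; Sombrero's real mechanism is not reproduced here.

```c
/* Toy illustration of single-address-space allocation: carve permanently
 * unique 4 GB (32-bit) regions out of one 64-bit virtual address space.
 * 2^32 such regions handed out at one per second last about 136 years. */
#include <stdint.h>
#include <stdio.h>

static uint64_t next_region = 0;          /* never reused, never translated */

static uint64_t allocate_region(void) {   /* returns the base of a 4 GB region */
    return next_region++ << 32;
}

int main(void) {
    uint64_t a = allocate_region();
    uint64_t b = allocate_region();
    printf("object A at %#llx, object B at %#llx\n",
           (unsigned long long)a, (unsigned long long)b);

    /* 2^32 regions, one per second: */
    double years = 4294967296.0 / (60.0 * 60.0 * 24.0 * 365.0);
    printf("one region per second lasts about %.0f years\n", years);
    return 0;
}
```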
With this much available address space, a system-wide set of virtual addresses becomes possible, with a virtual address permanently and uniquely bound to every object. This address spans all levels of storage and is manipulated directly by the CPU; no address translation is required, and all physical storage devices are viewed as caches for the contents of virtual objects. Using a single address space reduces the overhead of inter-process communication and of file system maintenance and access, and it eliminates the need for multiple virtual address spaces. All processor activities are distributed and shared by default, which provides transparent inter-process communication and unified resource management; threads may migrate with no change of address space. Security is provided by restricting access on a per-object basis.

Network Hardware IS the O.S.
The next system is being researched at the University of Madrid. Its distinctive approach is to develop an adaptable and flexible distributed system in which the network hardware itself is considered the operating system. The starting point is a minimal, adaptable, distributed microkernel; the goal is to "build distributed-microkernel based Operating Systems instead of microkernel based distributed systems." The entire network is treated as exported and multiplexed hardware rather than as a set of isolated entities: "normal" microkernels multiplex local resources only, while these adaptable microkernels multiplex local and remote resources. The only abstraction is the shuttle, a program counter and its associated stack pointer, which can be migrated between processors. Communication is handled by portals, a distributed interrupt line that behaves like active messaging; portals are unbuffered, and the user determines whether communication is synchronous, asynchronous, or RPC. Although this system does not use a single address space, physical addresses may refer to remote memory locations, which is made possible by a distributed software translation look-aside buffer (TLB).

Virtually Owned Computers
The last operating system is the Virtually Owned Computer, being researched at the University of Texas at Austin. Each user in this system owns an imaginary computer, a virtual computer. The virtual computer is only a description of the resources the user is entitled to and may not correspond to any real physical hardware; it consists of a CPU and a scheduling algorithm. Each user is promised a given quality of service, or expected level of performance, and the service the user receives is independent of where execution actually takes place.

Summary
In this chapter, we reviewed the main issues that should be considered when designing and evaluating distributed operating systems: system model and architecture, inter-process communication, resource management, name and file services, security, and fault tolerance. We also discussed how some of these issues are implemented in representative distributed operating systems such as Locus, Amoeba, V, Mach and the x-kernel. The future of distributed computing appears to include large single address spaces, unique global identifiers for all objects, and distributed adaptable microkernels. Distributed operating systems have also been characterized with respect to their ability to provide parallel and distributed computing over a Network of Workstations (NOW); in such an environment, systems can be compared in terms of their support for remote execution, parallel processing, design approach, compatibility, and fault tolerance [keet95].
Table 5.1 shows a comparison of earlier distributed operating systems when they are used in a NOW environment [keet95]. The systems compared are GLUNix, Accent, Amber, Amoeba, Butler, Charlotte, Clouds, Condor, Demos/MP, Eden, NetNOS, Locus, Mach, NEST, Newcastle, Piranha, Plan 9, Sidle, Spawn, Sprite, V and VaxCluster, and each is rated along five groups of criteria: remote execution (RX, TR, Mi), parallel jobs (PJ, JC, GS, DR), design (IR, DC), compatibility (UL, EA, HP), and fault tolerance (FT, CK).

Table 5.1 Previous Work in Distributed Operating Systems (NOW Retreat)
Chapter 5 Architectural Support for High-Speed Communications

5.1 Introduction
With the development of high speed optical fibers, network speeds are moving toward gigabits per second. At such bandwidths the slowest part of computer communication is no longer physical transmission: to use the bandwidth available in high speed networks effectively, computers must be able to switch and route packets at extremely high speeds, and the bottleneck has therefore moved from physical transmission to protocol processing. Several challenging research issues must be addressed to remove the protocol processing bottleneck so that high performance distributed systems can be designed. Some of these issues are outlined below:

1. The host-to-network interface imposes excessive overhead in the form of processor cycles, system bus capacity and host interrupts. Data moves at least twice over the system bus [15], and the host is interrupted by every received packet; with bursty traffic, the host barely has time to do any computation while receiving packets. Furthermore, with a synchronous send/receive model the sender blocks until the corresponding receive is executed, so there is no overlap between computation and communication. In the asynchronous send/receive model, messages transmitted by the sender are stored in the receiver's buffer until they are read by the receiving host; since the operating system is involved in message reception, the asynchronous model also carries heavy overhead. The interface should therefore be designed to off-load communication tasks from the host [15], to reduce the number of times data is copied, and to increase the overlap of computation and communication as much as possible.

2. Conventional computer networks use shared medium architectures (ring or bus).
With the increasing number of users and of applications with intensive communication requirements, shared medium architectures are unlikely to support low latency communication. Switch-based architectures, which allow several message passing activities to take place simultaneously in the network, should be considered as a potential replacement for shared medium architectures [19].

3. A necessary condition for low-latency communication in parallel and distributed computing is that fine-grain multiplexing of network resources be supported efficiently. A typical way to achieve fine-grain multiplexing is to split each message into small fixed-size data cells, as in ATM, where messages are transmitted and switched as a sequence of 53-byte cells, as discussed in Chapter 3.

Currently, the techniques proposed to address these problems focus on improving one or more of the main components (networks, protocols, and network interfaces) of the communication layer of the distributed system reference model. Faster networks can be achieved by using high speed communication lines (e.g., fiber optics) and high speed switching devices (e.g., ATM switches). High speed communication protocols can be developed by a combination of one or more of three techniques: developing new high speed protocols, improving the structure of existing protocols, and implementing protocol functions in hardware. Faster network interfaces improve the performance of the communication system by off-loading protocol tasks (e.g., data copying and protocol processing) from the host and running them on high speed adapter boards. Figure 5.1 summarizes the techniques that have been proposed in the literature to improve communication subsystems.

Figure 5.1: Research Techniques in High-Speed Communications.

In this chapter we focus on architectural support for high-speed communication, which has a significant impact on the design of high performance distributed systems. We discuss important design techniques for high speed switching devices and host network interfaces, and hardware implementations of standard transport protocols.

5.2 High Speed Switching
An important component of any computer network is the interface communication processor, also referred to as a switch or fabric, whose function is to route and switch the packets or messages flowing in the network. In this section we address the design architectures of high speed ATM switches. Designing ATM packet switches capable of switching relatively small packets at rates of 100,000 to 1,000,000 packets per second per line is a challenging task. A number of ATM switch architectures have been proposed, many of which are discussed in the following sections, along with their performance characteristics and implementation issues.
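As a rough sanity check on these figures, an OC-3 (155.52 Mbit/s) ATM port carrying 53-byte cells must handle on the order of a few hundred thousand cells per second per line; the calculation below is an illustration under that assumed line rate, not a figure taken from the text.

    # Cells per second on one ATM port, assuming an OC-3 line rate of 155.52 Mbit/s
    # and 53-byte (424-bit) cells; framing overhead is ignored here.
    line_rate_bps = 155.52e6
    cell_bits = 53 * 8
    print(round(line_rate_bps / cell_bits))   # ~366,792 cells/s per line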
Architectures for ATM switches can be classified into three types: shared memory, shared medium and space division. An ATM packet switch can be viewed as a black box with N inputs and N outputs, which routes packets received on its N inputs to its N outputs based on the routing information stored in the packet header. For the switches covered in the following sections, the following assumptions are made: all switch inputs and outputs have the same transmission capacity (V bits/s); all packets are the same size; and arrival times of packets at the switch inputs are time-synchronized, so the switch can be considered a synchronous device.

5.2.1 Shared Memory Architectures
Conceptually, a shared memory ATM switch consists of a set of N inputs, N outputs and a dual-ported memory, as shown in Figure 5.2. Packets received on the inputs are time-multiplexed into a single stream of data and written to packet first-in, first-out (FIFO) queues formed in the dual-ported memory. Concurrently, a stream of packets is formed by reading the packet queues from the dual-ported memory; the packets in this stream are demultiplexed and written to the N outputs.

Figure 5.2: Shared Memory Switch Architecture (transfer rate of each input and output = V bits/second).

Shared memory architectures are inherently free of the internal blocking that characterizes many Banyan-based and space division switches. They are not free of output blocking, however: during any time slot, two packets entering the switch may be destined for the same output. Because of this possibility, and because the data rate of the switch outputs is usually the same as that of the inputs, there must be buffers for each output or packets may be lost. In the shared memory switch the output buffers are the shared memory itself, which acts as N virtual output FIFO memories. The required size of each output buffer is derived from the desired packet loss probability for the switch; by modeling the expected average and peak loads seen by the switch, the buffer size can be calculated to ensure the target packet loss probability.

The shared memory concept is simple but suffers from practical performance limitations. The potential bottleneck in this architecture is the bandwidth of the dual-ported memory and its control circuitry. The control circuitry that directs incoming packets into the virtual FIFO queues must be able to determine where to direct N incoming packets; if it cannot keep up with the flow of incoming packets, packets will be lost. In addition, the bandwidth of the dual-ported memory must be large enough to carry N packets at the port data rate V for both input and output, so the memory bandwidth must be 2NV bits/second. The number of ports and the port speeds of a shared memory switching module are therefore bounded by the bandwidth of the shared memory and its control circuitry. Switches larger than what the hardware limitations allow can be built by interconnecting many shared memory switching modules in a multistage configuration, which trades packet latency for a greater number of switch ports.
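The short simulation below illustrates the behavior just described for one time slot of an output-queued shared memory switch: packets that find a full output queue are dropped, and the required memory bandwidth grows as 2NV. It is an illustrative model only, not a description of any particular switch implementation.

    import random
    from collections import deque

    def shared_memory_slot(arrivals, queues, depth):
        """One time slot: enqueue arrivals into per-output FIFOs, then serve one
        packet per output. 'arrivals' maps input port -> destination output port."""
        dropped = 0
        for inp, out in arrivals.items():          # multiplexed write phase
            if len(queues[out]) < depth:
                queues[out].append((inp, out))
            else:
                dropped += 1                       # output buffer full -> cell loss
        served = [queues[o].popleft() for o in queues if queues[o]]  # read phase
        return served, dropped

    N, V, depth = 8, 155.52e6, 4
    queues = {o: deque() for o in range(N)}
    arrivals = {i: random.randrange(N) for i in range(N)}   # random destinations
    served, dropped = shared_memory_slot(arrivals, queues, depth)
    print(len(served), "served,", dropped, "dropped")
    print("required memory bandwidth =", 2 * N * V / 1e9, "Gbit/s")  # 2NV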
5.2.2 Shared Medium Architectures
A shared medium ATM switch consists of a set of N inputs, N outputs, and a common high speed medium such as a parallel bus, as shown in Figure 5.3. Incoming packets are time-multiplexed onto this common high speed medium. Each switch output has an address filter and a FIFO to store outgoing packets: as the time-multiplexed packets appear on the shared medium, each address filter discards the packets that are not destined for its output port and passes those that are. Packets passed by the address filter are stored in the output FIFO and then transmitted on the network.

Figure 5.3: Shared-Medium Switch Architecture (transfer rate of each input and output = V bits/second).

The shared medium architecture is very similar to the shared memory architecture in that incoming packets are time-multiplexed into a single packet stream and then demultiplexed into separate streams. The difference lies in the partitioning of the storage memory for the output channels: in the shared memory architecture all output channels use the dual-ported memory for packet storage, while in the shared medium architecture each output port has its own storage memory. This partitioning implies the use of FIFO memories rather than a dual-ported memory. For a shared medium switch, the number of ports and the port speeds of a switching module are bounded by the bandwidth of the shared medium and of the FIFO memories used in its implementation; the aggregate speed of the bus and FIFO memories must be no less than NV bits/second or packets may be lost. Switches larger than what the hardware limitations allow can be built by interconnecting many switch modules in a multistage configuration.

Shared medium architectures, like shared memory architectures, are inherently free of the internal blocking that characterizes many Banyan-based and space division switches, but they are not free of output blocking: there must be buffers for each output or packets may be lost. The buffers in this architecture are the output FIFO memories, and the required size of each output FIFO is derived from the desired packet loss probability for the switch; by modeling the expected average and peak loads seen by the switch, the FIFO size can be calculated to ensure the target packet loss probability. It should be noted that the shared medium and shared memory architectures are guaranteed to transmit packets in the order in which they were received. Both architectures suffer from memory and medium bandwidth limitations, which restrict the potential speed and number of ports of switches designed around them, and both have packet loss rates that depend on the memory size of the switch.

5.2.3 Space Division Architectures
Space division ATM switch architectures differ from the shared memory and shared medium architectures. In a space division switch, concurrent paths are established from the switch inputs to the switch outputs, each path with a data rate of V bits/second.
An abstract model for a space division switch is shown in Figure 5.4. Space division architectures avoid the memory bottleneck of the shared memory and shared medium architectures, since no memory component in the switching fabric has to run at a rate higher than 2V. Another distinctive feature of the space division architecture is that control of the switch need not be centralized but may be distributed throughout the switching fabric.

Figure 5.4: Abstract Model for Space Division Switches (inputs, concentrators, routers, buffers, outputs; transfer rate of each input and output = V bits/second).

This type of architecture, however, exchanges memory bandwidth problems for problems unique to space division fabrics. Depending on the particular internal switching fabric used and on the resources available to establish paths from the inputs to the outputs, it may not be possible to set up all required paths simultaneously. This characteristic, commonly referred to as internal blocking, potentially limits the throughput of the switch and is the central performance limitation of space division architectures. A related issue is buffering: in fabrics exhibiting internal blocking it is not possible to buffer packets only at the outputs, as in shared memory and shared medium switches. Instead, buffers must be located at the places where conflicts along paths may occur, or upstream of them; ultimately, buffers may be placed at the inputs of the switch. The placement of buffers has an important effect on both the performance of a space division switch and its hardware implementation.

Crossbar Switch
A crossbar switching fabric consists of a square array of N² crosspoint switches, one for each input-output pair, as shown in Figure 5.5. A crosspoint switch is a transmission gate that can assume two states, the cross state and the bar state. Assuming that all crosspoint switches are initially in the cross state, to route a packet from input line i to output line j it is sufficient to set the (i,j)th switch to the bar state and leave the switches (i,k), k=1,2,...,j-1 and the switches (k,j), k=i+1,...,N in the cross state; the state of any other switch is irrelevant.

Figure 5.5: Crossbar Fabric (transfer rate of each input and output = V bits/second).

In the crossbar fabric, unlike the shared medium and shared memory switches, there is no need for output buffering, since there can be no congestion at the output of the switch; buffering at the output is replaced with buffering at the input. One difference between the output buffering of shared medium and shared memory switches and the input buffering of the crossbar switch is that the memory bandwidth of the crossbar switch need only match the transfer rate V of a switch input rather than NV, which removes much of the memory bandwidth limitation found in the shared medium and shared memory switches. As long as there is no output conflict, all incoming packets can reach their respective destinations free of blocking, thanks to the N² crosspoints in the fabric.
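The following sketch illustrates the conflict-free case just described: a permutation of distinct outputs is routed by marking one crosspoint per input as "bar", while every other crosspoint stays in the cross state. It is an illustration only, with a hypothetical function name.

    def route_permutation(requests):
        """requests[i] = output requested by input i (all distinct -> no conflict).
        Returns the set of crosspoints placed in the 'bar' state; every other
        crosspoint of the N x N array stays in the 'cross' state."""
        if len(set(requests)) != len(requests):
            raise ValueError("output conflict: needs input buffering/arbitration")
        return {(i, j) for i, j in enumerate(requests)}   # one bar crosspoint per input

    # Usage: a 4x4 crossbar routing the permutation 0->2, 1->0, 2->3, 3->1.
    print(sorted(route_permutation([2, 0, 3, 1])))
    # [(0, 2), (1, 0), (2, 3), (3, 1)]  -- all four packets forwarded in one slot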
If, on the other hand, more than one packet in the same slot is destined for the same output, only one of these packets can be routed to that output, because they contend for the same arc in the fabric; the remaining packets must be buffered at the input or dropped somewhere within the switch. The size of the input buffers in a crossbar fabric has a direct effect on the packet loss probability of the switch, in the same way that the size of the output buffers affects the loss probability of the shared memory and shared medium switches; as for those switches, the input buffer size can be calculated from models of the expected average and peak loads to ensure the target packet loss probability. The limitation of a crossbar implementation is that it requires N² crosspoints, which limits its realizable size. Crossbar fabrics also have the drawback of non-constant transit times across input/output pairs, and when self-routing is used the processing performed at each crosspoint requires knowledge of the complete port address.

Knockout Switch Architecture
A Knockout ATM switch, illustrated in Figure 5.6, consists of a set of N inputs and N outputs. Each switch input has its own packet routing circuit, which routes each input packet to its destination output interface. Each output interface contains N address filters, one per input line, which drop the packets not destined for that output port. The outputs of the address filters for a particular switch output are connected to an NxL concentrator, which selects up to L of the packets passing through the N address filters in a given slot; if more than L packets are present at the input of the concentrator in a given slot, only L are selected and the remaining packets are dropped. The NxL concentrator simplifies the output interface circuitry by reducing the number of output buffer FIFO memories and the complexity of the control circuitry without exceeding the required packet loss probability of the switch: for uniform traffic, a packet loss rate of 10^-6 is achieved with L as small as 8, regardless of the switch load and size [14].

Figure 5.6: Knockout switch basic structure [5].

The Knockout switch architecture is practically free of internal blocking. The potential for packet loss in the NxL concentrator does imply a form of internal blocking, since the presence of L+1 packets at the concentrator input results in packet loss, and this loss could conceivably be reduced by buffering at the concentrator input. The Knockout switch is not free of output blocking and must therefore provide buffering for each output or packets may be lost; the buffers are FIFO memories whose required size is derived from the desired packet loss probability, again calculated by modeling the expected average and peak loads seen by the switch. The bandwidth requirement of the Knockout switch is only LV, compared with the NV requirement of the shared medium switch, which yields a potential performance gain of N/L for the Knockout switch over the shared medium switch where memory bandwidth is concerned.
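A one-slot model of the N x L concentrator described above is sketched below; it simply keeps at most L of the cells addressed to an output and drops (knocks out) the rest. It is an illustration under the stated assumptions, not the Knockout hardware design, and the loss figure it prints is a simulation estimate rather than the analytical 10^-6 value cited in the text.

    import random

    def knockout_concentrator(cells, L):
        """Keep at most L cells destined for one output in a slot; drop the rest."""
        return cells[:L], max(0, len(cells) - L)

    # Crude loss estimate for one output under uniform traffic (N=32, load rho=0.9).
    N, L, rho, slots = 32, 8, 0.9, 200_000
    lost = offered = 0
    for _ in range(slots):
        # Each of the N inputs sends a cell to this output with probability rho/N.
        arrivals = sum(1 for _ in range(N) if random.random() < rho / N)
        offered += arrivals
        lost += knockout_concentrator(list(range(arrivals)), L)[1]
    print("observed loss ratio:", lost / max(offered, 1))   # very small for L = 8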
Although the Knockout switch is a space division switch like the crossbar switch, each input has a unique path to the output buffers, much as in the shared medium switch, and so circumvents the internal blocking problem. This lack of internal blocking sets the Knockout switch apart from other space division architectures such as the crossbar.

Integrated Switch Architecture
In an NxN Integrated Switch fabric, a binary tree is used to route data from each switch input to a shift register contained in an output buffer, as shown in Figure 5.7. Each shift register holds exactly one packet. During every time slot, the contents of the N registers corresponding to a given output line are emptied sequentially into an output FIFO memory by a multiplexor running at N times the input line rate [6].

Figure 5.7: Integrated Switch Fabric [8].

The Integrated switch architecture is not free of output blocking and must therefore provide buffering for each output or packets may be lost; the required size of each output FIFO is derived from the desired packet loss probability and can be calculated by modeling the expected average and peak loads seen by the switch. Although still a space division architecture, it resembles the shared medium architecture if the self-routing tree and the packet storage shift registers are thought of as the shared medium. The FIFO memory and the circuitry multiplexing the packets stored in the shift registers of each output must operate at NV bits/second, much like the output FIFO memories of the shared medium architecture. This performance limitation restricts the potential speed and number of ports of switches designed around this architecture, and the packet loss rate depends on the size of the output FIFO memory. It should be noted that the Integrated switch fabric is guaranteed to transmit packets in the order in which they were received.

Banyan-based Fabrics
Alternatives to the crossbar implementation are based on multistage interconnection networks, generally referred to as Banyan networks. A multistage interconnection network for N inputs and N outputs, where N is a power of 2, consists of log2 N stages, each comprising N/2 binary switching elements, with interconnection lines between the stages placed so as to provide a path from every input to every output, as shown in Figure 5.8.

Figure 5.8: Banyan Interconnection Network [5].

The example is constructed by starting with a binary tree connecting the first input to all N outputs and then constructing similar trees for the remaining inputs, always sharing the binary switches already present in the network to the maximum extent possible. An NxN multistage interconnection network possesses the following properties: there is a single path connecting each input line to each output line;
the establishment of such a path may be accomplished in a distributed fashion using a self-routing procedure; up to N paths may be established between inputs and outputs simultaneously, the achievable number being a function of the specific pattern of requests present at the inputs; and the networks possess a regular structure and a modular architecture.

The drawback of Banyan-type switches is internal blocking and the resulting throughput limitation. Simulation results have shown that the maximum throughput attainable by a Banyan switch is much lower than that of a crossbar switch, and throughput diminishes as the number of switch ports increases. Switch architectures based on Banyan networks are distinguished by the means they use to overcome internal blocking and to improve throughput and packet loss performance [6]. One way to enhance the Banyan architecture is to place buffers at the points of routing conflict; switches of this type are referred to as Buffered-Banyan switches. Another method uses input buffering and blocks packets at the inputs based on control signals that prevent blocking inside the fabric. Performance can also be improved by sorting the input packets to remove output conflicts and presenting to the Banyan router packet permutations that are guaranteed not to block; packets not routed in a time slot under this method are buffered and retried. This approach is referred to as the Batcher-Banyan switching fabric [5].

Buffered-Banyan Fabrics
In a Buffered-Banyan switch, packet buffers are placed at each of the inputs of the crosspoint switches. If a conflict occurs while attempting to route two packets through the Banyan fabric, only one of the conflicting packets is forwarded; the other remains in its buffer and routing is retried during the next time slot. The size of the buffer has a direct, positive effect on the throughput of the switch and is chosen to improve its packet loss performance. The benefit of adding buffers to the front end of a Banyan switch is diminished if internal conflicts persist within the interconnection network. Figure 5.9 illustrates this internal routing congestion, which is especially severe when two heavily loaded paths through the interconnection network must share an internal link.

Figure 5.9: Congestion in a Buffered-Banyan switching fabric.

To diminish the effect of internal link congestion, a distribution network can be placed in front of the routing network. A distribution network is a Banyan network used to spread incoming packets across all of the switch inputs by alternately routing packets to each of the outputs of every crosspoint, paying no regard to the destination address of the packet. The combination of a distribution network and a routing network is equivalent to a Benes interconnection network, as shown in Figure 5.10 [11]. The number of possible paths through a Benes network is greater than one, offering the potential to reduce the number of internal conflicts during packet routing; if routing requests result in an internal conflict, the routing through the network can be rearranged to eliminate it. This property makes the Benes network rearrangeably non-blocking [6].

Figure 5.10: An 8x8 Benes network.
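The self-routing property mentioned above can be made concrete with a few lines of code: at stage s of a log2(N)-stage Banyan network, a 2x2 element examines bit s of the destination address and forwards the cell on its upper or lower outgoing link; two cells that request the same outgoing link of the same element in one slot give an internal conflict. The sketch below is illustrative only.

    def self_route(dest, n_stages):
        """Return the sequence of port choices (0 = upper, 1 = lower) that a cell
        destined for output 'dest' takes through a log2(N)-stage Banyan network."""
        # Stage s consumes the s-th most significant bit of the destination address.
        return [(dest >> (n_stages - 1 - s)) & 1 for s in range(n_stages)]

    N, stages = 16, 4                      # 16x16 network, log2(16) = 4 stages
    print(self_route(0b1011, stages))      # [1, 0, 1, 1] -> output 11

    # If two cells arrive at the same first-stage element and request the same
    # outgoing link, only one can proceed (internal blocking); here both want link 1.
    print(self_route(0b1000, stages)[0] == self_route(0b1111, stages)[0])  # True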
Batcher-Banyan Fabrics
An alternative way of improving the throughput of a self-routing Banyan network is to process the inputs before introducing them into the network, as is done in the Batcher-Banyan switch fabric shown in Figure 5.11 [?]. The Batcher-Banyan network is based on the following property: any set of k packets, k ≤ N, which is free of output conflicts, sorted according to output addresses and concentrated on the top k lines, is realizable by the OMEGA network [6]. Incoming packets are sorted according to their requested output addresses by a Batcher sorter, which is based on a bitonic sorting algorithm and has a multistage structure similar to an interconnection network [12]. Packets with conflicting output addresses are then removed; this is accomplished with a running adder, usually referred to as the "trap network". The remaining packets are concentrated onto the top lines of the fabric, which may be done by means of a reverse OMEGA network, and the concentrated packets are then routed through the OMEGA network. Packets not selected by the trap network are recirculated and fed back into the fabric in later slots. A certain number of input ports, M, are reserved for this purpose, reducing the number of input/output lines the fabric can serve; since the number of recirculated packets may exceed M, buffering of these packets may still be required. M and the buffer size are selected so as not to exceed a given loss rate [5].

Figure 5.11: Batcher-Banyan switch architecture.

Multiple Banyan Switching Fabrics
Multiple Banyan networks can be used in two ways, in series or in parallel; in both cases they help overcome internal blocking by providing more paths between inputs and outputs. As shown in [?, ?], multiple Banyan networks can be used in parallel to reduce the input load on each network; reducing the load on a Banyan network reduces the probability of internal conflicts and thus increases the potential throughput of the switch. The outputs of the parallel Banyan networks are merged into output port buffers, and the throughput of the parallel multiple-Banyan switch improves as the number of parallel networks increases. The tandem Banyan switching fabric, introduced in [?], overcomes internal blocking and achieves output buffering without having to provide N² disjoint paths. It places multiple copies of the Banyan network in series, increasing the number of realizable concurrent paths between inputs and outputs, and modifies the switching elements to operate as follows: upon a conflict between two packets at some crosspoint, one of the two packets is routed as requested and the other is marked and routed the other way. At the output of the first Banyan network, correctly routed packets are placed in output port buffers, while packets marked as misrouted are unmarked and placed on the inputs of the second Banyan fabric for further processing; this process is repeated through K Banyan networks in series. Note that the load on successive Banyan networks decreases, and with it the likelihood of internal conflicts; by choosing a sufficiently large K, the throughput can be increased and the packet loss rate decreased to the desired levels.
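A toy, purely software rendering of the sort-trap-concentrate idea used in the Batcher-Banyan fabric is shown below: sort cells by requested output, trap all but one cell per output for recirculation, and hand the conflict-free, concentrated batch to the self-routing stage. The function names are hypothetical.

    def sort_trap_concentrate(cells):
        """cells: list of (input_port, output_port). Returns (to_route, recirculate):
        a conflict-free batch sorted by output address, plus the trapped cells."""
        cells = sorted(cells, key=lambda c: c[1])        # Batcher sort by output
        to_route, recirculate, last_out = [], [], None
        for cell in cells:                               # trap network: one winner
            if cell[1] != last_out:                      # per distinct output
                to_route.append(cell)
                last_out = cell[1]
            else:
                recirculate.append(cell)                 # fed back in a later slot
        return to_route, recirculate                     # to_route is concentrated

    batch = [(0, 3), (1, 1), (2, 3), (3, 0)]
    print(sort_trap_concentrate(batch))
    # ([(3, 0), (1, 1), (0, 3)], [(2, 3)])  -- (2, 3) loses and is recirculated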
5.3 Host Network Interface
The host-to-network interface imposes excessive overhead in the form of processor cycles, system bus capacity and host interrupts. The host is interrupted by every received packet, and with bursty traffic it barely has time to do any computation while receiving packets. High-performance distributed systems cannot provide high throughput and low latency to their applications unless this host-network interface bottleneck is resolved. Current advances in I/O buses that operate at 100 Mbytes/second, efficient implementations of standard protocols, and the availability of special-purpose VLSI chips (e.g., HIPPI chips, ATM chips) are not by themselves sufficient to solve the host-network interface problem [29]. Intensive research efforts have been published to characterize the overheads related to computer network communications [?]. It is generally agreed, however, that there is no single source of overhead: one needs to streamline all the communication functions required to transfer information between a pair of computers. For example, Figure 5.12 shows the communication functions required to send messages over the socket interface [29]. These functions can be grouped into three classes:

1. Application Overhead: the overhead incurred by using socket system calls to set up the socket connection (for both connectionless and connection-oriented service) between the sender and the receiver. This overhead can be reduced by using lightweight protocols that avoid heavy connection setup, for example by using a permanent or semi-permanent connection between the communicating computers.

2. Packet Overhead: the overhead encountered when sending or receiving packets (e.g., TCP, UDP, IP, medium access protocol, physical layer, and interrupt handling). This overhead can be reduced by using lightweight protocols and/or by running these functions on the host network interface, which allows the host to spend more time processing application tasks instead of performing CPU-intensive protocol functions.

3. Data Overhead: the overhead associated with copying and checksumming the data to be transferred or received. This overhead increases with the data size; when the network operates at high speed it becomes the dominant overhead, especially for large transfers. The main limiting resource is the memory-bus bandwidth, so reducing this overhead requires reducing the number of bus cycles needed for each data transfer.

Figure 5.12: Communication functions for the socket network interface (system calls and socket processing, copies between user and system buffers, TCP/IP protocol processing, data checksumming, MAC processing, device access and interrupts).
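These three classes suggest a simple first-order cost model for sending a message: a fixed per-call cost, a per-packet cost, and a per-byte cost. The numbers in the sketch below are placeholders chosen for illustration, not measurements from the text.

    def send_time_us(msg_bytes, per_call_us=50.0, per_packet_us=20.0,
                     per_byte_us=0.01, mtu=1500):
        """First-order model: application (per call) + packet (per MTU-sized packet)
        + data (per byte copied/checksummed) overheads, in microseconds."""
        packets = -(-msg_bytes // mtu)            # ceiling division
        return per_call_us + packets * per_packet_us + msg_bytes * per_byte_us

    for size in (64, 1500, 64 * 1024):
        print(size, "bytes ->", round(send_time_us(size), 1), "us")
    # Small messages are dominated by the per-call/per-packet terms,
    # large messages by the per-byte (copy and checksum) term.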
Figure 5.13 depicts the data movement that occurs during the transmission of a message; the inverse path is followed when a message is received, and the dashed line indicates the checksum calculation. For the traditional host network interface, the bus is accessed five times for every transmitted word, and even more often if the host first writes the data to a device buffer before it is sent to the network buffer. The number of transfers can be reduced to three by writing the data directly to the interface buffer instead of using the system buffer (Figure 5.13(b)); in this case the checksum is computed as the data is copied. Besides reducing bus overhead, the interface buffer allows packets to be transmitted at network speed, independent of the speed of the host bus. The use of DMA can reduce the number of transfers to two (Figure 5.13(c)) and provides the capability to support burst transfers.

Figure 5.13: Data flows (a) in the traditional network interface, (b) in a network interface with outboard buffering, (c) in a network interface with DMA.

5.3.1 Host Network Interface Architectures
The host-network interface should consume as few CPU and bus cycles as possible, so that it can communicate at higher rates while leaving more CPU cycles for application processing. Its architecture should support a variety of computer architectures (workstations, parallel machines and supercomputers, and special-purpose parallel computers), it should cost only a small fraction of the host itself, and it should run both standard and lightweight communication protocols efficiently. Existing host-network interface architectures can be broadly grouped into four categories [30]: operating system based DMA interfaces, user-level memory-mapped interfaces, user-level register-mapped interfaces, and hardwired interfaces.

1. OS-Level DMA-Based Interfaces: The DMA interface handles the transmission and reception of messages under the control of the operating system. At the hardware level, sending and receiving a message starts by initiating a DMA transfer between main memory and the network interface. At the software level, transmission is carried out by copying the message into memory and then making a SEND system call, which initiates the DMA transfer from memory to the network interface buffer; similarly, receiving a message requires the program on the receiving side to issue a RECEIVE system call. Since the operating system is involved in sending and receiving every message, the latency can be quite high, especially when UNIX socket system calls are used. Efficient network interface designs should avoid involving the operating system in sending and receiving messages as much as possible; when they do so, however, they must provide protection among the different applications, precisely because the operating system no longer mediates the transfer of messages over the network.

2. User-Level Memory-Mapped Interfaces: More recent processor-network interface designs make sending and receiving messages user-level operations. The salient feature of this technique is that the network interface buffer can be accessed with a latency similar to that of accessing a main memory buffer. In this scheme, the user process is responsible for composing the message and executing the SEND or RECEIVE command.
The host is notified of the arrival of a message either by polling the status of the network interface or by an interrupt. The host-network interface buffer can be colocated within the main memory or connected directly to the memory buses of the host.

3. User-Level Register-Mapped Interfaces: The memory-mapped network interface design can be further improved by replacing the buffer with a set of registers. Transmitting a message requires several store operations to a memory-mapped network interface buffer, and receiving a message requires several load operations; by using on-chip registers and mapping the interface into the processor's register file, many of these stores and loads can be eliminated. An arriving message can be stored in a predetermined set of registers, and the data of an outgoing message can be stored directly into another predefined set of general registers. Since registers can be accessed with lower latency than a memory buffer, mapping the processor-network interface into the register file can achieve low-overhead, high-bandwidth communication.

4. Hardwired Interfaces: In this scheme, hardware mechanisms are used to bind the sending and receiving of messages as well as the interpretation of incoming messages. This scheme is usually adopted in systems built around shared memory and/or dataflow models rather than the general message passing model. It is not suitable for a general message passing model because the user and the compiler have no control over the process of sending and receiving messages; this process is fixed and done entirely in hardware, and thus cannot be changed to optimize performance.

5.3.2 Host-Network Interface Examples

A Communication Acceleration Block (CAB)
The Communication Acceleration Block (CAB) is a host-network interface architecture designed by a group of researchers at Carnegie Mellon University. The goals of the CAB design are to minimize data copies, reduce host interrupts, support DMA and hardware checksumming, and control network access. The architecture is applicable to a variety of computer systems (supercomputers, special-purpose parallel computers, iWARP, and workstations).

Figure 5.14: Block diagram of the generic network interface.

Figure 5.14 shows a block diagram of the CAB architecture. The CAB consists mainly of two subsystems, transmit and receive. The network memory in each subsystem is used for message buffering and can be implemented using Video RAM (VRAM). System DMA (SDMA) handles data transfer between the main memory and the network memory, whereas media DMA (MDMA) handles the transfer of data between the network media and the network memory. Since the TCP and UDP protocols place the checksum in the packet header, the checksum of a transmitted packet is calculated as the data is written into network memory and is then placed in the header by the CAB at a location specified by the host as part of the SDMA request. Similarly, the checksum of an incoming packet is calculated as the data flows from the network into network memory.
To off-load the host from controlling access to the network medium, the CAB hardware performs the Medium Access Control (MAC) of the supported network under the control of the host operating system. The MAC is based on multiple "logical channels", queues of packets with different destinations; the CAB attempts to send a packet from each queue in round-robin fashion, and the exact MAC behavior is controlled by the host through retry frequency and time-out parameters for each logical channel. The register files on the transmit and receive subsystems are used to queue host requests and return tags. The host interface implements the bus protocol of the specific host; the transmit and receive subsystems can either have their own bus interfaces or share one.

Nectar Network
Nectar is a high-speed fiber-optic network developed at Carnegie Mellon University as a network backplane to support distributed and heterogeneous computing [34, 35, 36]. The Nectar system consists of a set of host computers connected in an arbitrary mesh via crossbar switches (hubs). Each host uses a communication processor (CAB: Communication Accelerator Board) as its interface to the Nectar network; a CAB is connected to a hub by a pair of unidirectional fiber optic links. The network can be extended arbitrarily by using multiple hubs, which can be interconnected in any desired topology using fiber optic pairs identical to those used for the CAB-hub connections. The network supports circuit switching, packet switching, and multicast communication over its 100 Mbps optical links.

The CAB has three major blocks, a processing unit, a network interface, and a host interface, as shown in Figure 5.15; all three have high-speed access to a packet memory. The processing unit consists of a processor, program memory, and supporting logic such as timers. The network interface consists of fiber optic data links, queues for buffering data streams, DMA channels for transmission and reception, and associated control and status logic. The host interface logic includes slave ports through which the host accesses the CAB, a DMA controller and other logic.

Figure 5.15: Block diagram of the Nectar CAB.

The packet memory has sufficient bandwidth to support access by the three major blocks at rates that keep up with the speed of the fiber; accesses to the packet memory by the different blocks are arbitrated at each cycle using an efficient round-robin mechanism to avoid conflicts. The CAB design gives the Nectar network flexible and programmable features. The Nectar network can be used as a conventional high-speed LAN by treating the CAB as a network device, or the CAB can be used as a protocol processor by off-loading transport protocol processing from the host processor; various transport protocols have been implemented on it, including TCP and TP4. Another feature of the CAB is that the application interface is provided by a programming library called Nectarine, so that part of the application code can be executed on the CAB.

5.3.3 A Tightly Coupled Processor-Network Interface
The approach presented here was developed by a group of researchers at the Massachusetts Institute of Technology and is based on a user-level register-mapped interface [30]. In this approach, the most frequent operations, such as dispatching, forwarding, replying and testing for boundary conditions, are performed in hardware, and the network interface is mapped into the processor's register file.
This approach is suitable for short messages and may not be efficient for large message sizes. Figure 5.16 shows the programmer's view of the interface, which consists of 15 interface registers together with an input message queue and an output message queue. The interface uses five output registers, o0 through o4, to send messages and five input registers, i0 through i4, to receive them; the CONTROL register contains values that control the operation of the network interface, the STATUS bits indicate its current status, and the remaining registers are used to optimize message dispatch. The input and output queues buffer messages being received or transmitted, respectively.

Figure 5.16: User-level register-mapped interface.

A message is typically assumed to be short, consisting of five words, m0 through m4, and a 4-bit type field used for optimization; the logical address of the destination processor is specified by the high bits of the first word. The network interface is controlled by SEND and NEXT commands: the SEND command queues a message from the output registers into the output queue, while the NEXT command moves a message from the input queue into the input registers. The network interface, together with the network, enforces flow control at the sending processor. If a message is long and does not fit into five words, the architecture can be extended to send and receive variable-length messages by using the input and output registers as scrolling windows; two commands, SCROLL-IN for incoming messages and SCROLL-OUT for outgoing messages, are used for this purpose. The interface also provides support for handling certain important messages (e.g., those destined for the operating system) in a privileged manner, by allowing an incoming privileged message either to interrupt the host or to be stored in a privileged memory location until the host is free to process it.

The performance of the basic architecture shown in Figure 5.16 can be further improved by several refinements: 1) use a 4-bit message type identifier instead of a 32-bit identifier; 2) avoid the overhead of copying the common parts of the message fields that are reused when replying to or forwarding part of a message, through two special modes of the SEND command, REPLY and FORWARD, which compose an outgoing message using certain input registers in place of the corresponding output registers; and 3) precompute in hardware, in the MsgIp register (see Figure 5.16), the instruction address of the handler for the incoming message, by replacing certain bits of the IpBase register with the type bits of the arrived message. A further register, NextMsgIp, overlaps the processing of one message with the dispatching of the next by computing the handler address for the next message just as MsgIp does for the current one.
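The toy model below mimics the register-mapped interface just described in software: SEND copies the output registers into the output queue, and NEXT loads the next queued message into the input registers. It is a behavioral sketch with hypothetical names, not the MIT hardware design.

    from collections import deque

    class RegisterMappedNI:
        """Software model of a register-mapped network interface (illustrative)."""
        def __init__(self):
            self.o = [0] * 5           # output registers o0..o4
            self.i = [0] * 5           # input registers  i0..i4
            self.out_q = deque()       # output message queue (to the network)
            self.in_q = deque()        # input message queue (from the network)

        def send(self, msg_type):
            # SEND: queue the five output words plus the 4-bit type field.
            self.out_q.append((msg_type & 0xF, list(self.o)))

        def next(self):
            # NEXT: load the next received message into the input registers.
            if not self.in_q:
                return False           # status bit: no message pending
            msg_type, words = self.in_q.popleft()
            self.i[:] = words
            return msg_type

    # Usage: loop one message back through a pretend network.
    ni = RegisterMappedNI()
    ni.o[:] = [0x0100, 11, 22, 33, 44]     # high bits of word 0 = destination node
    ni.send(msg_type=3)
    ni.in_q.append(ni.out_q.popleft())     # the "network" delivers the message
    print(ni.next(), ni.i)                 # 3 [256, 11, 22, 33, 44]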
This interface design can be implemented in several different ways.

Off-Chip Cache-Based Implementation: This implementation maps the network interface into a second-level off-chip cache (see Figure 5.17(a)), so the interface becomes another data cache chip on the processor's external data cache bus. It is easy to implement, since it requires no modification of the processor chip, but it is slower than an on-chip interface.

On-Chip Cache-Based Implementation: This implementation is identical to the previous one, except that the network interface sits on an internal data cache bus rather than an external one (see Figure 5.17(b)). Although the network interface is added to the processor chip, it does not modify the processor core and communicates with the processor only via the internal cache bus. Network interface accesses are somewhat faster because the interface is on-chip.

Register-File-Based Implementation: The network interface registers take up part of the processor's register file and can be accessed like any other scalar register. The network interface commands are encoded into unused bits of every triadic (three-register) instruction, so it may take no additional cycles to access up to three network interface registers and to send commands to the network interface. This makes it the most efficient of the interfaces considered.

Figure 5.17: Three implementations of the network interface: (a) off-chip cache based, (b) on-chip cache based, (c) register-file based.

5.3.4 ATOMIC Host Interface
ATOMIC is a point-to-point interface supporting Gbps data rates, developed at USC/Information Sciences Institute [?]. The goal of this design is to develop LANs based on MOSAIC technology [?] that support fine-grain, message-passing, massively parallel computation. Each MOSAIC chip can route variable-length packets as a fast and smart switching element, while providing added value through simultaneous computing and buffering.

The MOSAIC-C Chip
The architecture of a MOSAIC-C chip is illustrated in Figure 5.18. Each MOSAIC chip has a processor and associated memory; this processing capability can be used to filter messages, execute protocols, and arrange for data to be delivered in the form expected by an application or virtual device specification. The chip communicates over eight external channels, four in the X-direction (east, west) and four in the Y-direction (south, north), and all eight channels may be active simultaneously. Unless a MOSAIC-C node is the source or destination of a message, messages pass through its Asynchronous Router logic on their way to other nodes without interrupting the processor; when a node is the source or destination, packet data is transferred by a DMA controller in the packet interface. The MOSAIC chip acts as the switching element of ATOMIC and is interconnected in a multi-star configuration (see Figure 5.19). In this design, a message sent from node 1 to node 7 does not interfere with messages sent from node 4 to node 6 or from node 5 to node 2, nor does the message sent from node 8 to node 9 interfere with it.

Figure 5.18: MOSAIC-C processor chip (640 Mb/s channels, 2 Kbyte ROM, 64 Kbyte RAM, 14-MIPS processor, packet interface, asynchronous router).
Figure 5.19: Message transfer in MOSAIC channels.

The MOSAIC host-interface board is a versatile, multi-purpose circuit board that houses a number of MOSAIC nodes, with MOSAIC channels on one side and a microprocessor-bus interface on the other; it is shown in Figure 5.20.

Figure 5.20: SBus host-interface board built from memoryless MOSAIC chips.

The purposes of the host-interface board are as follows:

To verify the operation of the MOSAIC processor, router, and packet interface, fabricated together in the memoryless MOSAIC chip.

To provide a software-development platform. The bus interface on the board presents the MOSAIC memory as a bank of ordinary RAM. A device driver and support libraries allow low-level access to the host-interface boards by user programs on the host, which map the MOSAIC memory into their address space.

To serve as the interface between hosts and MOSAIC node arrays, as shown in Figure 5.21. The ribbon-cable connectors in Figure 5.20 are wired to selected channels on the memoryless MOSAIC chips. Programs running on the memoryless MOSAIC nodes communicate with the MOSAIC node array by message passing through the ribbon-cable connectors, and with the host computer by shared memory through the bus interface. Host-interface boards can also be chained together using ribbon cables, forming a linear array of MOSAIC nodes; this is the basis for using MOSAIC components to build a LAN.

Figure 5.21: Standard connection of MOSAIC host interface and arrays.

ATOMIC

ATOMIC has attributes not commonly seen in current LANs:

Hosts do not have absolute addresses. Packets are source routed, relative to the senders' positions. At least one host process is an Address Consultant (AC), which can provide a source route to any host on the LAN by mapping IP addresses to source routes.

ATOMIC consists of multiple interconnected clusters of hosts. There are many alternate routes between a source and a destination. This flexibility, exploited by an AC, provides bandwidth guarantees or minimizes switch congestion for high-bandwidth flows, and allows load balancing across a cluster.

Since ATOMIC traffic flows do not interfere with each other unless they share links, the aggregate performance of the entire network is limited only by its configuration.

Each MOSAIC processor allows the network itself to perform complex functions such as encryption or protocol conversion. Topological flexibility and programmability make ATOMIC suitable for a wide range of applications.

ATOMIC supports the IP protocol and therefore all the communication protocols above it, such as UDP, TCP, ICMP, TELNET, FTP, SMTP, etc. An example of an ATOMIC network built from MOSAIC chips and connected to an external LAN is illustrated in Figure 5.22.

Figure 5.22: Netstation LAN Topology.
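To make the role of the Address Consultant concrete, the sketch below shows one possible way an IP address could be translated into a source route, i.e., a list of hops expressed relative to the sender's position. The structures, table, and names are invented for illustration and are not part of the ATOMIC design.

```c
/* Hypothetical illustration of ATOMIC-style source routing via an
 * Address Consultant (AC) mapping IP addresses to relative routes. */
#include <stdint.h>
#include <stdio.h>

enum hop { EAST, WEST, NORTH, SOUTH };       /* MOSAIC X/Y channels */

struct route { uint32_t ip; int len; enum hop hops[8]; };

/* Tiny stand-in for the AC's IP-address-to-source-route mapping. */
static const struct route ac_table[] = {
    { 0x0A000007u, 3, { EAST, EAST, NORTH } },   /* 10.0.0.7, example only */
};

static const struct route *ac_lookup(uint32_t ip)
{
    for (unsigned i = 0; i < sizeof ac_table / sizeof ac_table[0]; i++)
        if (ac_table[i].ip == ip)
            return &ac_table[i];
    return 0;                                    /* unknown: consult the AC */
}

int main(void)
{
    const struct route *r = ac_lookup(0x0A000007u);
    if (r)
        for (int i = 0; i < r->len; i++)
            printf("hop %d: direction %d\n", i, (int)r->hops[i]);
    return 0;
}
```

Because routes are relative and many alternates exist between clusters, an AC could hand out different routes for different flows to balance load or avoid congested switches.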
Host-Interface Processor (HIP)

HIP is a communication processor developed at Syracuse University [?] that supports two modes of operation; either or both modes can be active at a given time. HIP is a master/slave multiprocessor system designed to run both standard and non-standard protocols. In the High-Speed Mode (HSM), the HIP provides applications with a data rate close to that offered by the network medium. This high-speed transfer rate is achieved by using a high-speed communication protocol (e.g., HCP [?]). Figure 5.23 shows a block diagram of the main functional units of the proposed HIP. The HIP design consists of five major subsystems: a Master Processing Unit (MPU), a Transfer Engine Unit (TEU), a crossbar switch, and two Receive/Transmit Units (RTU-1, RTU-2). The architecture of HIP is highly parallel and uses hardware multiplicity and pipelining to achieve high-performance transfer rates. For example, the two RTUs can be configured to transmit and/or receive data over high-speed channels while the TEU is transferring data to/from the host. In what follows, we describe the main tasks carried out by each subsystem.

Figure 5.23: Block diagram of HIP.

5.3.5 Master Processing Unit (MPU)

The HIP is a master/slave multiprocessor system in which the MPU controls and manages all the activities of the HIP subsystems. The Common Memory (CM) is a dual-port shared memory that can be accessed by the host through the host's standard bus; it is also used to store the control programs that run on the MPU. The MPU runs the software that provides an environment in which the two modes of operation (HSM and NSM) can be supported, and it also executes several parallel activities (receive/transmit from/to the host, receive and/or transmit over the D-net, and receive/transmit over the normal network). The main tasks of the MPU are outlined as follows.

HIP manager: This involves configuring the subsystems to operate in a certain configuration and allocating and deallocating processes to the HIP processors (RTUs). Furthermore, for the NSM, the HIP manager assigns one RTU to receive and/or transmit over the normal-speed channel in order to maintain compatibility with the standard network and to reduce communication latency.

HLAN manager: This involves setting up a cluster of computers to cooperate in a distributed manner to execute a compute-intensive application using the D-net dedicated to the HSM of operation. The computer that owns this application uses the D-net to distribute the decomposed subtasks over the involved computers, synchronize their computations, and collect the results from these computers.

Synchronizer: This involves arranging the execution order of HIP processes on the RTUs and TEU such that their asynchronous parallel execution does not produce erroneous results or deadlock scenarios.

HLAN general management: This involves collecting information about network activities to guarantee certain performance and reliability requirements. These tasks are related to network configuration, performance management, fault detection and recovery, accounting, security, and load balancing.

Transfer Engine Unit (TEU)

The communication between the host and the HIP is based on a request/reply model; the host is relieved of controlling all aspects of the communication process. The initiation of a data transfer is done by the host through the Common Memory (CM), and the completion of the transfer is notified through an interrupt to the host. The TEU can be implemented simply as a Direct Memory Access Controller (DMAC). A protocol similar to that used in the VMP network adapter board [37] can be adapted to transport messages between the host and the HIP. For example, to transmit data, the host initiates a request by writing a Control Block (CB) into the CM of the MPU. The CB contains pointers to the data to be sent, the destination addresses, and the type of transmission mode (HSM or NSM).
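A minimal sketch of what such a Control Block might look like is given below; the field names and layout are assumptions chosen to match the description above, not the actual HIP data structures.

```c
/* Hypothetical layout of a Control Block (CB) written by the host
 * into the HIP's Common Memory to request a transmission. */
#include <stdint.h>

enum hip_mode { HIP_HSM = 0, HIP_NSM = 1 };   /* high-/normal-speed mode */

struct hip_control_block {
    uint32_t      data_ptr;      /* address of the data to be sent        */
    uint32_t      data_len;      /* length of the data in bytes           */
    uint32_t      dest_addr;     /* destination address                   */
    enum hip_mode mode;          /* HSM or NSM transmission               */
    uint32_t      status;        /* filled in by the MPU on completion    */
};
```

The MPU would parse such a block, program the TEU/DMAC with the data pointer and length, select an RTU through the crossbar switch, and finally report completion through the status field and a host interrupt.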
The MPU then sets up the TEU and the crossbar switch, which involves selecting one of the two Host-to-Network Memory (HNM) modules available in each RTU. While the data is being written into the selected HNM, the RTP of that unit can start transmitting data over the supported channels according to the type of transmission. Similar activities are performed to receive data from the network and deliver it to the host.

Switch

This is a crossbar switch that provides the maximum number of simultaneous connections among the TEU, MPU, and RTUs. The use of local buses in the MPU and RTUs allows any component of these subsystems to be accessed directly through the switch.

Receive/Transmit Unit (RTU)

The main task of the RTU is to offload the host from the process of transmitting and receiving data over the two channels. At any given time, the RTU can be involved in several asynchronous parallel activities: (1) receive and/or transmit data over the normal-speed channel according to standard protocols; (2) receive and/or transmit data over the D-net or the S-net according to a high-speed protocol. Furthermore, the packet pipeline significantly reduces the packet processing latency during the high-speed mode and consequently increases the overall system throughput. Otherwise, the tasks of decoding, encoding, data encryption, and checksumming would have to be done by the RTP, which would adversely affect the performance of the RTU.

Host-to-ATM-Network Interface (AIB)

The host-to-ATM-network interface board (AIB) implements the functions of the ATM layer and the AAL layer. The communication protocol model for an ATM network consists of the physical layer, the ATM layer, the AAL layer, and the upper layers. The physical layer is based on the SONET transmission standard. The ATM layer and AAL layer together are equivalent to the data link layer in the ISO model [?], although the ATM switches implement routing functions that belong to the network layer. The upper layers above the AAL represent the protocol stack of the user and control applications, and they could implement the functions of TCP/IP as well as other standard protocols.

The functions of the ATM layer are multiplexing/demultiplexing cells of different VPI/VCIs, relaying cells at intermediate switches, and transporting cells between two AAL peers. The ATM layer also implements flow control functions, but it does not perform error control. The purpose of the AAL is to provide the capabilities necessary to meet the user-layer data transfer requirements while using the service of the ATM layer. This protocol provides the transport of variable-length frames (up to 65535 bytes in length) with error detection. The AAL is further divided into two sublayers: the segmentation and reassembly (SAR) sublayer and the convergence sublayer (CS) [22]. The SAR sublayer segments data packets received from the higher layers into ATM cells at the transmitting side and performs the inverse operation at the receiving side. The CS is service-dependent, and it could perform functions such as message identification.
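To make the cell format concrete, the sketch below builds a 53-byte ATM cell (5-byte header plus 48-byte payload) following the standard UNI header layout (GFC, VPI, VCI, PTI, CLP, HEC). The helper code is an illustration only; it is not taken from the AIB design.

```c
/* A 53-byte ATM cell: 5-byte header + 48-byte payload (UNI format). */
#include <stdint.h>
#include <string.h>

#define ATM_PAYLOAD 48

struct atm_cell {
    uint8_t header[5];
    uint8_t payload[ATM_PAYLOAD];
};

/* Header Error Control: CRC-8 (x^8 + x^2 + x + 1) over the first four
 * header octets, XORed with 0x55 as specified for the ATM HEC. */
static uint8_t hec_crc8(const uint8_t hdr[4])
{
    uint8_t crc = 0;
    for (int i = 0; i < 4; i++) {
        crc ^= hdr[i];
        for (int b = 0; b < 8; b++)
            crc = (uint8_t)((crc & 0x80) ? (crc << 1) ^ 0x07 : (crc << 1));
    }
    return (uint8_t)(crc ^ 0x55);
}

void atm_make_cell(struct atm_cell *c, uint8_t gfc, uint8_t vpi,
                   uint16_t vci, uint8_t pti, uint8_t clp,
                   const uint8_t body[ATM_PAYLOAD])
{
    c->header[0] = (uint8_t)((gfc << 4) | (vpi >> 4));
    c->header[1] = (uint8_t)(((vpi & 0x0F) << 4) | ((vci >> 12) & 0x0F));
    c->header[2] = (uint8_t)((vci >> 4) & 0xFF);
    c->header[3] = (uint8_t)(((vci & 0x0F) << 4) | ((pti & 0x7) << 1) | (clp & 1));
    c->header[4] = hec_crc8(c->header);
    memcpy(c->payload, body, ATM_PAYLOAD);
}
```

The SAR sublayer of the AAL would simply cut a variable-length frame into 48-byte bodies (or 44-byte bodies when per-cell sequence numbers are carried, as noted later) and feed them to a routine of this kind.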
Implementation of the ATM Layer and AAL Layer

Recently, there has been increased interest in ATM technology and in designing ATM host interfaces. Davie at Bellcore described an ATM interface design for the AURORA testbed environment in [23]; the interface connects DEC 5000 workstations to SONET. Traw and Smith at the University of Pennsylvania designed a host interface for ATM networks [24] that connects IBM RS/6000 workstations to SONET. These two designs connect two classes of workstations to the AURORA testbed by utilizing high-speed I/O buses and a DMA mechanism. However, as the network speed increases, data access from the host's memory can become the bottleneck. Assuming the memory access time is 100 ns and the memory width is 32 bits, the memory access rate is 320 Mbits/second, which cannot support networks with speeds greater than 320 Mbits/second. To solve this problem, we use the concept of direct cache access and allow the network interface to access the cache directly while messages are being sent and received. This idea has been used in designing a host-to-network interface for a parallel computer, the MSC (Message Controller), which is used with the AP1000 (an MIMD machine) [25]. The MSC works as a cache controller when there is no networking activity. During message-passing scenarios, the MSC directs data from the cache to the network interface, but it does not transfer data from the network interface to the cache, in order to avoid unnecessary cache updating.

The method of direct cache access couples the host and the network interface more tightly, and thus avoids waiting for data to be updated in main memory. For systems using a write-back policy (for instance, IBM RS/6000s), this approach is particularly meaningful. Importantly, copying data from the cache is much faster than from main memory. Assume that the cache access time is 10 ns, ten times faster than the memory access time, and that the width of the cache controller is 32 bits; then the data transfer rate can accommodate a network speed of 3.2 Gbits/second. Even considering the cache contention between the host's CPU and the network interface, the cache miss ratio, and other cache access overhead, the data transfer rate can still match the STS-12 rate (622 Mbps). With high-speed transmission lines and high-speed networking protocols, it is possible to have remote data access latency comparable to that experienced in main-memory data access; the network-based distributed system can then be a cost-effective, high-performance alternative for a parallel computing environment.

To perform the functions associated with the ATM layer and AAL layer, the AIB is designed to communicate with the upper-layer protocols efficiently through a shared memory, and to move data between the host's cache/memory and the network in the form of 53-byte ATM cells. Figure 5.24 shows the block diagram of the AIB and its connection to the host.

Figure 5.24: Block Diagram of the Interface Between the AIB and the Host.

The AIB consists of two units, the Message Transmitting Unit (MTU) and the Message Receiving Unit (MRU). Each message is segmented into fixed-size cells (53 bytes each) or reassembled from a series of ATM cells. As discussed in the previous section, the VPI/VCI and PTI together carry the routing information. The intermediate switches along a source-to-destination path relay cells to their destinations based on the information carried in the cell headers. If a switch recognizes that a certain cell is destined to its associated cluster, the switch directs the cell to the MRU of that node. The MRU then dispatches the received cells to the appropriate message-handling processes according to the VCI carried in the header.
In existing computers it is common for the CPU to be connected to CMMUs (Cache/Memory Management Units) through a processor bus interface, and for the CMMUs to be interfaced to main memory through the M-Bus (memory bus). Within a CMMU there are basically two components: the cache, and the MMU (Memory Management Unit), which is responsible for address translation and for moving data between the processor and main memory. In our design, we require the M-Bus to be connected to the network interface. In the AIB, both the MTU and the MRU have a DM/CA controller that is used to move data between the network and the host. The DM/CA controller is connected to the M-Bus and can be considered to have two parts according to its functionality: the first part is associated with main memory and performs the DMA mechanism, and the second part is associated with the cache and implements the DCA (Direct Cache Access) mechanism.

On the transmitting side, the DCA controller communicates with the MMUs and requests the data from the cache. When the DCA controller gets a cache hit, it moves data from the CMMU to the network FIFOs; if a cache miss occurs, the DMA controller transfers data from main memory to the network FIFOs. On the receiving side, the DCA controller writes data into the cache if the addressed line is already in the cache; otherwise, the DMA mechanism is initiated to store the data into main memory. The DM/CA is not allowed to cause cache updates. Because both the host's CPU and the network interface want to access the cache, contention may arise over the cache control and the M-Bus, which could lead to performance degradation. This contention is expected to be low, however, because microprocessor systems are designed with a multiple-level memory hierarchy of register file, cache, main memory, and disk.
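The hit/miss policy just described can be summarized by the small model below. The "cache" here is simulated by a tiny tag array so the sketch is self-contained; in the real AIB these checks are CMMU/M-Bus operations carried out in hardware, not software.

```c
/* Illustrative model of the DM/CA controller's receive-side policy:
 * write into the cache only if the line is already present, otherwise
 * fall back to DMA into main memory (never allocate new cache lines). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINES 64
#define LINE_SHIFT 5                      /* 32-byte lines (assumed) */
static uintptr_t tag[LINES];              /* simulated cache tags    */

static bool cache_probe(uintptr_t addr)   /* is the line cached?     */
{
    uintptr_t line = addr >> LINE_SHIFT;
    return tag[line % LINES] == line;
}

static void dmca_receive(uintptr_t dst)
{
    if (cache_probe(dst))
        printf("DCA: write received data into cached line of %#lx\n",
               (unsigned long)dst);
    else
        printf("DMA: store received data into main memory at %#lx\n",
               (unsigned long)dst);
}

int main(void)
{
    tag[(0x1000u >> LINE_SHIFT) % LINES] = 0x1000u >> LINE_SHIFT; /* warm a line */
    dmca_receive(0x1000);                 /* hit  -> direct cache access */
    dmca_receive(0x8000);                 /* miss -> DMA to main memory  */
    return 0;
}
```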
Message Transmitting Unit

Figure 5.25 shows the block diagram of the MTU, which is responsible for processing message transmission. The upper layers communicate with the AIB software layers (AAL and ATM) based on a request/response communication model: the host writes its command (a control block request) into a shared memory module (SM). Upon receiving the command, the Transmitting Processor (T-Processor) performs the message-sending functions associated with the AAL protocol. It initializes the DM/CA by supplying the operation to be performed (a read operation), the memory address, and the number of bytes to transfer. In this ATM interface, for each message to be sent, the T-Processor commands the DM/CA controller to move the next ATM cell body from the designated memory area. The T-Processor loads the DM/CA instructions (commands) into the Command Buffer (see Figure 5.25), from which they are fetched by the DM/CA controller. While the DM/CA controller moves data out, the T-Processor concurrently computes the headers for the cells and puts them in the header FIFO. The first 4 bytes of each header are composed by the upper-layer protocols and the network management components; the fifth byte, the checksum, is calculated by the T-Processor from the first 4 bytes. These 5 bytes are written into the header FIFO, and cell bodies are stored in the cell-body FIFO. For each pair of cell header and cell body, a Cell Composer concatenates the two parts and delivers the completed cell to an STS-3c Framer. The Framer puts the cells into a frame based on SONET standards and transmits the frame over the network; it thus provides the functions of the SONET Transmission Convergence (TC) sublayer. For the SONET STS-3c structure, this Framer provides cell transport at 149.76 Mbps and an information payload transport rate of 135.632 Mbps. The Framer has an 8-bit-wide data input and a number of control signals that indicate when data is required and when it is time to provide the start of a new cell [23]. The FIFO controllers read the data out of the FIFOs and send it to the STS-3c Framer. If a cell sequence number is necessary, the Cell Composer inserts the sequence number for each cell; in this case, each cell body is 44 bytes.

Figure 5.25: AIB Structure.

5.3.6 Message Receiving Unit

The MRU is a microprocessor-based system that has a Receiving Processor (R-Processor), a DM/CA controller, and other hardware circuitry. To process arriving messages efficiently, we design the MRU to be a message-driven receiver. In other words, the operation of the MRU is not controlled by the host, but by the received messages. This concept is referred to as "active message dispatching": messages are dispatched to the corresponding message-handling processes by extracting the information carried in the cell header. The VCI field is mapped to the thread associated with the requested service and, based on that thread, the data portion of a message is transferred from the network FIFOs into the host's cache or memory. The data is sent to the cache only if the addressed line is already in the cache; if the addressed line is not in the cache, it is unlikely to have been used recently, so the data is sent to main memory. This scheme prevents the MRU from updating the cache and causing unnecessary cache misses. The host is notified when the data transfer is complete.

The active message dispatching mechanism avoids waiting for a matching READ issued by the host; the data is processed and stored while other computations are performed. There are several advantages to this approach. First, it can overlap computation and communication, because arrived messages can be transferred into the host's memory area without interrupting the host CPU. Second, active message dispatching does not involve the host's operating system, because message processing is performed off the host, by the AIB. Consequently, message passing based on active message dispatching results in low-latency communication.

The architecture of the MRU is shown in Figure 5.25. Upon receiving ATM cells from the STS-3c Framer, a Cell Splitter separates each cell header from its body. First, the HEC field in the header is checked by a hardware component, the HEC Checker. If an error is detected, the corresponding message is dropped; the HEC Checker reports the error to the R-Processor, which in turn notifies the host, and the upper-layer protocols implemented in the host system request a retransmission. The ATM layer protocol is not responsible for error recovery or retransmission. If the cell passes HEC checking, the cell sequence is reassembled into an AAL-PDU. The R-Processor then commands hardware circuitry to perform further AAL protocol processing, such as checking the length of the message and the CRC field in the AAL-PDU (as explained in Section 3). If the CRC calculation does not yield the correct polynomial, the message has been corrupted somehow, and an Error Corrector tries to correct the error. If the attempt at error correction fails, the message is dropped and the host is notified. Another case is when some cells are lost during transmission: the length of the message will not match the value in the length field, and again the message is dropped and a notification is sent to the host.
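The sketch below illustrates the idea of active message dispatching: incoming messages are steered to a handler selected by their VCI without involving the host's operating system. The table layout, handler signature, and function names are hypothetical illustrations of the mechanism, not the AIB's actual data structures.

```c
/* Illustrative model of VCI-based active message dispatching in the
 * MRU: the VCI in the cell header selects the handler (thread) that
 * consumes the payload.  All names here are invented. */
#include <stdint.h>
#include <stdio.h>

#define MAX_VCI 256

typedef void (*msg_handler_t)(const uint8_t *payload, int len);

static msg_handler_t dispatch_table[MAX_VCI];   /* VCI -> handler */

static void register_service(uint16_t vci, msg_handler_t h)
{
    dispatch_table[vci % MAX_VCI] = h;
}

static void dispatch(uint16_t vci, const uint8_t *payload, int len)
{
    msg_handler_t h = dispatch_table[vci % MAX_VCI];
    if (h != NULL)
        h(payload, len);      /* hand off without interrupting the host */
    else
        printf("VCI %u: no registered service, message dropped\n",
               (unsigned)vci);
}

static void rpc_reply_handler(const uint8_t *payload, int len)
{
    (void)payload;
    printf("RPC reply of %d bytes delivered to the waiting thread\n", len);
}

int main(void)
{
    uint8_t data[44] = {0};
    register_service(42, rpc_reply_handler);
    dispatch(42, data, (int)sizeof data);   /* dispatched by VCI      */
    dispatch(7, data, (int)sizeof data);    /* unknown VCI -> dropped */
    return 0;
}
```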
The communication between the host and the MRU is carried out via the SM. Upon receiving information from the MRU, the SM controller notifies the host using either an interrupt or a polling scheme. If the message is received error-free, the R-Processor further processes the message according to the header information. For flow control messages, it is necessary for the MRU to communicate with the MTU, because the MTU is the unit that controls cell transmission; the MTU can stop cell transmission when congestion occurs. If an arrived message carries a request for the value of a variable in a certain distributed application, the MRU lets the MTU fetch the value and transmit it back to the remote waiting process, because the MTU is responsible for fetching data from the host's cache/memory and transmitting it to the remote node. The connectivity between the MTU and MRU further offloads the host from communication tasks, since the AIB can handle communication jobs more independently.

5.4 Hardware Implementations of Standard Transport Protocols

The emergence of new high-speed networks has shifted the bottleneck from the network's bandwidth to the processing of the communication protocols. There has been increased interest in the development of new communication subsystems that are capable of utilizing high-speed networks. The current implementations of the standard communication protocols are efficient when networks operate at several kbps; however, their performance cannot match the bandwidth of high-speed networks that operate at megabit or gigabit speeds. One research approach is to improve the implementation of communication subsystems based on the existing standard protocols [68, 69, 70]. This approach focuses on off-loading protocol processing from the host and uses external microprocessor-based systems to do this task. Parallelism in communication protocols is also viewed as a means to boost the performance of the communication subsystems. In this section, we analyze the performance of different TCP/IP implementations and compare them. The simulation tool OPNET [79] is used in modeling and analyzing all the implementation techniques discussed in this section.

5.4.1 Base Model: TCP/IP Runs on the Host CPU

It is largely agreed that protocol implementation plays a significant role in the performance of communication subsystems [62, 66]. One needs to analyze the various factors that contribute to the delay associated with data transmission in order to identify efficient techniques for improving the implementations of standard protocols. The analysis presented in this section uses a system that consists of two nodes, each running at 15 MIPS. The nodes communicate with each other over a point-to-point communication link with a bandwidth of 500 Mbps. Each node runs TCP/IP and has a running user process. One of the nodes transmits a series of segments to the other node; segment generation is modeled as a Poisson process. Each node executes approximately 300 instructions of fast-path TCP/IP [62], and it is assumed that these 300 instructions map to 400 RISC instructions.
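Under these assumptions, the per-segment protocol processing time at each node can be estimated directly; the figure below is a simple illustrative calculation, not a simulation result.

T_{TCP/IP} \approx \frac{400 \text{ RISC instructions}}{15 \times 10^{6} \text{ instructions/s}} \approx 26.7\ \mu\text{s per segment at each node}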
In this implementation, TCP/IP runs on the host CPU, as shown in Figure 5.26. A user process sends a message by making a system call to invoke the communication process (TCP). The host CPU then performs a context switch between the user process and the communication process [72]. Data to be sent is copied from the user memory space to the kernel memory space, and the TCP/IP processes are then executed to generate the segments that encapsulate the outgoing data. A cut-through memory management scheme is assumed, which minimizes the overhead of data copying: instead of moving the whole segment between TCP and IP, only a reference to the segment is passed [63]. Segments ready to be sent are then copied to the network buffer. The memory used in this analysis is assumed to have an access time of 60 ns per word (a 32-bit word), and the checksumming overhead is 50 ns/byte. Note that in this model the host CPU is involved in both data processing and copying.

Figure 5.26: TCP/IP Run on the Host CPU.

Upon arrival, segments are transferred from the network buffer to the kernel space of the host memory. Since the host CPU is used to move data, two memory cycles are needed for every transferred word. After the incoming segment is processed by TCP/IP, the data is transferred from the kernel space to the user space. The complete sequence of events in transmitting and receiving segments is illustrated in Figure 5.27. For each sent (or received) segment, the CPU makes a context switch between the user process and the communication process. Based on the analysis presented in [64, 65], we can estimate the context switching overhead to be approximately 150 microseconds.

Figure 5.27: Application-to-application latency (COPY = Tu + Tn + Ru + Rn, CONTEXT = Tc + Rc, CHECKSUM = Ts + Rs, TCP/IP = Tp + Rp, TRANS = Tt).

The end-to-end delay associated with segment transmission can be divided into three classes of factors:

Per-byte delay factors:
1. The data copying, "COPY".
2. The checksumming, "CHECKSUM".
3. The transmission media, "TRANS".

Per-segment delay factors:
1. The protocol processing, "TCP/IP".
2. The context switching, "CONTEXT".

Per-connection delay factor: This factor is negligible for large, stream-oriented data transfers.

Figure 5.28 shows the contribution of these delay factors to the overall segment transmission time; the cumulative end-to-end delay is illustrated for different segment sizes. Figure 5.29 presents the throughput achieved for different segment sizes. For a segment size of 4096 bytes, the estimated throughput is around 50 Mbps. With a network bandwidth of 500 Mbps, the effective bandwidth is around 10% of the network bandwidth; therefore, this model is inefficient when the network operates at several hundreds of Mbps. As shown in Figure 5.28, the data copying, checksumming, and context switching delays are the main bottlenecks in this model.

Figure 5.28: Cumulative Delay vs. Segment Size.
Figure 5.29: Throughput vs. Segment Size.
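To make the decomposition of Figure 5.27 concrete, the sketch below computes the five delay components for a given segment size from the parameters stated above (60 ns per 32-bit memory word, 50 ns/byte checksumming, a 500 Mbps link, 400 RISC instructions at 15 MIPS, and roughly 150 microseconds of context-switching overhead). How many copies and how much context-switch cost are charged per segment is an assumption of this illustration, so its absolute numbers should not be read as reproducing the simulation results of Figures 5.28 and 5.29.

```c
/* Back-of-the-envelope model of the per-segment delay components of
 * the base model.  Parameter values are taken from the text; the
 * per-segment accounting of copies and context switches is assumed. */
#include <stdio.h>

int main(void)
{
    double seg = 4096.0;                      /* segment size in bytes       */
    double t_word = 60e-9;                    /* memory access, 32-bit word  */
    double copy_per_byte = 2 * t_word / 4.0;  /* CPU copy: 2 cycles per word */
    int    n_copies = 4;                      /* Tu, Tn, Rn, Ru (assumed)    */
    double checksum_per_byte = 2 * 50e-9;     /* Ts + Rs                     */
    double trans_per_byte = 8.0 / 500e6;      /* 500 Mbps link               */
    double tcpip = 2 * 400.0 / 15e6;          /* Tp + Rp (see estimate above)*/
    double context = 150e-6;                  /* Tc + Rc (assumed total)     */

    double copy  = n_copies * copy_per_byte * seg;
    double csum  = checksum_per_byte * seg;
    double trans = trans_per_byte * seg;
    double total = copy + csum + trans + tcpip + context;

    printf("COPY=%.0f us  CHECKSUM=%.0f us  TRANS=%.0f us  "
           "TCP/IP=%.0f us  CONTEXT=%.0f us  total=%.0f us\n",
           copy * 1e6, csum * 1e6, trans * 1e6,
           tcpip * 1e6, context * 1e6, total * 1e6);
    return 0;
}
```

Even with these rough assumptions, the per-byte terms (copying and checksumming) dominate for large segments, which is exactly the behavior reported for the base model.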
5.4.2 TCP/IP Runs Outside the Host

The context switch overhead can be drastically reduced (if not eliminated) if the protocols run on a dedicated processor outside the host CPU. Data copying can also be decreased by using DMA instead of the host CPU. Off-loading the processing of communication protocols from the host CPU increases the CPU time available to the user processes. Figure 5.30 shows a block diagram of a front-end processor that runs TCP/IP and is attached to the host via a high-speed interface. Data is copied between the network and protocol buffers using DMA, and DMA is also used to copy data between the protocol and host buffers. Checksumming is done by additional VLSI circuitry on the communication processor, and the checksum is computed while data is being copied between the protocol and network buffers. This configuration leads to a significant improvement over the base model, as illustrated in Figures 5.31 and 5.32: for a segment size of 4096 bytes, the throughput is around 220 Mbps, in contrast to only 50 Mbps using the base model. Figure 5.31 also shows that the data copying factor increases with segment size and becomes the main bottleneck.

Figure 5.30: TCP/IP-based Communication Subsystem.
Figure 5.31: Cumulative Delay vs. Segment Size.
Figure 5.32: Throughput vs. Segment Size.

5.4.3 Interleaved Memory Communication Subsystem

Here, an efficient and simple technique is used to reduce the data copying delay. The approach is based on the idea of memory interleaving to implement the subsystem's buffers. In this approach, the memory is partitioned into a number of independent modules. There are several configurations of memory interleaving; the one used in this model is called low-order interleaving. In this configuration, the low-order bits of the memory address are used to select the memory module, while the higher-order bits are used to select (read/write) the data within a particular module. Assume the memory is partitioned into k modules: address lines a_0 through a_{m-1} (where k = 2^m) are used to select a particular module, while the remaining address lines (a_m through a_{n-1}) are used to select the data word within the module. The k modules, each of size 2^{n-m} words, give a total memory size of 2^n words. Figure 5.33 shows an 8-way interleaved memory system.

Figure 5.33: Interleaved Memory System.
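For example, with 8-way low-order interleaving (k = 8, m = 3), the module and word selection can be expressed as in the short sketch below; the module count is a stated assumption matching the 8-way example, and the code only illustrates the address split.

```c
/* Low-order interleaving address split: the low m bits pick the
 * module, the remaining bits pick the word within that module. */
#include <stdint.h>
#include <stdio.h>

#define M 3
#define K (1u << M)                 /* k = 2^m = 8 modules */

int main(void)
{
    uint32_t addr   = 0x1234;
    uint32_t module = addr & (K - 1);   /* a_0 .. a_(m-1)  */
    uint32_t word   = addr >> M;        /* a_m .. a_(n-1)  */
    printf("address %#x -> module %u, word %u\n",
           (unsigned)addr, (unsigned)module, (unsigned)word);
    /* Consecutive words fall in consecutive modules, so up to k
     * accesses can be overlapped and the effective access time
     * approaches ta/k, as derived in the equations that follow. */
    return 0;
}
```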
Let t_a be the memory access time. The total time T_sum needed to access l words in this memory system is:

T_sum = t_a + (l - 1) * (t_a / k)

For large l, T_sum ≈ (l - 1) * (t_a / k). The average access time of this memory system, T_access, is then

T_access = T_sum / l ≈ t_a / k

and the bandwidth is

BW = k / t_a  words/s.

Figure 5.34 illustrates the sequence of memory accesses in the interleaved memory system.

Figure 5.34: Memory Accesses Sequence.

Performance Analysis

In this model, both the network and protocol buffers are configured as 8-way interleaved memories, as shown in Figure 5.35. The main improvement lies in the reduced data copying overhead for segment transfers between the network and protocol buffers: the time needed for this operation is approximately 1/8 of the time needed in the previous model. The elapsed time to move data between the protocol and host buffers remains the same as in the previous model. Figures 5.36 and 5.37 show the estimated cumulative end-to-end delay and throughput for this model. The achievable throughput (see Figure 5.37) is approximately 340 Mbps using a segment size of 4096 bytes; for a segment size of 512 bytes, the throughput achieved is around 115 Mbps.

Figure 5.35: Interleaved Memory Communication Subsystem.
Figure 5.36: Cumulative Delay vs. Segment Size.
Figure 5.37: Throughput vs. Segment Size.

5.4.4 TCP/IP Runs on a Multiprocessor System

Applying parallelism in the design of communication subsystems is an important approach to achieving the high performance needed in today's distributed computing environment. Zitterbart [61] discussed the different levels and types of parallelism that are typically applied to communication subsystem design. We adopt a hybrid parallelism approach, which is based on the layer and packet levels. In layer parallelism, different layers of the hierarchical protocol stack are executed in parallel, whereas in packet parallelism, a pool of processing units is used to process incoming (and outgoing) packets concurrently.

5.4.5 Parallel Architectural Design

The parallel implementation approach presented in this section is similar to that discussed in [66]. In the design shown in Figure 5.38, we use one processor (IP proc) to handle the IP processing and four transport processors (proc 1, proc 2, proc 3, and proc 4) to handle the TCP processing. On the arrival of a segment, IP proc executes the IP protocol; then one of the transport processors is selected, according to a round-robin scheduling policy, to run TCP for the arrived segment. Therefore, multiple segments can be processed concurrently using different transport processors.
Since IP processing is approximately 20% of the total TCP/IP processing time [62], four processors are utilized to run TCP and one processor to run IP (see Figure ??). The other modules of this design are:

Figure 5.38: Parallel Communication Subsystem.
Figure 5.39: The Sequence of Segment Processing.

Shared Memory: This memory block is shared by the transport processors to keep the context records of the established connections.

Shared Memory Processor (S_mem Proc): This processor has two main tasks: shared memory management and acknowledgment of the received segments.

DMA Controllers: DMA controllers are used to move data between the different buffers: segment transfers between the network buffer and the IP proc buffer, segment transfers between the IP proc buffer and the transport processor buffers, and data transfers between the transport processor buffers and the host buffer.

Context Records

A stream of transmitted data segments has an inherent sequential ordering. For this reason, some of the connection state variables must be stored in the shared memory module. In TCP, we identify the elements of the Transmission Control Block (TCB) that should be shared between the transport processors. These elements are kept in a context record that is maintained in the shared memory for each established connection. Each context record consists of the following fields (see Figure 5.40):

Source and destination port numbers: Used as an ID for the established connection.

Receive_next: The sequence number of the next expected segment. This variable is used to generate the acknowledgments for received segments.

Send_unacked: The highest sequence number of the segments that have been sent but not yet acknowledged. The acknowledgment field of arrived segments is used to update this variable.

Send_next: The sequence number of the next segment to be transmitted.

Local window information: Corresponds to the buffer space reserved for this connection.

Remote window information.

Each of the subsequent entries in the context record shown in Figure 5.40 corresponds to a received segment and consists of three fields: a pointer to the starting address of the segment, the segment size, and the initial sequence number of the segment. For each segment arrival, the transport processor accesses the shared memory to manipulate the connection state variables according to the information carried in the arrived segment; it also adds an entry containing the three fields mentioned above. S_mem Proc periodically accesses the context records of the established connections to look for contiguous blocks in the received segments. It updates "receive_next" accordingly and appends the required acknowledgment to a segment traveling in the reverse direction. If there is no such segment, it generates a separate acknowledgment.
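A sketch of such a context record as a C structure is shown below; the field widths and the fixed-size list of pending segments are illustrative assumptions based on the fields listed above and Figure 5.40.

```c
/* Illustrative layout of a per-connection context record kept in the
 * shared memory of the parallel TCP/IP design (cf. Figure 5.40). */
#include <stdint.h>

#define MAX_PENDING_SEGS 32            /* assumed capacity */

struct seg_entry {
    uint32_t start_addr;     /* pointer to the buffered segment         */
    uint32_t size;           /* segment size in bytes                   */
    uint32_t seq_num;        /* initial sequence number of the segment  */
};

struct context_record {
    uint16_t src_port, dst_port;   /* connection identifier             */
    uint32_t send_next;            /* next sequence number to transmit  */
    uint32_t send_unacked;         /* highest unacknowledged sequence   */
    uint32_t receive_next;         /* next expected sequence number     */
    uint32_t local_window;         /* space reserved for the connection */
    uint32_t remote_window;        /* advertised by the peer            */
    uint32_t n_pending;            /* received but not yet acknowledged */
    struct seg_entry pending[MAX_PENDING_SEGS];
};
```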
Due to the existence of this shared resource, a conflict can arise between the transport processors when accessing the shared memory: if more than one processor tries to access the shared memory (for a write operation), only one is granted access and the others must wait.

Figure 5.40: Context Records.

Performance Analysis

In order to analyze the throughput offered by the parallel TCP/IP implementation, an estimate of the contention overhead caused by accessing the shared memory must be computed. The approach used here is similar to that presented in [6]. Assume that the processing cycle is T (as shown in Figure 5.39), which corresponds to approximately 240 instructions (the number of instructions used to process TCP). About 15 instructions are used to access the shared memory for the processing of every arrived segment. With 4 processors, this amounts to 60 (4 x 15) instructions of shared memory accesses, or 25% (60/240) of T; therefore, only 25% of the shared memory accesses are expected to experience contention. The contention penalty is estimated to be four additional instructions. Figure 5.41 shows the throughput achieved using this implementation approach for different segment sizes. For a segment size of 4096 bytes, the throughput achieved is approximately 375 Mbps; for a segment size of 512 bytes, the throughput is around 195 Mbps. Since this approach increases performance by parallelizing the TCP processing, it is more effective when the TCP processing time plays a more significant role in the overall end-to-end delay, that is, when segment sizes are small.

Figure 5.41: Throughput vs. Segment Size.
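The sketch below is a software analogue of the scheduling discipline just analyzed: each segment receives its IP processing on one unit and its TCP processing on a transport processor chosen round robin, while updates to the shared context record are confined to a small critical region. It is written purely for illustration; the real design distributes this work across dedicated hardware processors.

```c
/* Software analogue of the parallel design's round-robin dispatch and
 * shared-memory discipline; for illustration only. */
#include <stdio.h>

#define N_TCP_PROC 4

static unsigned receive_next;            /* shared context-record field */

static void update_shared_state(unsigned seq)
{
    /* In the real design this access is arbitrated by the shared
     * memory; roughly 15 of the ~240 TCP instructions per segment
     * fall in this region, which bounds the contention. */
    if (seq == receive_next)
        receive_next = seq + 1;
}

int main(void)
{
    for (unsigned seq = 0; seq < 8; seq++) {
        int proc = (int)(seq % N_TCP_PROC);      /* round-robin choice */
        /* ip_process(seq); tcp_process(proc, seq); ...                */
        update_shared_state(seq);
        printf("segment %u -> transport processor %d\n", seq, proc);
    }
    printf("receive_next = %u\n", receive_next);
    return 0;
}
```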
5.4.6 Discussion

In this section, a performance analysis of four different TCP/IP implementation models has been presented. A simulation model was built to estimate the performance measures, and the contribution of the different delay factors to the overall end-to-end data transmission delay was identified. We introduced an efficient approach to implementing the communication subsystem's buffers based on the memory interleaving concept, as well as a parallel TCP/IP implementation in which multiple processors are employed to process segments concurrently.

Figure 5.42 shows the estimated throughput for the four models discussed in this section. In the interleaved memory communication subsystem (the third model), the estimated throughput is around 340 Mbps using a segment size of 4096 bytes, compared to 220 Mbps in the second implementation model, an increase of 55%. On the other hand, for a segment size of 512 bytes, the achieved throughput is around 115 Mbps, in contrast to 98 Mbps using the second model, an increase of only 17%. This indicates that the interleaved memory communication subsystem provides a significant performance improvement for larger segment sizes. Using the parallel implementation approach (the fourth model), the throughput achieved is approximately 375 Mbps for a segment size of 4096 bytes, an increase of only 10% over the interleaved memory implementation approach. On the other hand, for a segment size of 512 bytes, the throughput is around 195 Mbps, which represents an increase of about 70%. Therefore, as the segment size increases, the throughput difference between the interleaved memory approach and the parallel implementation approach diminishes (see Figure 5.42), while the throughput difference between the interleaved memory approach and the second approach grows. In general, using the interleaved memory concept to implement the communication subsystem's buffers can significantly reduce the data copying delay, which is the main bottleneck in the overall end-to-end data transmission delay for stream-oriented, bulk data transfers where large segment sizes are used.

Figure 5.42: Throughput vs. Segment Size.

Bibliography

[1] E. Biagioni, E. Cooper, and R. Sansom, "Designing a Practical ATM LAN," IEEE Network, vol. 7, no. 2, pp. 32-39.
[2] Many, "Network Compatible ATM for Local Network Applications - Phase 1, V1.01," 19 Oct 1992.
[3] D. J. Greaves and D. McAuley, "Private ATM Networks," IFIP Transactions C (Communication Systems), vol. C-9, pp. 171-181.
[4] G. J. Armitage and K. M. Adams, "Prototyping an ATM Adaptation Layer in a Multimedia Terminal," International Journal of Digital and Analog Communication Systems, vol. 6, no. 1, pp. 3-14.
[5] F. A. Tobagi and T. Kwok, "Fast Packet Switch Architectures and the Tandem Banyan Switching Fabric," High-Capacity Local and Metropolitan Area Networks, pp. 311-344.
[6] F. A. Tobagi, "Fast Packet Switch Architectures for Broadband Integrated Services Digital Networks," Proceedings of the IEEE, vol. 78, no. 1, pp. 133-167, January 1990.
[7] J.-Y. Le Boudec, "The Asynchronous Transfer Mode: A Tutorial," ?????.
[8] H. Ahmadi et al., "A High Performance Switch Fabric for Integrated Circuit and Packet Switching," in Proceedings of INFOCOM'88, New Orleans, LA, March 1988, pp. 9-18.
[9] C. P. Kruskal and M. Snir, "The Performance of Multistage Interconnection Networks for Multiprocessors," IEEE Trans. Computers, vol. C-32, no. 12, pp. 1091-1098, Dec. 1983.
[10] M. Kumar and J. R. Jump, "Performance of Unbuffered Shuffle-Exchange Networks," IEEE Trans. Computers, vol. C-35, no. 6, pp. 573-577, June 1986.
[11] V. E. Benes, "Optimal Rearrangeable Multistage Connecting Networks," Bell Systems Technical Journal, vol. 43, no. 7, pp. 1641-1656, July 1964.
[12] K. E. Batcher, "Sorting Networks and Their Applications," in AFIPS Proceedings of the 1968 Spring Joint Computer Conference, vol. 32, pp. 307-314.
[13] H. Suzuki, H. Nagano, T. Suzuki, T. Takeuchi, and S. Iwasaki, "Output-Buffer Switch Architecture for Asynchronous Transfer Mode," in Proceedings of the International Conference on Communications, Boston, MA, June 1989, pp. 4.1.1-4.1.5.
[14] Y. Yeh, M. G. Hluchyj, and A. S. Acampora, "The Knockout Switch: A Simple, Modular Architecture for High-Performance Packet Switching," IEEE Journal on Selected Areas in Communications, vol. SAC-5, no. 8, October 1987.
[15] H. Kanakia and D. R. Cheriton, "The VMP Network Adapter Board (NAB): High Performance Network Communication for Multiprocessors," Proc. of the SIGCOMM '88 Symposium on Communications Architectures and Protocols, pp. 175-187, ACM, August 1988.
[16] E. A. Arnould et al., "The Design of Nectar: A Network Backplane for Heterogeneous Multicomputers," Proc. Third Conf. on Architectural Support for Programming Languages and Operating Systems, Boston, MA, April 1989.
[17] H. T. Kung et al., "Network-Based Multicomputers: An Emerging Parallel Architecture," Supercomputing Conf., November 1991.
[18] G. Finn, "An Integration of Network Communication With Workstation Architecture," ACM Computer Communication Review, Vol. 21, No. 5, October 1991.
[19] H. T. Kung, "Gigabit Local Area Networks: A System Perspective," IEEE Communication Magazine, pp. 79-89, April 1992.
[20] W. J. Dally and C. L. Seitz, "Deadlock-free Message Routing in Multiprocessor Interconnection Networks," Computer Science Department, California Institute of Technology, Technical Report 5231:TR:86, 1986.
[21] H. Sullivan et al., "A Large Scale Homogeneous Machine," Proc. 4th Annual Symposium on Computer Architecture, 1977, pp. 105-124.
[22] N. K. Cheung, "The Infrastructure for Gigabit Computer Networks," IEEE Communication Magazine, pp. 60-68, April 1992.
[23] B. S. Davie, "A Host-Network Interface Architecture for ATM," Proc. ACM SIGCOMM '91, Zurich, September 1991.
[24] C. B. S. Traw and J. M. Smith, "A High Performance Host Interface for ATM Networks," Proc. ACM SIGCOMM '91, Zurich, September 1991.
[25] T. Shimizu et al., "Low-latency Message Communication Support for the AP1000," 1992 ACM, pp. 288-297.
[26] J. N. Giacopelli et al., "Sunshine: A High-Performance Self-Routing Broadband Packet Switch Architecture," IEEE Journal on Selected Areas in Communications, Vol. 9, No. 8, pp. 1289-1298, October 1991.
[27] M. Akata et al., "A 250 MHz 32 x 32 CMOS Crosspoint LSI with ATM Switching Function," NEC Research Journal, 1991.
[28] A. Pattavina, "Multichannel Bandwidth Allocation in a Broadband Packet Switch," IEEE Journal on Selected Areas in Communications, pp. 1489-1499, Dec. 1988.
[29] Peter A. Steenkiste et al., "A Host Interface Architecture for High-Speed Networks," High Performance Networking, IV (C-14), IFIP, pp. 31-46, 1993.
[30] Dana S. Henry and Christopher F. Joerg, "A Tightly-Coupled Processor-Network Interface," ASPLOS V, 1992 ACM, ?
[31] T. F. La Porta and M. Schwartz, "Architectures, Features, and Implementation of High-Speed Transport Protocols," IEEE Network Magazine, pp. 14-22, May 1991.
[32] Z. Haas, "A Communication Architecture for High-Speed Networking," IEEE INFOCOM, San Francisco, pp. 433-441, June 1990.
[33] A. Tantawy, H. Hanafy, M. E. Zarky, and Gowthami Rajendran, "Toward A High Speed MAN Architecture," IEEE International Conference on Communications (ICC'89), pp. 619-624, 1989.
[34] O. Menzilcioglu and S. Schilck, "Nectar CAB: A High-Speed Network Processor," Proceedings of the International Conference on Distributed Systems, pp. 508-515, July 1991.
[35] E. C. Cooper, P. A. Steenkiste, R. D. Sansom, and B. D. Zill, "Protocol Implementation on the Nectar Communication Processor," Proceedings of the SIGCOMM Symposium on Communications Architecture and Protocols, pp. 135-144, August 1990.
[36] H. T. Kung et al., "Parallelizing a New Class of Large Applications over High-speed Networks," Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 167-177, April 1991.
[37] H. Kanakia and D. R. Cheriton, "The VMP Network Adapter Board: High-performance Network Communication for Multiprocessors," Proceedings of the SIGCOMM Symposium on Communications Architectures and Protocols, pp. 175-187, August 1988.
[38] D. R. Cheriton and C. L. Williamson, "VMTP as the Transport Layer for High-Performance Distributed Systems," IEEE Communication Magazine, pp. 37-44, June 1989.
[39] G. Chesson, "The Protocol Engine Design," Proceedings of the Summer 1987 USENIX Conference, pp. 209-215, November 1987.
[40] XTP 3.4, "Xpress Transfer Protocol Definition - 1986 Revision 3.4," Protocol Engines Inc., July 1989.
[41] W. T. Strayer, B. J. Dempsey, and A. C. Weaver, XTP: The Xpress Transfer Protocol, Addison-Wesley, 1992.
[42] M. S. Atkins, S. T. Chanson, and J. B. Robinson, "LNTP - An Efficient Transport Protocol For Local Area Networks," Proceedings of Globecom'88, pp. 2241-2246, 1988.
[43] W. S. Lai, "Protocols for High-Speed Networking," Proceedings of the IEEE INFOCOM, San Francisco, pp. 1268-1269, June 1990.
[44] D. D. Clark, M. L. Lambert, and L. Zhang, "NETBLT: A High Throughput Protocol," Proceedings of SIGCOMM'87, Computer Communications Review, Vol. 17, No. 5, pp. 353-359, 1987.
[45] N. Jain, M. Schwartz, and T. R. Bashkow, "Transport Protocol Processing at GBPS Rates," Proceedings of the SIGCOMM Symposium on Communications Architecture and Protocols, pp. 188-198, August 1990.
[46] D. D. Clark and D. L. Tennenhouse, "Architectural Considerations for a New Generation of Protocols," Proceedings of the ACM SIGCOMM Symposium on Communications Architecture and Protocols, pp. 200-208, September 1990.
[47] M. Zitterbart, "High-Speed Transport Components," IEEE Network Magazine, pp. 54-63, January 1991.
[48] A. N. Netravali, W. D. Roome, and K. Sabnani, "Design and Implementation of a High-Speed Transport Protocol," IEEE Trans. on Communications, pp. 2010-2024, November 1990.
[49] A. S. Tanenbaum, Computer Networks, 2nd Edition, Prentice-Hall, 1988.
[50] M. Schwartz, Telecommunication Networks: Protocols, Modeling and Analysis, Addison-Wesley, 1988.
[51] D. D. Clark et al., "An Analysis of TCP Processing Overhead," IEEE Communications Magazine, pp. 23-29, June 1989.
[52] S. Heatley and D. Stokesberry, "Analysis of Transport Measurements Over a Local Area Network," IEEE Communications Magazine, pp. 16-22, June 1989.
[53] L. Zhang, "Why TCP Timers Don't Work Well," Proceedings of the ACM SIGCOMM Symposium on Communications Architecture and Protocols, pp. 397-405, 1986.
[54] C. Partridge, "How Slow Is One Gigabit Per Second?" Computer Communication Review, Vol. 20, No. 1, pp. 44-52, January 1990.
[55] W. A. Doeringer et al., "A Survey of Light-Weight Transport Protocols for High-Speed Networks," IEEE Trans. Communications, Vol. 38, No. 11, pp. 2025-2039, November 1990.
[56] H. T. Kung, "Gigabit Local Area Networks: A Systems Perspective," IEEE Communications Magazine, pp. 79-89, April 1992.
[57] I. Richer, "Gigabit Network Applications," Proceedings of the IEEE INFOCOM, San Francisco, p. 329, June 1990.
[58] C. E. Catlett, "In Search of Gigabit Applications," IEEE Communications Magazine, pp. 42-51, April 1992.
[59] J. S. Turner, "Why We Need Gigabit Networks," Proceedings of the IEEE INFOCOM, pp. 98-99, 1989.
[60] E. Biagioni, E. Cooper, and R. Sansom, "Designing a Practical ATM LAN," IEEE Network, pp. 32-39, March 1993.
[61] M. Zitterbart, "Parallelism in Communication Subsystems," IBM Research Report RC 18327, Sep. 1992.
[62] D. Clark, V. Jacobson, J. Romkey, and H. Salwen, "An Analysis of TCP Processing Overhead," IEEE Communications Magazine, Vol. 27, pp. 23-29, June 1989.
[63] C. M. Woodside and J. R. Montealegre, "The Effect of Buffering Strategies on Protocol Execution Performance," IEEE Trans. on Communications, Vol. 37, pp. 545-553, June 1989.
[64] E. Mafla and B. Bhargava, "Communication Facilities for Distributed Transaction-Processing Systems," IEEE Computer, pp. 61-66, August 1991.
[65] M. S. Atkins, S. T. Chanson, and J. B. Robinson, "LNTP - An Efficient Transport Protocol for Local Area Networks," Proceedings of Globecom'88, Vol. 2, pp. 705-710, 1988.
[66] N. Jain, M. Schwartz, and T. Bashkow, "Transport Protocol Processing at GBPS Rates," Proceedings of ACM SIGCOMM'90, pp. 188-199, August 1990.
[67] H. Kanakia and D. Cheriton, "The VMP Network Adaptor Board (NAB): High-Performance Network Communication for Multiprocessors," Proceedings of ACM SIGCOMM'88, pp. 175-187, August 1988.
[68] E. Rutsche and M. Kaiserswerth, "TCP/IP on the Parallel Protocol Engine," Proceedings of the IFIP Fourth International Conference on High Performance Networking, pp. 119-134, Dec. 1992.
[69] K. Maly, S. Khanna, R. Mukkamala, C. Overstreet, R. Yerraballi, E. Foudriat, and B. Madan, "Parallel TCP/IP for Multiprocessor Workstations," Proceedings of the IFIP Fourth International Conference on High Performance Networking, pp. 103-118, Dec. 1992.
[70] O. G. Koufopavlou, A. Tantawy, and M. Zitterbart, "Analysis of TCP/IP for High Performance Parallel Implementations," Proceedings of the 17th Conference on Local Computer Networks, pp. 576-585, Sep. 1992.
[71] K. Hwang and F. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill, 1984.
[72] D. E. Comer and D. L. Stevens, Internetworking with TCP/IP, Vol. I, Design, Implementation, and Internals, Prentice-Hall, 1991.
[73] R. M. Sanders and A. Weaver, "The Xpress Transfer Protocol (XTP) - A Tutorial," ACM Computer Communication Review, Vol. 20, pp. 67-80, Oct. 1990.
[74] D. D. Clark, M. Lambert, and L. Zhang, "NETBLT: A High Throughput Transport Protocol," Proceedings of ACM SIGCOMM'87, pp. 353-359, August 1987.
[75] W. A. Doeringer, D. Dykeman, M. Kaiserswerth, B. Meister, H. Rudin, and R. Williamson, "A Survey of Light-Weight Transport Protocols for High-Speed Networks," IEEE Trans. on Communications, Vol. 38, pp. 2025-2039, Nov. 1990.
[76] T. F. La Porta and M. Schwartz, "Architectures, Features, and Implementation of High-Speed Transport Protocols," IEEE Network Magazine, pp. 14-22, May 1991.
[77] H. Meleis and D. Serpanos, "Designing Communication Subsystems for High-Speed Networks," IEEE Network Magazine, pp. 40-46, July 1992.
[78] T. La Porta and M. Schwartz, "A High-Speed Protocol Parallel Implementation: Design and Analysis," Proceedings of the IFIP Fourth International Conference on High Performance Networking, pp. 135-150, Dec. 1992.
[79] A. Cohen et al., OPNET Modeling Manual, MIL 3, Inc., 1993.

Chapter 7

Remote Procedure Calls

7.1 Introduction

Remote Procedure Calls (RPC) are one paradigm for expressing control and data transfer in a distributed system. As the name implies, a remote procedure call invokes a procedure on a machine that is remote from the one where the call originates. In its basic form, an RPC identifies the procedure to be executed, the machine it is to be executed on, and the arguments required.
The application of this RPC model results in a client/server arrangement, where the client is the application that issues the call and the server is the processor that handles the call. The RPC mechanism differs from the general IPC model in that the processes are typically on remote machines. RPC offers a simple means for programmers to write software that utilizes system-wide resources (including processing power) without having to deal with the tedious details of network communication. It offers the programmer an enhanced and very powerful version of the most basic programming paradigm, the procedure call. Each RPC system has five main components: compile-time support, a binding protocol, a transport protocol, a control protocol, and a data representation[3]. In order to implement a reliable and efficient RPC mechanism, there are several issues that the designer must address in each component: the semantics of a call in the presence of a computer or communication failure, the semantics of address-containing arguments in the absence of a shared memory space, the integration of remote procedure calls into existing programming systems, binding, server manipulation, transport protocols for the transfer of data and control between the caller and the callee, data integrity and security, and error handling[1]. This chapter addresses the RPC mechanism and the requirements to produce an efficient, easy-to-use, and semantically transparent mechanism for distributed applications.

7.2 RPC Basic Steps

RPC may be viewed as a special case of the general message passing model of interprocess communication[?]. Message-based IPC involves a process sending a message to another process on another machine, but it does not necessarily need to be synchronized at either the sender or the receiver process. Synchronization is an important aspect of RPC, since the mechanism models the local procedure call: RPC passes parameters in one direction to the process that will execute the procedure, the calling process is blocked until execution is complete, and the results are returned to the calling process. Each RPC system strives to produce a mechanism that is syntactically and semantically transparent to the different languages being used to develop distributed applications. A remote call is syntactically transparent when its syntax is exactly the same as that of the local one, and it is semantically transparent when the semantics of the call are the same as the local one. Syntactic transparency is achievable, but total semantic transparency is a challenging task[?].

The steps taken when a remote procedure call is executed are illustrated in Figure 7.1[?]. The user process makes a normal local procedure call, which invokes a client-stub procedure that corresponds to the actual procedure call. The client stub packs the calling parameters into a message and passes it to the transport protocol, which transmits the message to the server over the communication network. The server's transport protocol receives the message and passes it to the server stub that corresponds to the procedure being invoked. The server stub unpacks the calling parameters and then calls the procedure that needs to be executed. Once execution is complete, the response message is packed with the results returned by the procedure and sent back to the client via the server's transport protocol. The message is received by the client's transport protocol, which passes it to the client stub to unpack the results and return control to the local call with the results in the parameters of the procedure call[?].
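The sketch below illustrates what a hand-written client stub for a trivial remote procedure might look like, following the steps above. The wire format, the transport helpers send_request()/recv_reply(), and the procedure identifier are all invented for illustration and do not correspond to any particular RPC system.

```c
/* Hypothetical client stub for a remote  int add(int a, int b) :
 * it marshals the arguments, hands the message to the transport,
 * blocks for the reply, and unmarshals the result (cf. Figure 7.1). */
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>          /* htonl/ntohl: network byte order */

#define PROC_ADD 1              /* assumed procedure identifier */

/* Transport layer assumed to be provided elsewhere. */
int send_request(const void *msg, size_t len);
int recv_reply(void *msg, size_t maxlen);

int add(int a, int b)           /* looks like a local call to the caller */
{
    uint32_t req[3], rep[1];

    /* Client stub: pack (marshal) the procedure id and arguments. */
    req[0] = htonl(PROC_ADD);
    req[1] = htonl((uint32_t)a);
    req[2] = htonl((uint32_t)b);
    send_request(req, sizeof req);

    /* The calling process blocks here until the reply arrives. */
    recv_reply(rep, sizeof rep);

    /* Unpack (unmarshal) the result and return it to the caller. */
    return (int)ntohl(rep[0]);
}
```

A server stub would perform the mirror image of these steps: unpack the two integers, call the local add procedure, pack the result, and send the reply back to the client.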
In the process described above, the crucial point to appreciate is that, despite the large number of complicated activities involved (most of which have been shown here in simplified form), they are entirely transparent to the client procedure, which invokes the remotely executed procedure as if it were a local one. Contrast this with a client that must perform a remote function without the use of RPC. Such a program would, at a minimum, have to deal with the syntax and implementation details of creating and communicating over sockets, ensuring that the communication is reliable, handling errors, converting to and from network byte order during transmission, and breaking voluminous data into smaller pieces to fit the limits of socket transmission. This comparison shows the importance of RPC to a programmer who wishes to develop distributed applications in a simpler, more convenient, and less error-prone fashion.

Figure 7.1: Main steps of an RPC (caller machine: user and user-stub; server machine: server-stub and server; the RPCRuntime on each side exchanges call and result packets over the network; the caller packs arguments and waits, the server unpacks arguments, does the work, and packs the result).

7.3 RPC Design Issues

7.3.1 Client and Server Stubs

One of the integral parts of the RPC mechanism is the pair of client and server stubs. Their purpose is to manipulate the data contained in the request and response messages so that it is suitable for transmission over the network and for use by the receiving process. The process of taking the arguments and results of a procedure call, assembling them for transmission across the network, and disassembling them on arrival is known as marshaling[4]. In a remote call, the client stub packs the calling arguments into a format suitable for transmission and unpacks the result parameters in the response message after the procedure has completed. The server stub unpacks the parameter list of the request message and packs the result parameters into the response message.

7.3.2 Data Representation

The architecture of the machine on which the remote procedure is executed need not be the same as that of the client's machine. The two machines may therefore support different data representations, such as ASCII or EBCDIC. Even if both machines use the same character representation, they may differ in the byte lengths used for different data types; for example, an integer may be represented as four bytes on the client and eight bytes on the server. To further complicate matters, some machines use different byte orderings for their data structures. To overcome these incompatibilities in data representation on different machines, most RPC systems define a special data representation of their own, which is used when transmitting data structures between machines during a remote call and when returning from one.
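As an illustration of such a canonical over-the-wire format, the sketch below encodes an integer and a string in the spirit of formats like XDR (big-endian integers, length-prefixed data padded to four-byte boundaries). It is a hand-rolled example added for this text, not Sun's actual XDR library.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Encode a 32-bit integer in the canonical (big-endian) form. */
static size_t encode_int(unsigned char *out, int32_t v) {
    uint32_t net = htonl((uint32_t)v);
    memcpy(out, &net, 4);
    return 4;
}

/* Encode a string as length + bytes, padded to a 4-byte multiple. */
static size_t encode_string(unsigned char *out, const char *s) {
    uint32_t len = (uint32_t)strlen(s);
    size_t n = encode_int(out, (int32_t)len);
    memcpy(out + n, s, len);
    size_t pad = (4 - len % 4) % 4;
    memset(out + n + len, 0, pad);
    return n + len + pad;
}

int main(void) {
    unsigned char buf[64];
    size_t n = encode_int(buf, 42);
    n += encode_string(buf + n, "file.txt");
    printf("encoded %zu bytes in canonical form\n", n);
    return 0;
}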
When making a remote call, the client must convert the procedure arguments into the special data representation and the server decodes the incoming data into the data representation that is locally supported by the host. Since both client and server both understand the same special data representation, the above problem is solved. An example of special data representation is the Sun XDR (External Data Representation). The data representation may employ either implicit or explicit typing. In implicit typing, only the value of a data element is transmitted across the network, not the type. This approach is used by most of the major RPC systems such as Sun RPC, Xerox Courier and Apollo RPC. The explicity typing is the ISO technique for data representation (ASN.1 (Abstract Syntax Notation I)). With this representation the type of each data eld is transmitted along with its value. The encoded type also contains the length of the value being sent. The disadvantage of such an approach is Draft: v.1, April 4, 1994 284 CHAPTER 7. REMOTE PROCEDURE CALLS the overhead spent in the decoding. There are three dieret types for representing data in RPC systems: Network Type, Client Type, and Server Type. Network Type: In this type, the entity that is sending the data converts the data from the local format to a network format and the receiving entity converts the network format into its local format. This approach is attractive when we try to realize heterogeneous RPC systems. If the number of dierent architectures on the network is large and varied, this is the logical choice when designing the system. The disadvantage is that a change of format has to be performed twice for each transmission. Client Type: In this type, the client transmits the data across the network in its own local format. However it also sends a short encoded eld (typically a byte) which indicates its own data format. The server must then convert the received data by calling an appropriate conversion routine based on the encoded information of the client's data representation. The advantage of this approach is that the process of conversion of data representation now occurs only once (at the server end), instead of twice as in the network type data representation. The disadvantage of this approach is that the server must now be provided with the capability of being able to convert a variety of dierent data representations into its own. Whenever a client machine with a new data representation is added to the network, all the servers must be modied and provided an additional routine to be able to convert the new data representation of the new machine to their own. Server Type: In this approach the client is required to know the data representation format of the server and to convert the data to the server's representation before transmitting the data. In this way only one conversion is done (at the client end). The client can determine the server's data representation if this information can be procured from the binder. For this to be possible, the server must inform the binder of its data representation when it registers itself with the binder. 7.3.3 RPC Interface Stubs may be generated either manually or automatically. If generated manually, it becomes the server programmer's responsibility to write separate code to act as a stub. The programmer is in a position to handle relatively complex parameter passing fairly easily. 
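For instance, the following sketch (added here; the one-byte format tag and all names are hypothetical) shows the kind of conversion logic a hand-written stub might apply under the "client type" strategy described above: the sender labels the data with its own byte order, and the receiver converts only when the formats differ.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

enum { FMT_BIG_ENDIAN = 0, FMT_LITTLE_ENDIAN = 1 };

/* Discover the local byte order at run time. */
static int local_format(void) {
    uint16_t probe = 1;
    return *(unsigned char *)&probe ? FMT_LITTLE_ENDIAN : FMT_BIG_ENDIAN;
}

/* Sender side: one tag byte followed by the 4-byte integer, unconverted. */
static void pack_int(unsigned char buf[5], uint32_t value) {
    buf[0] = (unsigned char)local_format();
    memcpy(buf + 1, &value, 4);
}

/* Receiver side: convert only when the sender's format is foreign. */
static uint32_t unpack_int(const unsigned char buf[5]) {
    uint32_t value;
    memcpy(&value, buf + 1, 4);
    if (buf[0] != local_format())
        value = (value >> 24) | ((value >> 8) & 0xff00) |
                ((value << 8) & 0xff0000) | (value << 24);   /* byte swap */
    return value;
}

int main(void) {
    unsigned char wire[5];
    pack_int(wire, 123456);
    printf("received %u\n", unpack_int(wire));   /* same host: no conversion */
    return 0;
}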
Automatically generated stubs require the existence of a parameter description language, which we refer to as an Interface Description Language (IDL). In effect, the IDL is used to define the client-server procedure call interface. Interface definitions are then processed to produce the source code for the stubs, which can be compiled and linked (as in manual generation) with the client or server code. Both approaches are shown diagrammatically in Figure 7.2.

Figure 7.2: Creation of stubs: (a) manually, the programmer writes the stub source code, which is compiled and linked with the client/server code; (b) automatically, the client/server interface, written in the parameter description language, is processed to produce the stub source code, which is then compiled and linked with the client/server code.

The interface language provides a number of scalar types, such as integers, reals, Booleans, and characters, together with facilities to define structures. If the RPC system is part of a language that already supports interface procedure definitions, such as Cedar RPC, an interface language is not needed[4]. When RPC is added to a language with no interface definition capabilities, such as C or Pascal, an interface language is required. Examples include the Amoeba Interface Language (AIL, used in the Amoeba system) and rpcgen (used in the Sun RPC system). Definitions in the interface language can be compiled into a number of different languages, enabling clients and servers written in different languages to communicate using RPC via the interface compiler. When the interface definition is compiled with the client or server application code, a stub procedure is generated for each procedure in the interface definition. The interface language should support as many programming languages as possible. The interface compiler is also the basis for integrating remote procedure calls into existing programming languages. When an RPC mechanism is built into a particular language, both client and server applications are required to use that language, which limits the flexibility and expandability of different and new services in the system.

7.3.4 Binding

An important design issue of an RPC facility is how a client determines the location and identity of the server. Without the location of the server, the client's remote procedure call cannot take place. We refer to this process as binding. Binding has two aspects. First, the client must locate a server that will execute the procedure, even when the server's location changes[2]. Second, the mechanism must ensure that the client and server RPC interfaces were compiled from the same interface definition[2]. There are two approaches to implementing the binding mechanism.

1. Compile the address of the server into the client code at compile time. This approach is not practical, since the location of a resource can change at any time due to server failures or servers simply being moved. If any of these conditions occurs, the clients would have to be recompiled with the new server address. This approach allows no flexibility in RPC systems.

2. Use a binder to locate the server address for each client at run time. The binder usually resides on a server and contains a table of the names and locations of all currently supported services, together with a checksum that identifies the version of the RPC interface used at the time of the export.
Also an instance identier will be needed to dierentiate identical servers that export the same RPC interfaces. When a server starts executing, its interface is exported to the binder with the information specied above to uniqely identify its procedures. Also, the server must be able withdraw their own instances by informing the binder. When the clients start executing, it can be bound to an instance of a server by making a request to the binder for particular procedures. This request will contain a checksum of the parameters it expects so the binder can ensure the client and server were built with the same RPC interface denition. This request only needs to be made once, not for every call to the procedure. The only other time the client should make any binding request is if it detects a crash of one of the servers it uses. In addition to any functional requirements, the design of the binder must be robust against failures and should not become a performance bottleneck. One solution to this issue was used by the Cedar RPC system. It involved distributing the binder among several servers and replicating information across the servers[1]. Consequently, the binding mechanism of the system did not rely totally on one server's operations. This allows the binder to continue operation when a server that contains a binder carshes. The one drawback of the binder spread across multiple servers is the added complexity needed to control these binders. Draft: v.1, April 4, 1994 7.3. RPC DESIGN ISSUES 287 7.3.5 Semantics of Remote Calls One main goal of RPC systems is to provide a transparent mechanism to access remote services. Semantic transparency is an important design consideration in RPC systems. Syntatic transparency can be achieved by using the interface denition and stub generators, but semantic transparency in two areas have to be addressed[2]. Call Semantics Call semantics determine how the remote procedure is to be performed in the presence of a machine or communication failure. There are four reasons that lead to the failure of a remote procedure once its initiated: the call request message can be lost, the server can crash, the response message can be lost, and the client can crash. Under normal circumstances, we would expect a remote call to result in the execution of the procedure exactly once; that is the semantics of a local call. However, whenever network communication is involved it is not rare to encounter abnormal circumstances. As a result the ideal semantics of local procedure calls are not always realized. RPC systems typically may have three dierent types of call semantics. 1. At Least Once: This implies that the remote procedure executed at least once, but possibly more than once. This semantics will prevent the client from waiting for time-out indenitely since it has has a time-out mechanism on the completion of the call. To guarantee that the RPC is executed, the client will repeatedly send the call request to the server whenever the timeout is reached, until a reply message is received or it sure that the server has crashed. This type of call semantics is used in some systems mainly because it is simple to implement. However, this semantics has the drawback of having the procedure being executed more than one time. This property makes this type of call semantics inappropriate to apply to a banking system; if a person is withdrawing $100 dollars and it is executed several times, the problem is obvious. 
A possible solution when using at-least-once semantics is to design an idempotent interface, in which no error occurs if the same procedure is executed several times.

2. At Most Once: This implies that the procedure may be executed once or not at all, except when the server crashes. If the remote call returns normally, we can conclude that the remote procedure was executed exactly once. If an abnormal return is made, it is not known whether the procedure was executed or not; the possibility of multiple executions, however, is ruled out. With this semantics the server needs to keep track of client requests in order to detect duplicates, and it must be able to return the old results in the response message if a response is lost. This semantics therefore requires a more complex protocol. At-most-once call semantics is used in the Cedar RPC system[1, 4].

3. Exactly Once: This implies that the procedure is always executed exactly once, even when the server crashes[2]. This is idealistic, and it is unreasonable to expect it because of possible server crashes.

Most RPC systems implement and support at-most-once call semantics; some provide merely at-least-once semantics. The choice of supported semantics affects the kind of procedures that the application programmer is allowed to write.

Idempotent procedures: These are procedures that may be executed multiple times instead of just once without any harm being done, i.e., with the same net effect. Such procedures may be written regardless of which call semantics are supported.

Non-idempotent procedures: These procedures must be executed only once to obtain the desired net effect. If executed more than once, they produce a different result from that expected (e.g., a procedure that appends data to a file). Such procedures may be written only if at-most-once semantics are provided by the RPC system. To allow the writing of such procedures, it is therefore very desirable that the RPC system support at-most-once semantics.
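The distinction is easy to see in code. In the sketch below (an added illustration with made-up names), append_record is non-idempotent, so a duplicate request caused by an at-least-once retry changes the outcome, while write_record, which names its target slot explicitly, can safely be re-executed.

#include <stdio.h>
#include <string.h>

#define NRECS 16
static char records[NRECS][32];
static int  next_free = 0;

/* Non-idempotent: each execution appends another copy, so a duplicate
 * request produced by an at-least-once retry changes the outcome.      */
static void append_record(const char *data) {
    if (next_free < NRECS)
        strncpy(records[next_free++], data, sizeof records[0] - 1);
}

/* Idempotent redesign: the caller names the slot explicitly, so
 * re-executing the same request leaves the data in the same state.     */
static void write_record(int index, const char *data) {
    if (index >= 0 && index < NRECS)
        strncpy(records[index], data, sizeof records[0] - 1);
}

int main(void) {
    append_record("hello");   append_record("hello");    /* duplicated retry: 2 copies */
    write_record(5, "world"); write_record(5, "world");  /* duplicated retry: no harm  */
    printf("appended %d copies; slot 5 holds \"%s\"\n", next_free, records[5]);
    return 0;
}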
7.3.6 Parameter Passing

In normal local procedure calls, it is valid and reasonable to pass arguments to procedures either by value or by reference (i.e., by pointers to the values), because all local procedures share the same address space on the host where they execute. In remote procedure calls, however, the address spaces of the local and remote procedures are not shared, so passing arguments by reference to remotely executing procedures is, in general, invalid and pointless. Most RPC systems with non-shared remote address spaces therefore insist that arguments to remote procedures be passed by value only. Only a few RPC systems for closed systems, where the address space is shared by all processes in the system, allow arguments to be passed by reference.

Passing Long Data Elements

Passing large data structures, such as large arrays, by value may not be possible given the limitations on packet size at the transport level. One solution, suggested by Wilbur and Bacarisse, is to use dedicated server processes that supply parts of a large argument as and when needed. For this purpose the client must be able to specify to the remote procedure the identity of such a server. This may be done by passing a server handle, with which the remote procedure can use the server (within the client itself) to obtain its large parameters. This obviously involves nested remote procedure calls, with the server making a nested remote call back to the client, as shown in Figure 7.3.

Figure 7.3: Nested call-back for a remote procedure: the client passes a handle in its call, the server issues a call-back to fetch part of the large parameter of the first client call, and replies with the results once it has them. This can be used as a possible solution to the problem of passing large parameters to remote procedures.

Passing Parameters by Reference

Although most current RPC systems disallow passing parameters by reference because the client and server do not share an address space, this need not necessarily be so. If remote procedure calls are to gain popularity with programmers, it is desirable that they imitate local calls as much as possible, and hence that an RPC system support calls in which parameters are pointers or references. Sometimes it is possible to approximate the net effect of passing a pointer parameter to a remote procedure. Tanenbaum suggests one way of doing this when the client stub knows the size of the data structure that the pointer refers to. In that case, the client can pack the data structure into the message sent to the server. The server stub then invokes the remote procedure not with the original pointer but with a new pointer to the data structure carried in the message. The data structure is then passed back to the client stub so that the client learns of any changes made to it by the server. The performance of this operation can be improved if the client stub also knows whether the reference is an input or an output parameter of the remote procedure. If it is an output parameter, the buffer it points to need not be sent by the client stub at all; only the server packs the data structure on the return. On the other hand, if the client stub knows that the reference is an input parameter only, it can tell the server stub, which then refrains from sending the data structure back to the client, since it has not been modified. This effectively improves the communication performance by a factor of two. The above discussion is only valid if the client stub indeed has the required information about reference parameters, which can be achieved by specifying the format of the parameters when the client stub is formally specified, perhaps using the parameter description language for stubs described earlier.

7.3.7 Server Management

This issue addresses how the servers in the RPC system are managed, since this directly affects the semantics of a procedure call. There are three different strategies that can be used to manage RPC servers: the static server, the server manager, and the stateless server.

Static Server: This is the simplest approach; the client arbitrarily selects a server to use for its calls[2]. The difficulty is that each server must maintain state information for several clients concurrently, since there is no guarantee that when a client makes a second call it will be allocated the same server as for the first call. These servers may not be dedicated to a single client; that is, a server may be required to serve several clients in an interleaved fashion.
This introduces the additional difficulty that each server must maintain the state of each client separately and concurrently. If this were not so, sharing a remote resource among several clients would become impossible, since the different clients might attempt to use the resource in conflicting ways.

Server Manager: This approach uses a server manager to select which server a particular client will interface with[2]. When the client makes a remote call, the binder returns the address of a server manager instead of the address of a server. When the client calls the server manager, the manager returns the address of a suitable server, which is thereupon dedicated to the client. The main advantage of using a server manager is that load balancing is not an issue, as each server serves only one client. Also, because the servers are dedicated, each server needs to maintain state information for only one client.

Stateless Server: This is a less common approach to server management, in which each call to the server results in a new server instance[2]. No state information is retained between calls, even from the same client. State information therefore has to be passed to the server in the request message, which adds to the overhead of the call.

Many RPC implementations use the static server, since it is usually too expensive to create a server per client in a distributed system. The server must then be designed so that interleaved or concurrent requests can be serviced without affecting each other. There are two types of delays that a server should be concerned with: local and remote delay[2]. The local delay occurs when a process cannot execute a call because a resource is currently in use. This degrades system performance if the server cannot service incoming requests during the delay; it occurs in servers designed with a single server process, which allows only one service call at a time. The designer of an RPC system must therefore produce an efficient server implementation that services multiple requests concurrently.

One possible solution is to design the server with one process for receiving incoming requests from the network and several worker processes designated to actually execute the procedures[1, ?]. The distributor and worker processes are lightweight processes that all share the same address space. This implementation of the server is illustrated in Figure 7.4. The distributor polls for incoming messages in a loop and puts them into a queue of message buffers. The message buffers are implemented in shared memory, so the distributor as well as the worker processes can access them. When a worker process is free, it extracts an entry from the message queue and starts executing the required procedure. When done, the worker process replies directly to the calling client process, obtaining the client's address from the message buffer entry. If there is another pending entry in the queue, the worker commences on that; otherwise it temporarily suspends execution. The queue of message buffers must be controlled by a monitor. The functions of the monitor typically include providing mutual exclusion on the shared variables, memory, and resources, including the message queue. It ensures that there is no conflict between the distributor putting entries into the message queue and the worker processes extracting entries from it, and it awakens worker processes that may be suspended waiting for access to a shared resource.

At the client side, incoming replies are associated with their clients as follows: when a client makes a call, it inserts a unique message id into the request message; the server merely copies this id into the reply message; and the listening process at the client routes the reply to the appropriate client based on the unique message id. This scheme therefore requires that the clients on a given machine be able to generate identifiers that are unique among themselves.

Figure 7.4: Implementing a server with lightweight processes: a distributor process places incoming messages into shared-memory message buffers, from which several worker processes extract and service requests; shared variables coordinate access.
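A minimal sketch of this distributor/worker organization is shown below, using POSIX threads in place of lightweight processes and a mutex/condition-variable pair in place of the monitor; all names and sizes are illustrative, and no real network reception is performed.

#include <pthread.h>
#include <stdio.h>

#define QSIZE    8
#define NWORKERS 3
#define NMSGS    6

typedef struct { int msg_id; int client; int arg; } request_t;

static request_t queue[QSIZE];
static int head = 0, tail = 0, count = 0, done = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == 0 && !done)
            pthread_cond_wait(&nonempty, &lock);   /* suspend until work arrives */
        if (count == 0 && done) { pthread_mutex_unlock(&lock); return NULL; }
        request_t r = queue[head]; head = (head + 1) % QSIZE; count--;
        pthread_mutex_unlock(&lock);
        /* "Execute the procedure" and reply directly to the client,
         * copying the unique message id into the reply.                */
        printf("worker: reply to client %d, msg %d, result %d\n",
               r.client, r.msg_id, r.arg * 2);
    }
}

int main(void) {
    pthread_t w[NWORKERS];
    for (int i = 0; i < NWORKERS; i++) pthread_create(&w[i], NULL, worker, NULL);

    /* Distributor: "receive" incoming requests and queue them. */
    for (int i = 0; i < NMSGS; i++) {
        pthread_mutex_lock(&lock);
        queue[tail] = (request_t){ .msg_id = 100 + i, .client = i % 2, .arg = i };
        tail = (tail + 1) % QSIZE; count++;
        pthread_cond_signal(&nonempty);
        pthread_mutex_unlock(&lock);
    }
    pthread_mutex_lock(&lock); done = 1; pthread_cond_broadcast(&nonempty);
    pthread_mutex_unlock(&lock);
    for (int i = 0; i < NWORKERS; i++) pthread_join(w[i], NULL);
    return 0;
}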
7.3.8 Transport Protocol

An important design consideration for an RPC system is selecting a suitable transport protocol for the transfer of data and control between the client and the server. One important criterion is to choose the protocol that minimizes the time between initiating a procedure call and receiving its results[1]. RPC mechanisms are usually built on datagram communication, since RPC messages are mostly short and connection establishment is undesirable overhead, especially when local area networks are reliable[4]. Generally speaking, one can identify three RPC transport protocols: the Request protocol, the Request/Reply/Acknowledgement protocol, and the Request/Reply protocol[4]. The Request protocol is used when no value is to be returned to the client and the client requires no confirmation that the procedure was executed. This protocol supports "maybe" semantics and is the least used RPC protocol because of its limited flexibility. The Request/Reply/Acknowledgement protocol requires the client to acknowledge the reply message from the server. This protocol is rather inefficient, since the acknowledgement adds little: once a client receives the reply with the results of the call, the call is complete. This property of RPC makes the Request/Reply protocol the most widely used of the three.

Another important feature of the transport protocol is its ability to detect failures and server crashes. One simple mechanism is for the protocol to send a probe message to the server during communications that require an acknowledgement[1]. If the client periodically sends probes and the server responds to them, the server is operational; if the server does not respond, the assumption can be made that it has crashed and an exception should be raised. Another extension the protocol should have is the ability to send messages larger than one datagram; that is, it should support a multiple-packet transfer mechanism.

Two of the more popular transport protocols in use today, offered in the 4.2BSD version of UNIX, are the User Datagram Protocol (UDP) and the Transmission Control Protocol (TCP)[4]. TCP is a connection-based mechanism that requires the communicating processes to establish a connection before transmission occurs, which causes additional communication overhead when transmitting messages.
The mechanism is mainly used in RPC system when extremely high reliability is important, multiple execution of procedures would cause problems and multiple packet transfers are frequent[5]. On the other hand, the UDP communication mechanism is a datagram connectionless-based protocol. It does not need to establish a connection between the communicating processes. UDP is popular in UNIX based RPC systems in which multiple execution of procedures will not cause a problem, the servers support multiple clients concurrently, and have mostly single packet transfers[5]. The RPC system desinger are normally faced with three options to choose a transport protocol. 1. Use an existing transport protocol (such as TCP or UDP) and an existing implementation of this protocol. 2. Use an existing transport protocol, but develop an implementation of this protocol which is specialized for the RPC system. 3. Develop an entirely new transport protocol which will serve the special requirements of an RPC system. The rst approach is the simplest, but it will provide the poorest performance. The second approach is considered when high performance is the prime issue, the third option is really the only viable one. The argument put forward by Birrell and Nelson in support of the last option is the following. Existing protocols such as TCP are suitable for data transfer involving large chunks of data at a time in byte stream. Draft: v.1, April 4, 1994 294 CHAPTER 7. REMOTE PROCEDURE CALLS Based on experiments conducted by them using existing protocols, they conclude that for short, quick request-response type of communication involved in RPC systems it is desirable that a new faster protocol be developed specially for the RPC system. The future trend in the design of RPC systems will predictably be in designing lightweight transport protocols. Signicant performance improvement can be obtained if the remote call and reply messages can be sent by bypassing the existing complex protocol at the transport and data link layers and instead implementing simple, light high-speed protocols for message communication as discussed in Chapter 4. 7.3.9 Exception Handling So far we have been able to model remote procedure calls fairly close to local procedure calls semantically. However, this is true for as long as no abnormal events occur either at the client or server end or in the network in-between. The fundamental dierence between remote calls and local calls becomes apparent when errors occur. There are a lot of entities, both hardware and software, that interact when a remote call is made. A fault in any of these can lead to a failure of the remote call. Some possible obvious errors are discussed below. Lost remote call request from client stub The client stub sends out a message to a server for a remote call and the request is lost possibly due to a failure in the network transmission. This problem can be solved if the client stub employs a timer and retransmits the request message after a time-out. Lost reply from sever stub The reply message from the server stub to the client stub after successful execution of the remote procedure may be lost before it reaches the client stub. Reliance on the client stub's timer to solve this problem is not a reasonable solution if we consider that the remote call is a non-idempotent procedure. We cannot allow the procedure to be executed several times for each retransmission from the client before the client successfully gets a reply from the server. 
A solution to this problem is for the client stub to tag each transmission with a sequence number. If the server has already executed a procedure and later sees a retransmission whose sequence number is not larger than the one it has handled, it refrains from re-executing the procedure. This approach results in at-most-once semantics for the RPC system.

7.3.10 Server Crash

The server may crash either before it ever receives the request message or after having executed the procedure but before replying. The unfortunate part is that the client has no way of knowing which of the two cases applies; in either case the client will time out. This may be approached in two ways. The first is to allow the server to reboot, or to locate an alternate server through the binder, and to retransmit the request message. This results in at-least-once semantics and is unsuitable for RPC systems that want to support non-idempotent remote procedure calls. The second option is for the client stub not to retransmit after a time-out and merely to report the failure back to the client. This ensures at-most-once semantics, and non-idempotent procedure calls pose no problem.

Client Crash

It is possible that a client crashes after it has sent a request message for a remote call but before it receives the reply. Such a server computation has been labelled an orphan: an active computation with no parent to receive its reply. Orphans waste processing resources and, more importantly, they may tie up valuable resources in the server. This is probably the most troublesome of all the possible failures in an RPC system, as it is rather difficult to handle in an elegant and effective manner. Several authors have differing views on how to approach this problem. One of the more reasonable approaches, suggested by Nelson, is that when the client reboots after a crash, it should broadcast a message to all other machines on the network indicating the start of a new "time frame of computation," termed an epoch. On receiving an epoch broadcast message, each machine attempts to locate the owners of any remote computations executing on it; if no owner is found, the orphan is killed. Killing an orphan, however, creates problems with respect to allocated resources. For example, an orphan may have opened files based on file handles provided by the client. If these are closed when the orphan is killed, the question arises of what happens if a file was also being used by some other server invoked by the same client. Another suggestion, made by Shrivastava and Panzieri, is that servers should be made atomic: the server either executes from start to finish with a successful reply to the client, or it does nothing at all. This all-or-nothing approach is conceptually good but leads to difficulties in implementation; for example, the server would be required to maintain enough information to undo its processing if it discovers that it has become an orphan.

Figure 7.5: Non-blocking RPC: the client issues a call with parameters and is free to continue execution, since the remote procedure returns no reply and the server sends none at the end of its execution. This can potentially be used as a powerful tool for developing distributed applications with a high degree of concurrency.

7.3.11 Non-blocking RPC

The RPC model discussed so far has assumed a synchronous, blocking mechanism.
Some RPC systems however do support a non-blocking model, most notably the Athena project at MIT. Non-blocking RPC is conceptually possible for remote calls that do not return any reply to the client. This is shown in Figure 7.5. The client is free to continue execution after a call to remote procedure that does not return a reply. This feature can be used as a poweul tool to develop distributed applications with a high degree of concurrency. Once a call to a remote procedure is made, the client is able to continue further processing concurrently with the execution of the remote procedure. One way of achieving this is if the client stub that makes the remote calls is aware of whether the remote procedure will be returning a reply or not. It is not sucient for the client stub to perform a non-blocking remote call for every procedure that does not return a reply. It becomes the programmer's responsibility to specify whether or not a call to a remote procedure should be blocking or nonblocking at time of dening the remote procedure using a procedure interface language described earlier. Non-blocking RPC can potentially become a very powerful tool in the development of distributed applications with a high degree of concurrency. This is particularly true if it can be combined with the concept of nested remote procedure calls. Draft: v.1, April 4, 1994 7.3. RPC DESIGN ISSUES 297 7.3.12 Performance of RPC The performance loss in using RPC is usually of a factor of 10 or even more. However, as shown by performance indicators from dierent experimental RPC systems, this depends greatly on the design decisions incorporated into the RPC system. Typically trade-os need to be made between enhanced reliability and speed of operation when designing the system. The performance of RPC systems can be analyzed in terms of the time spent during the call time. Wilbur and Bacarisse present a look at the time components that go into making a remote call and are as follows: 1. parameter packing 2. transmission queuing 3. network transmission 4. server queuing and scheduling 5. parameter unpacking 6. execution 7. results packing 8. transmission queuing 9. network transmission 10. client scheduling 11. results unpacking These parameters can be combined into four paramters: parameter transformations, network transmission, execution and operating system delays. Parameter transformations are an essential part of the RPS mechanism when dealing with heterogeneous RPC systems. However these can be optimized as discussed under the discussion on data representations. For large parameters, network transmission time may well become a bottleneck in RPC systems. This is especially true for most current RPC systems as most of them are networked using slow 10 Mbits/s Ethernet. However if the physical transmission medium were to be changed to optic transmission media, we can expect a signicant improvement in RPC performance. However, for remote calls with very small or even no parameters, the bottleneck is indisputably the operating system overhead. This includes page swapping, network queuing, routing, Scheduling etc. delays. Draft: v.1, April 4, 1994 298 CHAPTER 7. REMOTE PROCEDURE CALLS 7.4 Heterogeneous RPC Systems Distributed system involve increased programming complexity along two dimension, they require dealing with system dependent details of maintaining and communication, and requires dealing with these details across a wide range of languages, systems and networks. 
This section focuses on the main issues that must be addressed during the design of an RPC system that runs on heterogeneous computer systems. Its goal is to radically decrease the marginal cost of connecting a new type of system to an existing computing environment and at the same time to increase the set of common services available to the users. The RPC facility on heterogeneous computer systems attempts to subside three problems. First, one problem of inconvenience, in which individual either must be a user of multiple subsystem or else must accept the consequence of isolation from the various aspects of the local computing environment and that is not acceptable. A second problem is of expense, the hardware and software infrastrncture of computing environment is not eectively amortized, making it much more costly than necessary to conduct specic research on a system best suited for it. A third problem is diminishing research eectiveness, Scientist and engineers should be doing other useful things rather than to hack around heterogeneous computing systems. In this section, we describe the RPC system (HRPC) associated with the Heterogeneous Computer Systems (HCS) project developed at the University of Washington[?]. The main goal of this project is to develop software support for heterogenous computing environment and not to develop software systems that make them act and behave as homogeneous computing environment. Consequently, this approach will lead to reducing the cost of adding new computing systems or resources as well as increasing the set of common resources. The HRPC provides clean interfaces between the ve main components of any RPC system: Stubs Binding Data Representation Transport Protocol Control Protocol In such a system, a client or a server and its associated server can consider each of the remaining four components as a black box. The design and use of an RPC-based distributed application can be divided into three phases: compile time, bind time and call time. In what follows, we discuss how the tasks involved in each phase have been implemented in the HRPC system. Draft: v.1, April 4, 1994 7.4. HETEROGENEOUS RPC SYSTEMS 299 7.4.1 HRPC Call Time Organization The call-time components of RPC facility are the transport protocol, the data representation, and the control protocol. Functions of these components and their interfaces with one another is called Call-time organization. In traditional RPC facility, all decision regarding the implementation of the various components are made at the time of RPC facility design such as transport protocol, control protocol, date representation etc, that will make task rather simplied at run-time. At this point it has to perform only binding, that is to acquire location of the server. The binding typically is performed by the client, and is passed to the stub as an explicit parameter of each call. As for example, DEC SRC RPC system delays the choice of transport protocol until bind time, when the choice of protocol which is invisible to both client and server, it can be made based on availability and performance. The HRPC Call Time Interface As mentioned in[?], the choice of transport protocol, data representation, and control protocols are delayed until bind-time, allowing a wide variety of RPC programs. There are well dene call-time interface between the components of HRPC call time interface. 
In traditional RPC systems, it keeps small interface because at the compile time great deal of knowledge concerning RPC call is been provided to client and server stubs. While in HRPC such information is not available until bind -time. The Control Component The control components has three routines associated with each direction (send or receive) and role (client or server). Distinction is required because control information is generally dierent in request and reply messages. Routines for send are as follows: Call: client uses for sending initial call message to some service. InitCall: performs any initialization necessary prior to beginning the call. CallPacket: perform any functions peculiar to the specic protocol, when it is necessary to actually send a segment of complete message. FinishCall: terminates the call. Routines for receiving are as follows: Request: perform function associated with a service receiving a call. Reply: used when the server replies to the client request. Draft: v.1, April 4, 1994 CHAPTER 7. REMOTE PROCEDURE CALLS 300 Answer: concerned with receiving that reply on the client end. CloseRpc: used by user level routines to notify the RPC facility that its service are no longer needed. Data Representation Components The number and purpose of the data representation procedures are driven by the interface description language (IDL). There is one routine for each primitive type and/or type constructor dened by the IDL. HRPC uses modied Courier as the IDL. Each routine translates an item of a specic data type in host machine representation to/from some standard over-the-wire format. Each routine is capable of encoding and decoding data, which is determined by the state of call. Transport Components The functions of the transport routines are of opening and closing a logical link according to the peculiarity of particular transport protocol, and sending or receiving a packet. Some important routines are as follows: Init: initialize transmission/reception related to a single call message. Finish : terminate transmission/reception of a message. BufAlloc: allocates memory buer for message. Dealloc: deallocates memory buer for message at transport level. This allows transport specic processing to occur in an orderly manner. After dening above components, let's discuss their interfacing with one another as shown in Figure ??. Transport component provides network buer and send and receives in this buers. The control component is responsible for calling transport component routines that initialize for sending or receiving, and for obtaining appropriate buers. The control component uses data representation component to insert protocol specic control information into the message being constructed. Data representation component don't ll or empties data buers because it does not know the distinction between user data and RPC control information, it also never directly access transport functions. Control level function is called to dispose the buer when it is full. Data representation routines don't have information concerning the placement of control information within a message and so it has to call control information. Also data representation routines are ignorant of where they are operating, in client or server context, this distinction may be important to control routines. Draft: v.1, April 4, 1994 7.4. 
HETEROGENEOUS RPC SYSTEMS

Figure 7.6: HRPC call-time organization: on each side, the client or server stub sits above the control, data representation, and transport components, and the two transport components exchange the RPC message (transport, control, and data portions) over the network.

The data representation component communicates with the control component via the routines GetPacket and PutPacket, implemented by AnswerPacket or RequestPacket. These routines isolate the buffer-filling routines from the role context (client or server) in which they operate, although in the process they do learn the direction, sending or receiving.

7.4.2 HRPC Stubs

Client stubs implement the calling side of an HRPC and server stubs implement the called side; both are described in detail in[?].

HRPC Stub Structure

The stub routines have built into them detailed knowledge of the binding structures and of the underlying HRPC components. The stubs issue the sequence of calls to the underlying component routines that implements the HRPC semantics. Client and server stubs are produced by the stub generator upon processing one specific interface specification expressed in the Courier IDL. The input and output parameters used in the call to the user routine are defined within the server stub itself.

HRPC Stub Generator

The purpose of stub routines is to help insulate user-level code from the details and complexity of the RPC runtime system. The interface to an RPC service is expressed in an interface description language; a stub generator accepts this interface description as input and produces appropriate stub routines in some programming language, which are compiled and linked with the actual client and server code. The HRPC system uses a stub generator for an extended version of the Courier IDL. The generated stub code is responsible for marshalling parameters and actually calling the user-written routines; the server stub also contains a dispatcher module for fielding incoming request messages and deciding, based on message content, which procedure within the server interface is being called. The generator allows users to define their own marshalling routines for complicated data types, such as those containing pointer references. The stubs produced for an interface may behave differently depending on the configuration of the call-time components used to make the call, but this does not prevent a client from talking to any server (or vice versa). The use of a "higher level" data representation protocol, in conjunction with an IDL, allows the data content of a message to be mutually comprehensible to stubs written in different languages. The HRPC stubs do not provide direct support for the marshalling of data types involving pointers.

7.4.3 Binding

Binding is the process by which a client becomes associated with a server before the RPC can take place. Before a program can make a remote call, it must possess a handle for the remote procedure; the exact nature of the handle varies from implementation to implementation. The process by which the client obtains this handle is known as binding. A few points about binding are worth mentioning. Binding can be done at any time prior to making the call and is often left until run time. The binding may change during the client program's execution. It is also possible for a client to be bound to several similar servers simultaneously.
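As a rough illustration of what such a handle might carry in an HRPC-like design, the C sketch below records the server's location and interface identity together with per-binding component routines selected at bind time; the structure and field names are hypothetical, not taken from the HRPC implementation.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical binding handle: where the server is, which interface it
 * exports, and the call-time component routines chosen at bind time.   */
typedef struct {
    char     host[64];
    uint16_t port;
    uint32_t program, version;
    uint32_t interface_checksum;   /* detects client/server interface mismatch */

    /* The stubs invoke the components only through these pointers, so
     * different transports and data representations can be plugged in
     * per binding without recompiling the stubs.                        */
    struct {
        int (*send_packet)(void *priv, const void *buf, size_t len);
        int (*recv_packet)(void *priv, void *buf, size_t len);
        void *priv;                /* component-private state */
    } transport;
    struct {
        size_t  (*encode_int)(unsigned char *out, int32_t v);
        int32_t (*decode_int)(const unsigned char *in);
    } data_rep;
} binding_t;

int main(void) {
    binding_t b = {0};   /* a client would obtain this from the binder */
    (void)b;
    return 0;
}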
Binding in Homogeneous Systems Homogeneous environment involves the following steps for binding: Naming: In a view of the binding sub-system, Naming is a process of mapping client specied server name to the actual network address of the host on which Draft: v.1, April 4, 1994 7.4. HETEROGENEOUS RPC SYSTEMS 303 the server resides, and also provide some kind of identication aid for detecting any interface mismatch. Activation The next step after Naming is activation. In some RPC design is of creating a server process. In some system it is been generated automatically or assumed to be active. Port Determination Port is an addressable entity. Network address of the host is not sucient for addressing server, because multiple server might be active on a single host. So each server is allocated its own communication port, which can be uniquely identied with together with network address. Thus client stub's outgoing message and server stubs reply message can use such location information. Binding in Heterogeneous Systems Basically heterogeneous system consists of many dierent kind of machines connected via network. The complications that result from the heterogeneity are as follows: In Heterogeneous system choice of RPC components are not xed at implementation time but they are selected dynamically. Hence HRPC binding must perform additional processing for such selection. this task involves rst to select components and then conguring the binding data struc ture accordingly. HRPC binding must proceed in a manner that can emulate the binding protocols for each of the system being accommodated. let's review some of the common mechanism for Naming, Activation and Port Determination, to accommodate variety of binding protocols in Heterogeneous Computer Systems. Naming: Naming information is typically replicated for high availability and low access latency. Common implementation techniques include a variety of name service and replicated le schemes. Activation: In some systems, user is responsible for pce-activating server process. In other systems, server processes are auto-activated by the binding mechanism at import time. Port Determination: Common implementation techniques here are more diverse than for either naming or activation, due to varying assumption about the volatility of the information. Some common techniques used are as follows: Draft: v.1, April 4, 1994 CHAPTER 7. REMOTE PROCEDURE CALLS 304 { Well-Known Ports: In this technique, clients determine which port num- bers to be use based on a widely published document or le. Programmer who wish to build a widely exported server must apply for a port number from centrally administered agency. For less widely exported services, a xed port is often taken from range of uncontrolled ports with this information" hard coded" in to both the client and the server. { Binding Agent: In this technique, a process resides at a well-known port on each host and keeps track of which port numbers are assigned to servers running on that host. On such systems, the process of server export includes telling the binding agent which port was allocated to the server. The import side of this scheme can take several forms. i.e. 
the Sun RPC binding subsystem contacts a binding agent on the server host and queries that agent for the port number.

{ A Server-Spawning "Daemon" Process: In this technique, a process resides at a well-known port on each host and handles server import requests by creating a new port, forking off the appropriate process, passing it the active port, and telling the client's import subsystem the new port number.

Structure of HRPC Binding

As mentioned in[?], the binding adopted by the HRPC facility has to be richer in information than that of a traditional RPC facility, because the particular choice of each of the three components, namely the transport component, the control component, and the data representation component, is delayed until bind time. The HRPC binding also has to include information about the location of the server and how to interact with it. Each of the three call-time components is pointed to by a separate block of procedural pointers in the HRPC binding, and the component routines are called through these pointers. This indirection allows the actual implementation of the component routines to be selected at bind time. Information specific to a particular component is stored in a private area associated with that component. An application program never accesses the component structures directly; it deals with the binding through HRPC system calls, acquiring and discarding it as an atomic type.

HRPC Binding Process

This section discusses how the HRPC binding process accommodates the identified areas of heterogeneity, namely naming, activation, port determination, and component configuration, to build the binding data structure needed to allow calls to proceed. First consider the HCS Name Service (HNS), shown in Figure 7.7.

Figure 7.7: Basic HNS structure: an HCS-capable client imports through the HNS, which accesses the other name services directly in response to the import; an insular server exports through its own name service.

Each of the systems in the HCS network has some sort of naming mechanism, and the HNS accesses the information in these name services directly, rather than providing a brand new name service in which the information from all the existing name services must be reregistered. The HNS thus provides a global name space from which a server's information can be accessed; it maps HNS names onto the names stored in the underlying name services. This design allows newly added insular services to be available immediately to the HCS network, because insular systems can evolve without consulting the HNS or having to register their presence directly with it, eliminating some of the consistency and scalability problems inherent in a reregistration-based naming mechanism. HCS-capable services are also available to insular clients, although more effort is required, since a single HCS-capable export operation can potentially involve placing information into each system being accommodated. There are four possible client-server situations:

Case 1: Both client and server are insular. HRPC does not provide support for this situation, because it would eventually involve modifying the native RPC systems. In this situation the client and server either can communicate directly or they cannot communicate at all. There are, of course, ways to build a gateway for such communication using HRPC, but that runs into the problem mentioned above.

Case 2: The client is insular and the server is HCS-capable. At import time, whatever information is needed by the client's import subsystem will be made available in the client's environment by the HRPC runtime system.
It mainly involves policy issues for the HNS and component conguration. Case 3: The client is HCS-capable and server is insular. This is more complex situation. At export time, it is up to HCS-capable client to extract whatever information the server's export subsystem placed into its environment at export time. It involves accommodating all four of the areas of binding heterogeneity that are identied as Naming, Activation, Port determination and Component conguration. This situation is also more interesting because it allows HCS programs to take advantage of the pre-existing infrastructure in the network, which provides substantial number of services. Case 4: Both client and server are HCS-capable: This situation is a special case of the case 2 and case 3. The binding process follows the following steps: To import an insular server, and HCS-capable client, it is required to specify two part string name containing the type : File service and instance: host name of the desired service. HRPC binding subsystem will rst query the HNS to determine the naming information associated with the server. This information consists of a sequence of binding descriptor records. A binding descriptor consists of a designator indicating which control component, data representation component and transport component the service uses, a network address, a program number, a port number and a ag indicating whether the binding protocol for this particular server involves contacting a binding agent. Fill in the parts of binding data structure; ll combination of control protocol, data representation and transport protocol that is understood by both the server and client. The procedural pointers are now points to routines to handle that particular set of components. Draft: v.1, April 4, 1994 7.5. SUN RPC 307 7.5 SUN RPC This section describes the main RPC issues in terms of Sun RPC implementation. Also described is the development process of a Sun RPC application. Sun RPC is the primary mechanism for network communication within a Sun network. All network services of the SunOS are based on Sun RPC and External Data Representation, XDR. XDR provides portability. The Network File System, NFS, is also based on Sun RPC/XDR. Each RPC procedure is uniquely dened with a program number and procedure number. The program number represents a group of remote procedures. Each remote procedure has a dierent procedure number. Each program also has a version number. When a program is updated or modied, a new program number does not need to be assigned, instead the version number is updated for the existing program number. The portmap is Sun's binding daemon which maintains the mapping of network services and Internet addresses (ports). There is a portmap daemon on each machine in the network. The portmap daemons reside at known (dedicated) port, where they eld requests from client machines for server addresses. There are three basic steps to initiate an RPC: 1. During initialization, the server process calls its host machine's portmap to register its program and version number. The portmap assigns a port number for the service program. 2. The client process then obtains the remote program's port by sending a request message to the server's portmap. The request message contains the server's host machine name, the unique program and version number of the service, and the interface protocol (eg, TCP). 3. The client process sends a request message to the remote program's port. 
The server process uses the arguments of the request message to perform the service and returns a reply message to the client process. This step represents one request-reply (RR) RPC cycle. If the server doesn't complete the service successfully, it returns a failure notication, indicating the type of failure. The Sun RPC protocol is independent of transport protocols. The Sun RPC protocol involves message specication and interpretation. The Sun RPC layer does not implement reliability. The application must be aware of the type of transport protocol underneath Sun RPC. If the application is running on top of an unreliable transport such as UDP, the application must implement its own retransmission and Draft: v.1, April 4, 1994 308 CHAPTER 7. REMOTE PROCEDURE CALLS timeout policy, since the RPC protocol does not provide it. The RPC protocol can, however set up the timeout periods. Sun RPC does not use specic call semantics since it is transport independent. The call semantics can be determined from the underlying transport protocol. Suppose, for example, suppose the transport is unreliable UDP. If an application program retransmits an RPC request message after a time out and the program eventually receives a reply, then it can infer that the remote procedure was executed at least once. On the other hand, suppose, the transport is a reliable TCP, a reply message implies that the remote procedure was executed exactly once. Sun RPC Software is supported on both UDP and TCP transports. The transport selection depends on the application programs needs. UDP may be selected if the application can live with at-leastonce call semantics, the message sizes are smaller than the UDP packet size (8Kbytes), or the service is required to handle hundreds of clients. Since UDP does not keep client state information, it can handle many clients. TCP may be selected if the application needs high reliability, at-mostonce call semantics are required, or the message size exceeds 8 Kbytes. Sun RPC Software uses eXternal Data Representation (XDR), which is a standard for machine-independent message data format. This allows communication between a variety of host machines, operating systems, and/or compilers. Sun RPC Software can handle various data structures regardless of byte orders or layout conventions by always converting them to XDR before sending the data over the network. Sun calls the marshalling process of converting from a particular machine representation to XDR format serializing. Sun calls the reverse process deserializing. XDR is part of the presentation layer. XDR uses a language to describe data formats. It is not a programming language. The XDR language is similar to the C language. It is used to describe the format of the request and reply messages from the RPCs. The XDR standard assumes that bytes are portable. This can be a problem if trying to represent bit elds. Variable length arrays can also be represented. Sun provides a library of XDR routines to transmit data to remote machines. Even with the XDR library, it is dicult to write application routines to serialize and deserialize (marshal) procedure arguments and results. Since the details of programming applications to use RPCs can be time consuming, Sun has provided a protocol compiler called rpcgen to simplify RPC application programming. There are three basic steps to develop an RPC program: 1. Protocol specication 2. Creation of server and client application code 3. 
Compilation and linking of library routines and stubs Draft: v.1, April 4, 1994 7.5. SUN RPC 309 The rst step in the process of developing an RPC program is dening the clientserver interface. This interface denition will be an input into the protocol compiler rpcgen, so the interface denition uses RPC language, which is similar to C. This interface denition le contains a denition of the parameter and return argument types, the program number and version, and the procedure names and numbers. The rpcgen protocol compiler inputs a remote program interface denition written in RPC Language, RPCL. The output of rpcgen is Sun RPC Software. The Sun RPC Software includes client and server stubs, XDR lter routines and a header include le. The client and server stubs interface with the Sun RPC library, removing the network details from the application program. The server stub supports inetd, therefore, the server can be started by inetd or from the command line. rpcgen can be invoked with an option to specify the transport to use. An XDR lter routine is created for each user dened type in the RPCL (RPC Language) interface denition. The XDR lter routines handle marchalling and demarchalling of the user dened type into and from an XDR format for network transmission. The header include le contains the parameter and return argument types and server dispatch tables. Although rpcgen is nice in that it can provide most of the work for you, in some cases it can be overly simplistic. rpcgen may not provide a needed service to a more complicated application program. If this is the case, rpcgen can still be used to provide a 'starting point' for the low level RPC code. Or, a programmer can create an entire RPC program application without using rpcgen at all. The second step in the process of developing an RPC program is creating the server and client application code. For the server, the service procedures specied in the protocol are created. Developing the client application code involves making remote procedure calls for the remote services. These procedure calls will actually be local procedure calls to client stub procedures, which coordinate the communication and activation of the remote server. Prior to making these procedure calls, the client application code must call the server machine's portmap daemon to obtain the port address of the server program and to create the transport connection as either UDP or TCP. Note that the application code can be written in a language other than C. The last step in the process of developing an RPC program is compilation of the client and server programs and linking the application code with the stubs and the XDR lter routines. The client application code, the client stubs, and the XDR lter routines are compiled and linked together creating an executable client program. The server application code, server stubs and XDR lter routines are compiled and linked to obtain an executable server program. The server executable can then be started on a remote machine. Then the client executable can be run. The development of the RPC program is complete. Draft: v.1, April 4, 1994 310 CHAPTER 7. REMOTE PROCEDURE CALLS 7.5.1 SUN RPC Programming Example This subsection provides all the information regarding SUN RPC Library and how to develop distributed applications using this library. Remote Procedure Call Library Interprocess communication mechanism provided by RPC library for communicating between the local and remote process is message passing. 
Accordingly, the arguments are sent to the remote procedure in a message, and the results are passed back to the calling program in a message. The RPC Library handles this message-passing scheme, so the application need not worry about how messages get to and from the remote procedure. The RPC Library delivers messages using a transport, which provides communication between applications running on different computers. One problem with passing arguments and results in a message is that differences between the local and remote computers can lead to different interpretations of the same message. The XDR routines in the RPC Library provide a mechanism for describing the arguments and results in a machine-independent manner, allowing you to work around any differences in computer architectures. The RPC Library uses the client/server model. In this model, servers offer services to the network, which clients can access. Another way to look at the model is that servers are resource providers and clients are resource consumers; examples include file servers and print servers. There are two types of servers, stateless and stateful. A stateless server does not maintain any information, called state, about its clients, whereas a stateful server maintains client information from one remote procedure call to the next. For example, a stateful file server maintains information such as the file name, open mode, and current position after the first remote procedure call, so the client need only pass the file descriptor and the number of bytes to read. In contrast, a stateless file server maintains no information about its clients, and each read request must carry all the information needed: the file name, the position within the file at which to begin reading, and the number of bytes to read. Although stateful servers are more efficient and easier to use than stateless servers, stateless servers have an advantage in the presence of failures: when a server crashes, recovery is much simpler if the server is stateless than if it is stateful. Depending on the application, one is better than the other. The RPC Library has many advantages. It simplifies the development of distributed applications by hiding all the details of network programming. It provides a flexible model that supports a variety of application designs. It also hides operating system dependencies, making applications built on the RPC Library very portable.

Figure 7.8: Example 32-bit integer representations. The figure contrasts big-endian byte order (System/370, SPARC), in which the most significant byte has the lowest address, with little-endian byte order (VAX, 8086), in which the most significant byte has the highest address, and marks the position of the sign bit.

eXternal Data Representation (XDR)
Different computers have different architectures and run different operating systems, and thus they often represent the same data types differently. For example, Figure 7.8 shows the representations of 32-bit integers on four computers. All four use two's complement notation with the sign bit adjacent to the most significant magnitude bit. In the representation used by the System/370 and the SPARC, the most significant bit is in the byte with the lowest address (big-endian representation), while in the 80x86 and VAX representations the most significant bit is in the byte with the highest address (little-endian).
Because of this integers are not portable between little-endian and big-endian computers. Byte ordering is not the only incompatibility that leads to non-portable data. Some architectures represent integers as one's complement, oating point representations may vary between dierent computers. Because of this diversity, there is a need for standard representation of data which is machine independent and which enables data to be portable. XDR provides such a standard. This section describes XDR in detail and explains its uses in RPC. XDR Data Representation XDR is a standard for the description and encoding of data. It is useful for transferring data between computers with dierent architectures and running dierent operating systems. XDR standard assumes that bytes are portable. Draft: v.1, April 4, 1994 312 CHAPTER 7. REMOTE PROCEDURE CALLS XDR standard denes a canonical representation of data, viz. a single byte order (big endian), single oating point representation (IEEE), and so on. Any program running on any computer creates a portable data by converting it from the local representation to XDR representation, which is machine independent. Any other computer can read this data, by rst converting it from XDR representation to its local representation. Canonical representation has a small disadvantage. When two little endian computers are communicating, the sending computer will convert all the integers to big-endian (which is standard) before sending and the receiving computer will convert from big-endian to small-endian. These conversions are unnecessary as both of them have the same byte order. But this conversion overhead is very small when compared to the total communication overhead. An alternative to the canonical standard is to have multiple standards plus a protocol that species which standard has been used to encode a particular piece of data. In this approach, the sender precedes the data by a 'tag' which describes the format of the data. The receiver makes conversions accordingly, hence this approach is called 'receiver makes it right'. The data types dened by XDR are a multiple of four bytes in length. Because XDR's block size is four bytes, reading a sequence of XDR-encoded objects into a buer results in all objects being aligned on four-byte boundaries provided that the buer itself is so aligned. This automatic alignment ensures that they can be accessed eciently on almost all computers. Any bytes added to an XDR object to make its length a multiple of four bytes are called ll bytes; ll bytes contain binary zeros. XDR Library XDR library is a collection of functions that convert data between local and XDR representations. The set of functions can be broadly divided into two - those that create and manipulate XDR streams, and those that convert data to and from XDR streams. XDR stream is a byte stream in which data exists in XDR representation. XDR lter is a procedure that encodes and decodes data between local representation and XDR representation and reads or writes to XDR streams. The following details more about XDR streams and XDR lters. XDR streams The XDR stream abstraction make XDR lters to be media inde- pendent. XDR lter can read or write data to the XDR stream where this stream can reside on any type of media - memory, disk etc. Filter interacts with XDR streams, whereas XDR streams interact with actual medium. There are three kinds of streams in XDR Library, viz. standard i/o, memory and record streams. 
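Before looking at the stream operations and each kind of stream in detail, the following short, self-contained sketch may help fix the ideas. It is not taken from the SunOS documentation; it assumes only a system that provides the standard Sun RPC/XDR headers and library (on modern Linux systems this typically means linking against libtirpc). It encodes two integers and a short string into a memory stream, using xdrmem_create() as described below, and then decodes them again, illustrating the four-byte alignment and fill bytes discussed above.

    #include <stdio.h>
    #include <rpc/rpc.h>        /* XDR, xdrmem_create(), xdr_int(), xdr_string() */

    int main(void)
    {
        char buf[128];
        char strbuf[16];
        char *msg = "hi";       /* 2 data bytes, so 2 fill bytes will be added */
        char *msgp = strbuf;
        int a = 7, b = -1, a2, b2;
        XDR enc, dec;
        u_int len;

        /* Encode into XDR form (big-endian, 4-byte aligned) in memory. */
        xdrmem_create(&enc, buf, sizeof(buf), XDR_ENCODE);
        xdr_int(&enc, &a);
        xdr_int(&enc, &b);
        xdr_string(&enc, &msg, sizeof(strbuf));
        len = xdr_getpos(&enc);
        /* 4 + 4 + (4-byte length + 2 data bytes + 2 fill bytes) = 16 bytes */
        printf("encoded length = %u bytes\n", len);

        /* Decode from the same buffer back into the local representation. */
        xdrmem_create(&dec, buf, len, XDR_DECODE);
        xdr_int(&dec, &a2);
        xdr_int(&dec, &b2);
        xdr_string(&dec, &msgp, sizeof(strbuf));
        printf("decoded: %d %d \"%s\"\n", a2, b2, msgp);
        return 0;
    }

Note that each stream created here is unidirectional, so a separate handle is used for encoding and for decoding, exactly as described for memory streams below.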
Valid operations that can be performed on a stream are XDR ENCODE encodes the object into the stream. Draft: v.1, April 4, 1994 7.5. SUN RPC 313 XDR DECODE decodes the object from the stream. XDR FREE releases the space allocated by an XDR DECODE request. Standard I/O Streams A standard I/O stream connects the XDR stream to a le using the C standard I/O mechanism. When data is converted from local format to the XDR format, it is actually written to a le. And when data is being decode, it is read from the le. The synopsis of XDR Library routine used to create a standard I/O stream is void xdrstdio_create(xdr_handle, file, op) XDR *xdr_handle; FILE *file; enum xdr_op op; xdr handle is a pointer to XDR handle which is the data type that supports the XDR stream abstraction. le references an open le. op denes the type of operation being performed on the XDR stream. The standard I/O streams are unidirectional, either an encoding or decoding stream. Memory Streams The memory streams connects the XDR stream to a block of memory. Memory streams can access an XDR-format data structure located in memory. This is most useful while encoding and decoding arguments for a remote procedure call. The arguments are passed through from the calling program to the RPC Library where they are encoded into an XDR memory stream and then passed on to the networking software for transmission to the remote procedure. The networking software receives the results and writes them into the XDR memory stream. The RPC Library then invokes XDR routines to decode the data from the stream into the storage allocated for the data. The synopsis of XDR Library routine used to create a memory stream is void xdrmem_create(xdr_handle, addr, size, op) XDR *xdr_handle; char *addr; u_int size; enum xdr_op op; xdr handle and op are the same as for standard I/O streams. The XDR stream data is written to or read from a bolck fo memory at location addr whose length is size bytes long. Memory streams are also unidirectional. Draft: v.1, April 4, 1994 CHAPTER 7. REMOTE PROCEDURE CALLS 314 Record 0 Length Data Fragment 0 Length Data Fragment 0 Length Data Fragment 0 Length Data Fragment 0 Length Data Fragment 1 Length Data Fragment 31 bits 1 bit: 0= Fragment 1= Last Fragment Figure 7.9: Record Marking Record Streams Records in a record stream are delimited as shown in Figure ??. A record is composed of one or more fragments. Each fragment consists of a four byte header followed by 0 to 231 ; 1 bytes of data. A header consists of two values: a bit that, when set indicates the last fragment of a record and 31 bits specify the length of the fragment data. The synopsis of XDR Library routine used to create a record stream is void xdrrec_create(xdr_handle, sendsize, recvsize, iohandle, readit, writeit) XDR *xdr_handle; u_int sendsize, recvsize; char *iohandle; int (*readit)(), (*writeit)(); This routine initializes the XDR handle pointed to by xdr handle. The XDR stream data is written to a buer of size sendsize and is read from a buer of size recvsize. The iohandle identies the medium that supplies records to and accepts records from the XDR stream buers. This argument is passed on to readit() and writeit() routines. readit() routine is called by the XDR lter when the XDR stream buer is empty and writeit() routine is called when the buer is full. The synopsis of readit() and writeit() which are the same. int Draft: v.1, April 4, 1994 7.5. 
SUN RPC 315 func(iohandle, buf, nbytes) char *iohandle, *buf; int nbytes; iohandle argument is the same as the one specied in the xdrrec create() call and can be a pointer to the standard I/O FILE type. buf is the address of teh buer to read data into for the readit() routine and is the address of the buer to write data from for the writeit() function. Unlike the standard I/O and memory streams, record streams can handle encoding and decoding in one stream. Selection can be done bye setting x op eld in the XDR handle before calling a lter. XDR Filters XDR lters are trifunctional: 1. encode a data type. 2. decode a data type. 3. free memory that a lter may have allocated. Basically the lters can be categorized into three as described below. Primitive Filters The XDR Library's primitive lters are listed in Figure ??. They correspond to the C programming language primitive types. The rst two arguments of all the lters are same no matter what kind of data is being encoded or decoded; primitive lters have only these two arguments. First is a pointer to the XDR stream handle and teh second argument is the address of the data of interest and is referred to as an object handle. The object handle is simply a pointer to any possible data type. Composite Filters In addition to primitive lters, XDR library provides composite lters for commonly used data types. The rst two arguments of composite lters are same as those for primitive lters, a pointer to XDR handle and a pointer to object handle. The additional arguments depend on the particular lter. An example of a composite lter is xdr string() whose synopsis is given below. bool_t xdr_string(xdr_handle, strp, maxsize) XDR *xdr_handle; char **strp; u_int maxsize; This lter translates between strings and their corresponding local representations. The other examples include lters for array, union, pointer etc. Draft: v.1, April 4, 1994 316 CHAPTER 7. REMOTE PROCEDURE CALLS Custom Filters Filters provided by the XDR library can be used construct lters for programmer-dened data types. These lters are referred to as custom lters. An example of custom lter shown below. struct date{ int day; int month; int year; }; bool_t xdr_date(xdr_handlep, adate) XDR *xdr_handlep; struct date *adate; { if (xdr_int(xdr_handlep, &adate->day) == FALSE) return(FALSE); if (xdr_int(xdr_handlep, &adate->month) == FALSE) return(FALSE); return (xdr_int(xdr_handlep, &adate->year)); } RPC Protocol As mentioned before RPC Library uses message-passing scheme to handle communication between the server and the client. The RPC protocol denes two types of messages, call messages and reply messages. Call messages are sent by the clients to the servers requesting them to execute a remote procedure. After executing the remote procedure, the server sends a reply message to the client. All the elds in the message are XDR standard types. Figure 7.10 shows the format of the call message. First eld XID is the transaction identication eld. Client basically puts a sequence number into this eld. This is mainly used to match reply messages to outstanding call messages, it will be helpful when the reply messages arrive out of order. Next eld, message type distinguishes call messages from reply messages. It is 0 for call messages. Next eld is the RPC version number, used to see if the server supports the particular version of RPC. Following that, come the remote program, version and procedure numbers. The identify uniquely the remote procedure to be called. 
Next are two elds, client credentials and client veriers, that identify a client user to a distributed application. Credential identies and a verier authenticates, just the name on an international passport identies the bearer while the photograph authenticates the bearer's identity. Draft: v.1, April 4, 1994 7.5. SUN RPC 317 XID (unsigned) Message Type (integer=0) RPC Version (unsigned=2) Program Number (unsigned) Version Number (unsigned) procedure Number (unsigned) Client Credentials (struct) Client Verifier (struct) Arguments (procedure-defined) Figure 7.10: Call message format Figure 7.11 shows the format of the reply message. Two kinds of reply messages are possible: replies to successful calls and replies to unsuccessful calls. Success is dened from the point of view of RPC Library and not the remote procedure call. Successful reply message has a transaction ID (XID) and message type is set to 1 identifying it as a reply. The reply status eld and accept status eld together distinguish a successful reply from an unsuccessful one. Both elds are 0 in a successful reply. There is a server verier. Final eld in the reply message has the results returned by the remote procedure. Unsuccessful reply messages have the same format as successful replies up to the reply status eld, which is set to 1. The format of the remainder of the elds depends on the condition that made the call unsuccessful. Portmap Network Service Protocol A client that needs to make a remote procedure call to a service must be able to obtain the transport address of the service. The process of translating a service name to its tranport address is called binding to the service. This subsection describes how binding is performed by the RPC mechanism. A transport service provides process to process message transfers across the network. Each message contains network number, host number and port number. A port is a logical communication channel in a host; by waiting on a port, a process receives messages from the network. A sending process does not send the messages directly to the receiving process but to the port on which the receiving process is Draft: v.1, April 4, 1994 318 CHAPTER 7. REMOTE PROCEDURE CALLS XID (unsigned) Message Type (unsigned =1) Reply Status (integer =0) Server Verifier (struct) Accept Status (integer =0) Results (procedure-defined) Figure 7.11: Successful reply message format waiting. A portmap service is a network service that provides a way for a client to look up the port number of any remote server program which has been registered with the service. The portmap program or portmapper is a binding service. It maintains a portmap, a list of port-to-program/version number correspondences on this host. As the Figure 7.12 shows, both the client and the server call the portmapper procedures. Server program calls its host portmapper to create a portmap entry. The clients call portmappers to obtain information about portmap entries. To nd a remote program's port, a client program sends an RPC call message to the server's portmapper; if the remote program is registered on the server, the portmapper returens ther relevant port number in an RPC reply message. The client program can then send RPC call messages to the remote program's port. The portmapper is the only network service that must have an well-known(dedicated) port - port number 111. Other distributed applications can be assigned port-numbers so long as they register their ports with their host's portmapper. 
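Before turning to the individual portmapper procedures, a concrete sketch of this lookup step may be helpful. The following program is not part of the original text; it asks a remote host's portmapper for the UDP port at which a service is registered, using the library routine pmap_getport(). The program and version numbers are the ones used by the SUM example developed later in this section, and the include files assume a conventional BSD-style socket environment.

    #include <stdio.h>
    #include <string.h>
    #include <netdb.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <rpc/rpc.h>
    #include <rpc/pmap_clnt.h>      /* pmap_getport() */

    #define SUM_PROG ((u_long)0x20000000)
    #define SUM_VERS ((u_long)1)

    int main(int argc, char *argv[])
    {
        struct hostent *hp;
        struct sockaddr_in addr;
        u_short port;

        if (argc != 2 || (hp = gethostbyname(argv[1])) == NULL) {
            fprintf(stderr, "usage: getport <server-host>\n");
            return 1;
        }
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(111);              /* portmapper's well-known port */
        memcpy(&addr.sin_addr, hp->h_addr, hp->h_length);

        /* Ask the remote portmapper which port is registered for
         * (program, version, protocol); 0 means "not registered". */
        port = pmap_getport(&addr, SUM_PROG, SUM_VERS, IPPROTO_UDP);
        if (port == 0) {
            fprintf(stderr, "service not registered on %s\n", argv[1]);
            return 1;
        }
        printf("program 0x%lx is at UDP port %u on %s\n",
               (unsigned long)SUM_PROG, port, argv[1]);
        return 0;
    }

The higher-level routines described later in this section, such as callrpc() and clnt_create(), perform this lookup internally, so application code rarely calls pmap_getport() directly.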
Clients and servers query and update a server's portmap by calling the portmapper procedures listed in Table 7.1. At startup, a server program calls the Unset and Set procedures of its host's portmapper: Unset clears a portmap entry if one exists, and Set enters the server program's remote program number and port number into the portmap. To find a remote program's port number, a client calls the Getport procedure of the server's portmapper, and the Dump procedure returns a server's complete portmap. The portmapper's Callit procedure makes an indirect remote procedure call, as shown in Figure 7.12. The client passes the target procedure's program number, version number, procedure number, and arguments in an RPC call message to Callit. Callit looks up the target procedure's port number in the portmap and sends an RPC call message to the target procedure; when the target procedure returns its results to Callit, Callit returns them to the client program.

Figure 7.12: Typical portmapping sequence. (1) The server registers with its host's portmapper; (2) the client requests the server's port from the portmapper; (3) the portmapper returns the server's port to the client; (4) the client calls the server directly.

Table 7.1: Portmapper procedures
Number  Name     Description
0       Null     Do nothing
1       Set      Add a portmap entry
2       Unset    Remove a portmap entry
3       Getport  Return the port for a remote program
4       Dump     Return all portmap entries
5       Callit   Call a remote procedure

7.5.2 RPC Programming
Remote procedure call programming uses the RPC mechanism to invoke procedures on remote computers. Before a remote procedure on a remote computer can be invoked, we need some way to identify that computer. Computers have unique identifiers (host names) that distinguish them from the other computers on the network, and a client uses the host name of the server to identify the computer running the required procedure. In SUN RPC, a remote procedure can accept one argument and return one result. If more than one argument needs to be passed, the arguments are put in a structure and the structure is passed to the remote procedure; similarly, if more than one result needs to be returned, the results are grouped in a structure. Since the representation of data can differ between client and server, the data is first encoded into XDR format before it is sent to the remote computer, and the receiving computer decodes the XDR-formatted data into its local format. Usually the XDR filters in the XDR Library do the encoding and decoding.

Remote Procedure Identification
The remote procedures are organized as shown in Figure 7.13: a network service consists of one or more programs, a program has one or more versions, and a version has one or more procedures.

Figure 7.13: Remote procedure identification hierarchy. A service contains programs, each program contains versions, and each version contains procedures.

A remote procedure is uniquely identified by a program number, a version number, and a procedure number. 0x20000000-0x3FFFFFFF is the range of locally administered program numbers; users can use numbers in this range to identify the programs they are developing. The procedure numbers usually start at 1 and are allocated sequentially.
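For example, a locally administered service might declare its identifying numbers as follows. This is a hypothetical time-of-day service used only for illustration; it does not appear elsewhere in this chapter.

    /* Program number chosen from the locally administered range,
     * one version, and two sequentially numbered procedures. */
    #define TIMEOFDAY_PROG ((u_long)0x20000042)
    #define TIMEOFDAY_VERS ((u_long)1)
    #define GETTIME_PROC   ((u_long)1)
    #define SETTIME_PROC   ((u_long)2)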
The RPC Library provides routines that let you maintain an RPC program number database in /etc/rpc. Each entry in this data base has the name of a distributed service, list of alias names for this service, and the program number of this service. Draft: v.1, April 4, 1994 7.5. SUN RPC 321 Upper level Client Side callrpc() Server Side registerrpc() Server Program svc_run() Client Program lower Lever Clent side Server Side cInt_create() cIntudp_create() cInttcp_create() cIntraw_create() cInt_destroy() cInt_call() cInt_control() cInt_freeres() svcudp_creat() svctcp_create() svcraw_create() svc_destroy() svc_getargs() svc_freeargs() svc_register() svc_unregister() svc_sendreply() svc_getreqset() Transport Library Figure 7.14: RPC library organization This database allows the user to specify the name of a distributed service instead of a program number and use getrpcbyname() to get the program number. This allows the user to change the program number of a distributed service without recompiling the service. 7.5.3 The RPC Library The RPC library is divided into client side and server side as shown in the Figure 7.14. Also the set of routines on each side is divided into upper and lower level routines. The high-level routines are easy to use but are inexible where as low-level routines provide exibility. These routines are discussed in detail below. High Level RPC programming High-Level RPC programming refers to the simplest interface in the RPC library for implementing remote procedure calls, but it is very inexible compared to the lowlevel RPC routines. As shown in the Figure 7.14 the high level routines consist of callrpc(), registerrpc(), svc run(). The detailed description of these routines follows. registerrpc() routine is used for registering a remote procedure call with the RPC Library. Its synopsis is as shown. int Draft: v.1, April 4, 1994 322 CHAPTER 7. REMOTE PROCEDURE CALLS registerrpc(prognum, versnum, procnum, procname, inproc, outproc) u_long prognum, versnum, procnum; char *(*procname()); xdrproc_t inproc, outproc; The rst three arguments are program number, version number and procedure number which identify the remote procedure being registered. The procname is the address of the procedure being registered. The inproc and outproc are the addresses of the XDR lters for decoding incoming arguments and encoding the outgoing results. registerrpc() should be explicitly called to register each procedure with the RPC Library. svc run() synopsis is void svc_run() This routine is the RPC Library's remote procedure dispatcher. This is called by the server after the remote procedure is registered. svc run() routine waits until it receives a request and then dispatches the request to the appropriate procedure. It takes care of decoding the arguments and encoding the results using XDR lters. The synopsis of callrpc() is as follows. int callrpc(host, prognum, vernum, procnum, inproc, in, outproc, out) char *host; u_long prognum, vernum, procnum; xdrproc_t inproc, outproc; char *in, *out; The host identies the remote computer and prognum, vernum, procnum identies the remote procedure. in and out are the addresses of the input arguments and the return values respectively. inproc is the address of the XDR lter that encodes the arguments and outproc is the address of the XDR lter that decodes the return values of the remote procedures. An example of a RPC program written using only high-level RPC routines is given below. 
The rst le is the header le sum.h which has the common declarations for the client and server routines. /****************************************************************** * This is the header file for writing client and server routines * using high-level rpc routines. ******************************************************************/ Draft: v.1, April 4, 1994 7.5. SUN RPC 323 #define SUM_PROG ((u_long)0x20000000) #define SUM_VER ((u_long)1) #define SUM_PROC ((u_long)1) struct inp_args { int number1; int number2; }; extern bool_t xdr_args(); The next le is the client side le client.c, which makes the rpc call. /********************************************************************** * This is the main client routine which makes a remote procedure call. **********************************************************************/ #include <stdio.h> #include <rpc/rpc.h> #include "sum.h" main(argc, argv) int argc; char *argv[]; { struct inp_args input; int result; int status; fprintf(stdout, "Input the two integers to be added: "); fscanf(stdin,"%d %d", &(input.number1), &(input.number2)); status = callrpc(argv[1], SUM_PROG, SUM_VER, SUM_PROC, xdr_args, &input, xdr_int, &result); if (status == 0) fprintf(stdout,"The sum of the numbers is %d\n", result); else fprintf(stdout,"Error in callrpc\n"); Draft: v.1, April 4, 1994 CHAPTER 7. REMOTE PROCEDURE CALLS 324 } The following is the listing of the server.c which has the server side routines. /********************************************************************** * This file has the server routines. Main registers and calls a * dispatch routine which dispatches the incoming RPC requests. **********************************************************************/ #include <rpc/rpc.h> #include "sum.h" main() { int *sum(); if (registerrpc(SUM_PROG, SUM_VER, SUM_PROC, sum, xdr_args, xdr_int) == -1){ printf(" Error in registering rpc\n"); return(1); } svc_run(); } int *sum(input) struct inp_args *input; { static int result; result = input->number1 + input->number2; return(&result); } The last le is the xdr.c which has a xdr routine for encoding and decoding arguments. /****************************************************************** * This file has the XDR routine to encode and decode the arguments. Draft: v.1, April 4, 1994 7.5. SUN RPC 325 ******************************************************************/ #include <rpc/rpc.h> #include "sum.h" bool_t xdr_args(xdr_handle, obj_handle) XDR *xdr_handle; struct inp_args *obj_handle; { if (xdr_int(xdr_handle, &obj_handle->number1) == FALSE) return(FALSE); return(xdr_int(xdr_handle, &obj_handle->number2)); } The advantages of using High-Level RPC programming routines is, it is easy to implement a network service since these routines hide all the network details and provide a transport-independent interface. The disadvantages of using these routines are it is highly inexible, it doesn't allow user to specify the type of transport to use (defaults to UDP), it doesn't allow user to specify the timeout period for callrpc() routine (default of 25 seconds is used). Low-Level RPC Programming Low-Level RPC programming refers to the lowest layer of the RPC programming interface. This layer gives most control over the RPC mechanism and is the most exible layer. As mentioned before RPC uses transport protocols for communication between the processes on dierent machines. To maximize its independence from transports, RPC Library interacts with transports indirectly via sockets. 
Socket is a transient object used for interprocess communication. RPC Library supports two types of sockets: datagram sockets and stream sockets. A datagram socket provides an interface to datagram transport service and a stream socket provides an interface to a virtual circuit transport service. Datagram transports are fast but unreliable whereas stream sockets are slower but more reliable than the datagram sockets. Two abstractions of the RPC Library that isolate the user from the transport layer are transport handle of type SVCXPRT and the client handle of type CLIENT. Usually these are passed to the RPC routines and are not accessed by the user directly. When a RPC-based application is being developed using low-level routines the following needs to be done on server side. 1. Get a transport handle. Draft: v.1, April 4, 1994 326 CHAPTER 7. REMOTE PROCEDURE CALLS 2. Register the service with the portmapper. 3. Call the library routine to dispatch RPCs. 4. Implement the dispatch routine. and on the client side, the following needs to be done. 1. Get a client handle. 2. Call the remote procedure. 3. Destroy the client handle when done. Each of the above steps is explained in detail below. Server side routines Three routines can be used to get a transport handle viz. svcudp create(), svctcp create(), svcraw create(). As it is clear from the names, svcudp create() is used to create a transport handle for the User Datagram Packet (UDP) transport, svctcp create() gets a transport handle for the Transmission Control Protocol (TCP) transport and the function svcraw create gets an handle for the raw transport. Synopses of these routines are as follows. SVCXPRT * svcudp_create(sock) int sock; SVCXPRT * svctcp_create(sock, sendsz, recvsz) int sock; u_long sendsz, recvsz; SVCXPRT * svcraw_create() sock is an open socket descriptor. Once the transport handle is available, the service needs to be registered, which is done using svc register(). The synopsis of the routine is: bool_t svc_register(xprt, prognum, versnum, dispatch, protocol) SVCXPRT *xprt; u_long prognum, versnum; void (*dispatch)(); u_long protocol; Draft: v.1, April 4, 1994 7.5. SUN RPC 327 This routine associates the program number, prognum, version number versnum with the service dispatch procedure, dispatch(). If protocol is nonzero, a mapping of triple (prognum, versnum, protocol) to xprt->xp port is established wiht the local portmapper. The synopsis of dispatch routine is, void dispatch(request, xprt) struct svc_req *request; SVCXPRT xprt; The argument request is the service structure, whihc contains the program number, version number and the procedure number associated with the incoming RPC request. This dispatch routine is invoked by the RPC Library when a request associated with this routine comes. The RPC Library routines svc getargs() is used to decode the arguments to a procedure and svc sendreply() is used to send the results of RPC to the client. Client side routines Four routines clnt create(), clntudp create(), clnttcp create(), clntraw create() are used to create a client handle. clntudp create(), clnttcp create(), clntraw create() get a handles for UDP, TCP and raw transport respectively. Synopsis of clnt create() is given below and synopses of other routines are similar. CLIENT * clnt_create(host, prognum, versnum, protocol) char *host; u_long prognum, versnum; char *protocol; host identies the remote host where the service is located. prognum and versnum are used to identify the remote program. 
protocol refers to the kind of transport used and is either \udp" or \tcp". Once a client handle is obtained, the procedure clnt call() can be used to initiate a remote procedure call. The synonpsis of this routine is enum clnt_stat clnt_call(clnt_handlep, procnum, inproc, in, outproc, out, timeout) CLIENT *clnt_handlep; u_long procnum; xdrproc_t inproc, outproc; char in, out; struct timeval timeout; Draft: v.1, April 4, 1994 328 CHAPTER 7. REMOTE PROCEDURE CALLS This routine calls the remote procedure procnum associated with client handle clnt handlep. in is the address of the input arguments and the out is the address of the memory location where the output arguments are to be placed. inproc encodes the procedure's arguments and outproc decodes the returned values. timeout is the time allowed for the results to come back. The following is an example of an application written using low-level RPC routines. The header le sum.h and the XDR routine le xdr.c are same as in the program written using high-level routines. Only the main les server.c and client.c are given below. The steps given above are commented in the program. #include <stdio.h> #include <rpc/rpc.h> #include "sum.h" main() { SVCXPRT *xport_handle; void dispatch(); /* 1. Get the transport handle. */ xport_handle = svcudp_create(RPC_ANYSOCK); if (xport_handle == NULL){ fprintf(stderr,"Error. Unable to create transport handle\n"); return(1); } /* 2. Register the service. */ (void)pmap_unset(SUM_PROG, SUM_VERS); if(svc_register(xport_handle, SUM_PROG, SUM_VERS, dispatch,IPPROTO_UDP) == FALSE){ fprintf(stderr,"Error. Unable to register the service.\n"); svc_destroy(xport_handle); return(1); } /* 3. Call the dispatch routine. */ svc_run(); fprintf(stderr,"Error. svc_run() shouldn't return\n"); svc_unregister(SUM_PROG, SUM_VERS); svc_destroy(xport_handle); return(1); } Draft: v.1, April 4, 1994 7.5. SUN RPC 329 /* 4. Implement the dispatch routine. */ void dispatch(rq_struct, xport_handle) struct svc_req *rq_struct; SVCXPRT *xport_handle; { struct inp_args input; int *result; int *sum(); switch (rq_struct->rq_proc){ case NULLPROC: svc_sendreply(xport_handle, xdr_void, 0); return; case SUM_PROC: if (svc_getargs(xport_handle, xdr_args, &input) == FALSE){ fprintf(stderr,"Error. Unable to decode arguments.\n"); return; } result = sum(&input); if (svc_sendreply(xport_handle, xdr_int,result) == FALSE){ fprintf(stderr,"Error. Unable to send the reply.\n"); return; } break; default: svcerr_noproc(xport_handle); break; } } int *sum(input) struct inp_args *input; { static int result; result = input->number1 + input->number2; Draft: v.1, April 4, 1994 CHAPTER 7. REMOTE PROCEDURE CALLS 330 return(&result); } The following is the client program which makes a remote procedure call. #include <stdio.h> #include <rpc/rpc.h> #include "sum.h" struct timeval timeout = { 25, 0 }; main(argc, argv) int argc; char *argv[]; { CLIENT *clnt_handle; int status; struct inp_args input; int result; fprintf(stderr, "Input the two integers to be added: "); fscanf(stdin,"%d %d", &(input.number1), &(input.number2)); /* 1. Get a client handle. */ clnt_handle = clnt_create(argv[1], SUM_PROG, SUM_VERS, "udp"); if (clnt_handle == NULL){ fprintf(stderr,"Error. Unable to create client handle.\n"); return(1); } /* 2. Make an RPC call. */ status = clnt_call(clnt_handle, SUM_PROC, xdr_args, &input, xdr_int, &result, timeout); if (status == RPC_SUCCESS) fprintf(stderr,"Sum of the numbers is %d\n", result); else fprintf(stderr,"Error. RPC call failed.\n"); /* 3. 
Destroy the client handle when done. */ clnt_destroy(clnt_handle); Draft: v.1, April 4, 1994 7.5. SUN RPC 331 } 7.5.4 Additional Features of RPC This subsection discusses the additional features of SUN RPC like asynchronous RPC and broadcast RPC. Asynchronous RPC In a normal RPC discussed so far, a client sends a request to the server and waits for a reply from the server. This is synchronous RPC. In constrast, asynchronous RPC is one in which after sending a request to the server, a client does not wait for a reply, but continues execution. If it needs to obtain a reply, it has to make some other arrangements. There are three mechanisms in RPC Library to handle asynchronous RPC, viz. nonblocking RPC, callback RPC and asynchronous broadcast RPC. Nonblocking RPC can be used in simple situations when a one-way message passing scheme is needed. For example, when synchronization messages needs to be sent, it would be okay even if some of the messages are lost. Retransmissions and acknowledgements would be unnecessary. This type of nonblocking RPC could be accomplished by setting timeout value in the clnt call() zero which would indicate the client to timeout immediately after making the call. Callback RPC is the most powerful of the asynchronous methods. It allows fully asynchronous RPC communication between clients and servers by enabling any application to be both a client and a server. In order to initiate a RPC callback, the server needs a program number to call the client back on. Usually client registers a callback service, using a program number in the range 0x40000000-0x5FFFFFFF, with the local portmap program and registers the dispatch routine. The program number is sent to the server as part of the RPC request. The server uses this number, when it is ready to callback the client. The client must be waiting for the callback request. To improve the performance, the client can send the port number instead of program number to the server. Then the server need not send a request to the client side portmapper for the port number. If the client calls svc run() to wait for the callback requests, it will not be able to do any other processing. Another process needs to be spawned that calls svc run() and waits for the requests. In Broadcast RPC, the client sends a broadcast packet for a remote procedure to the network and waits for numerous replies. Broadcast RPC treats all unsuccessful responses as garbage and lters them out without passing the results from such responses to the user. The remote procedures that support broadcast RPC typically respond only when the request is successfully processed and remain silent when they detect an error. Broadcast RPC requests are sent to the portmappers, so only services Draft: v.1, April 4, 1994 332 CHAPTER 7. REMOTE PROCEDURE CALLS that register themselves with their portmapper are accessible via the broadcast RPC mechanism. The routine clnt broadcast() is used to do broadcast RPC. If a client has many requests to send but does not need to wait for a reply until the server has received all the requests, the requests can be sent in a batch to reduce the network overhead. Batch-mode RPC is suitable for streams of requests that are related but make more sense to structure them as seperate requests rather than one large one. The RPC requests to be queued must not, themselves, expect any replies. They are sent with a time-out value of zero as with non-blocking RPC. 
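To make the broadcast mechanism concrete, the following sketch (not from the original text) pings the null procedure of every reachable server registered under the SUM program and version numbers used in this section's examples, printing the address of each machine that replies. It assumes the classic SunOS clnt_broadcast() interface.

    #include <stdio.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>          /* inet_ntoa() */
    #include <rpc/rpc.h>
    #include <rpc/pmap_clnt.h>      /* clnt_broadcast() */

    #define SUM_PROG ((u_long)0x20000000)
    #define SUM_VERS ((u_long)1)

    /* Called once for each reply; returning FALSE keeps collecting
     * replies, returning TRUE would end the broadcast early. */
    static bool_t
    eachresult(caddr_t out, struct sockaddr_in *addr)
    {
        (void)out;                  /* the null procedure returns no results */
        printf("reply from %s\n", inet_ntoa(addr->sin_addr));
        return FALSE;
    }

    int main(void)
    {
        enum clnt_stat stat;

        /* Broadcast a call to procedure 0 (the null procedure) of every
         * SUM server whose portmapper answers on the local network. */
        stat = clnt_broadcast(SUM_PROG, SUM_VERS, NULLPROC,
                              (xdrproc_t)xdr_void, (caddr_t)NULL,
                              (xdrproc_t)xdr_void, (caddr_t)NULL,
                              eachresult);
        if (stat != RPC_SUCCESS && stat != RPC_TIMEDOUT)
            clnt_perrno(stat);
        return 0;
    }

Because broadcast requests are delivered to the portmappers, only servers that have registered with their host's portmapper can be reached this way, as noted above.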
7.5.5 RPCGEN Rpcgen is a program that assists in developing RPC-based distributed applications by generating all the routines that interact with RPC Library, thus relieving the application developers of the network details etc. Rpcgen is a compiler which takes in code written in interface specication language, called RPC Language (RPCL) and generates code for client stubs, server skeleton and XDR routines. Client stubs act as interfaces between the actual clients and the network services. Similarly server skeleton hides the network from the server procedures invoked by remote clients. Thus all the user needs to do is to write server procedures and link them with server skeleton and XDR routines generated by rpcgen to get an executable server program. Similarly for using a network service, the user has to write client programs that make ordinary local procedure calls to the client stubs. The gure ?? shows how a client program and server program are obtained from client application, remote-program protocol specication and server procedures. The rest of this section describes how a simple RPC-based application can be constructed using rpcgen. The protocol specication for the program is given below. /************************************************************************ * This is the protocol specification file for a remote procedure sum * which is a trivial procedure, taking in a struture of two integers and * returning the sum of them. ************************************************************************/ /***************************************************************** * The structure for passing the input arguments to the remote * procedure. *****************************************************************/ struct inp_args { int number1; int number2; Draft: v.1, April 4, 1994 7.5. SUN RPC 333 }; /**************************************************************** * The following is the specification of the remote procedure. ****************************************************************/ program SUM_PROG{ version SUM_VERS_1{ int SUM(inp_args) = 1; } = 1; } = 0x20000000; It denes one procedure, SUM, for a version, SUM VERS 1, of the remote program, SUM PROG. These three values uniquely identify a remote procedure. When this is compiled using rpcgen, we obtain the following. 1. A header le sum.h that hash denes SUM PROG, SUM VERS 1 and SUM. It also contains the declarations of the XDR routine, in this case it is xdr inp args(). 2. A le sum clnt.c that has the client stub routines which interact with the RPC Library. 3. A le sum svc.c which has server skeleton. The skeleton consists of main() routine and a dispatch routine sum prog 1(). Notice that main tries to create transport handles for both the UDP and TCP transports. If only one type of transport needs to be created, command line options for rpcgen should indicate that. The dispatch routine dispatches the incoming remote procedure calls to the appropriate procedure. 4. A le sum xdr.c which has the XDR routine for encoding and decoding to and from XDR representation. The listing of all the les generated when sum.x is compiled by rpcgen are given below. The following is the header le sum.h. /* * Please do not edit this file. * It was generated using rpcgen. */ #include <rpc/types.h> Draft: v.1, April 4, 1994 CHAPTER 7. 
REMOTE PROCEDURE CALLS 334 struct inp_args { int number1; int number2; }; typedef struct inp_args inp_args; bool_t xdr_inp_args(); #define SUM_PROG ((u_long)0x20000000) #define SUM_VERS_1 ((u_long)1) #define SUM ((u_long)1) extern int *sum_1(); The following is the sum clnt.c /* * Please do not edit this file. * It was generated using rpcgen. */ #include <rpc/rpc.h> #include "sum.h" /* Default timeout can be changed using clnt_control() */ static struct timeval TIMEOUT = { 25, 0 }; int * sum_1(argp, clnt) inp_args *argp; CLIENT *clnt; { static int res; bzero((char *)&res, sizeof(res)); if (clnt_call(clnt, SUM, xdr_inp_args, argp, xdr_int,&res,TIMEOUT) != RPC_SUCCESS) { return (NULL); } return (&res); } Draft: v.1, April 4, 1994 7.5. SUN RPC 335 The following is the le sum svc.c which has the server skeleton and the dispatch routine. /* * Please do not edit this file. * It was generated using rpcgen. */ #include <stdio.h> #include <rpc/rpc.h> #include "sum.h" static void sum_prog_1(); main() { register SVCXPRT *transp; (void) pmap_unset(SUM_PROG, SUM_VERS_1); transp = svcudp_create(RPC_ANYSOCK); if (transp == NULL) { fprintf(stderr, "cannot create udp service."); exit(1); } if (!svc_register(transp, SUM_PROG, SUM_VERS_1, sum_prog_1, IPPROTO_UDP)) { fprintf(stderr, "unable to register (SUM_PROG, SUM_VERS_1, udp)."); exit(1); } transp = svctcp_create(RPC_ANYSOCK, 0, 0); if (transp == NULL) { fprintf(stderr, "cannot create tcp service."); exit(1); } if (!svc_register(transp, SUM_PROG, SUM_VERS_1, sum_prog_1, IPPROTO_TCP)) { fprintf(stderr, "unable to register (SUM_PROG, SUM_VERS_1, tcp)."); exit(1); } Draft: v.1, April 4, 1994 336 CHAPTER 7. REMOTE PROCEDURE CALLS svc_run(); fprintf(stderr, "svc_run returned"); exit(1); /* NOTREACHED */ } static void sum_prog_1(rqstp, transp) struct svc_req *rqstp; register SVCXPRT *transp; { union { inp_args sum_1_arg; } argument; char *result; bool_t (*xdr_argument)(), (*xdr_result)(); char *(*local)(); switch (rqstp->rq_proc) { case NULLPROC: (void) svc_sendreply(transp, xdr_void, (char *)NULL); return; case SUM: xdr_argument = xdr_inp_args; xdr_result = xdr_int; local = (char *(*)()) sum_1; break; default: svcerr_noproc(transp); return; } bzero((char *)&argument, sizeof(argument)); if (!svc_getargs(transp, xdr_argument, &argument)) { svcerr_decode(transp); return; } result = (*local)(&argument, rqstp); if (result != NULL && !svc_sendreply(transp, xdr_result, result)) { Draft: v.1, April 4, 1994 7.5. SUN RPC 337 svcerr_systemerr(transp); } if (!svc_freeargs(transp, xdr_argument, &argument)) { fprintf(stderr, "unable to free arguments"); exit(1); } return; } The last le which is generated by the rpcgen is sum xdr.c whose listing is given below. /* * Please do not edit this file. * It was generated using rpcgen. */ #include <rpc/rpc.h> #include "sum.h" bool_t xdr_inp_args(xdrs, objp) XDR *xdrs; inp_args *objp; { if (!xdr_int(xdrs, &objp->number1)) { return (FALSE); } if (!xdr_int(xdrs, &objp->number2)) { return (FALSE); } return (TRUE); } As we have seen in the previous subsections, all these routines were written by the user when high-level or low-level RPC routines were used while developing an application. The following gives the client main program which makes an ordinary procedure call to sum and the sum procedure itself which is called by the server skeleton. Draft: v.1, April 4, 1994 CHAPTER 7. 
REMOTE PROCEDURE CALLS 338 #include <stdio.h> #include <rpc/rpc.h> #include "sum.h" main(argc, argv) int argc; char *argv[]; { CLIENT *clnt_handle; int *result; inp_args input; clnt_handle = clnt_create(argv[1], SUM_PROG, SUM_VERS_1, "udp"); if (clnt_handle == NULL){ printf("Unable to create client handle.\n"); return(1); } fprintf(stdout,"Input the integers to be added : "); fscanf(stdin,"%d %d", &(input.number1), &(input.number2)); result = sum_1(&input, clnt_handle); if (result == NULL) fprintf(stdout,"Remote Procedure Call failed\n"); else fprintf(stdout,"The sum of the numbers is %d\n", *result); } The following is the sum procedure itself. #include <stdio.h> #include <rpc/rpc.h> #include "sum.h" int *sum_1(input) inp_args *input; { static int result; Draft: v.1, April 4, 1994 7.5. SUN RPC 339 Figure 7.15: result = input->number1 + input->number2; return(&result); } The gure 7.15 shows how the executable client and server are constructed. Draft: v.1, April 4, 1994 Bibliography [1] A. D. BirreIl, B. J. Nelson, "Implementing Remote Procedure Calls," ACM Transactions on Computer Systems, Feb. 1984 [2] S. Wilbur, B. Bacarisse, "Building Distributed Systems with Remote Procedure Calls," Software Engineering Journal, Sept. 87 [3] B. N. Bershad, D. T. Ching, et al, "A Remote Procedure Call Facility for Interconnecting Heterogeneous Computer Systems," IEEE Transactions on Software Engineering, Aug. 1987 [4] G. Coulouris, "Distributed Systems: Concepts and Design." [5] J. Bloomer, \Ticket to Ride: Remote Procedure Calls in a Network Environment," Sun World, November 1991, pp. 39-35. [6] S. K. Shrlvastava, F. Panzlerl, "The Design of a Rellable Remote Procedure Call Mechanism," IEEE Transactions on Computers, July 1982 [7] A. Tanenbaum, Modern Operating Systems, [8] W. Richard Stevens, UNIX Network Programming, Ch. 18 [9] M. Mallett, A Look at Remote Procedure Calls, BYTE, May 1991 [10] Gibbons, P. B. "A Stub Generator for Multilanguage RPC in Heterogeneous Environments", IEEE transactions on Software Engineering, Vol. SE-13, No. 1, Janua,y 1987, pp 77-86. [11] Bershad, B. N., Ching, D. T., Lazowska, E. D., Sanislo, J. and Schwartz, M. "A Remote Procedure Call Facility for Interconnecting Heterogeneous Computer Systems", IEEE Transaction on Software Engineering, Vol. SE-13, No. 8 August 1987, pp 872-893. 341 342 BIBLIOGRAPHY [12] Hayes, Roger and Schlichting, R. D.,"Facilitating Mixed Language Programming in Distributed System", IEEE Transactions on Software Engineering, Vol. SE-13, No. 12, December 1987. [13] Tanenbaum, A. 5., "Modern Operating Systems", Prentice Hall, ISBN 0-13588187-0 [14] Yemini, 5. A., Goldszmidt, G. 5., Stoyenko, A. D., Wei, Y, "CONSERT : A High-Level-Lan. guage Approach to Heterogeneous Distributed Systems", IEEE conference on distributed systems l989.pp 162-171. [15] Bloomer, John, "Ticket to Ride", Sunworld, November 1991, 39-55. [16] Birrell, Andrew D., and Nelson, Bruce Jay, "Implementing Remote Procedure Calls", XEROX CS 83-7, October 1983. [17] Coulouris, George F. ,and Dollimore, Jean, "Distributed Systems", AddisonWesley Publishing Co., 1988, 81-113. [18] McManis, Chuck and Samar, Vipin, "Solaris ONC, Design and Implementation of Transport-Independent RPC", Sun Microsystems Inc., 1991. [19] Sun Microsystems, "Network Programming Guide", Part Number 800385010, Revision A of 27 March, 1990, 31-165. Draft: v.1, April 4, 1994 Chapter 8 Interprocessor Communication: Message Passing and Distributed Shared Memory 343 CHAPTER 8. 
8.1 Introduction
In the development of a distributed computing architecture, one of the primary issues that a design must address is how to coordinate the communication activity between concurrently executing processes. The various possible forms of communication can be classified into one of two basic categories. The first, known as synchronous communication, requires that the sending and receiving processes maintain a one-to-one interaction with each other in order to coordinate their activities. In this scenario, when a message is output, the sending process waits for the receiving process to respond with an acknowledgement before continuing with its own processing. A corresponding activity occurs in the process that receives the message: at that end of the communication, the process must explicitly wait for a message to be received before sending out an acknowledgement for it. The second class of communication is known as asynchronous communication. In asynchronous communication, the sender of a message never has to wait for the receiver's response before proceeding with its execution. If a sender transmits multiple messages to a receiver, however, the receiving side must be designed to have sufficient buffer space to store the incoming message(s), or the message(s) may be lost. The implementation of either synchronous or asynchronous communication depends upon the type of model used to define the process interaction. There are three types of communication models: remote procedure calls (RPC), message passing, and monitors. In the previous chapter, we addressed the design issues for remote procedure call systems. In this chapter we focus on message passing tools and distributed shared memory.

8.2 Message Passing
A message is simply a physical and logical piece of information that is passed from one process to another. A typical message usually consists of a header and trailer, which define the source and destination addresses as well as error-checking information, and the body that contains the information to be transmitted. In this context, the message value is moved from one place to another where it can be edited or modified; this is equivalent to passing information by value. The implementation of interprocess communication using the message passing scheme can be characterized into two different categories. The first of these is known as the "Dialogue Approach". In this technique, a user must first establish a temporary connection with the desired resource and then gain permission to access that resource. The user's request then initiates the response and activity of the resource; the resource, however, is not controlled by the user. The main reason for choosing this technique is that most of the communicating processes exist in different computing environments, where it is difficult to share global variables or make references across environments. Furthermore, this scheme simplifies the task of allowing a large number of processes to communicate with one another simultaneously. The second category for implementing message passing is known as the "Mailbox System Approach". In this approach messages are not sent directly between the sending and receiving processes, but are instead redirected into an intermediate buffer called a mailbox.
This allows information to be transferred without concern for whether or not it can be processed immediately, since it can be stored until a process is ready to access it. By creating an environment based on this principle, neither the sender nor the receiver is restricted to a specific number of messages output or processed in a given time period, which allows greater freedom in their attempts to communicate.

This section describes and compares different software tools available to assist in developing parallel programs on distributed systems. In particular, it briefly describes and compares Linda, ISIS, Express, and PVM, and compares these software tools with the RPC mechanism. There is also a significant amount of work being done on changes to programming languages to develop a parallel FORTRAN or parallel C++.

8.2.1 Linda

Linda is a programming tool (environment) developed at Yale University and is designed to support parallel/distributed programming using a minimum number of simple operations [1, 2]. The Linda language emphasizes the use of replication and distributed data structures [3]. It uses a 'Tuple Space' as a method of passing data between processes and as a method of starting new processes that are part of the application [4]. In what follows, we describe the operators that are used to access the tuple space, how the tuple space has been designed on different systems to pass tuples between processors, and how tuples are used to generate new processes.

Linda uses the tuple space to communicate between all processes and to create new processes. To access the tuple space, Linda originally defined five operators and by 1989 had reduced this list to four; all of the operators are listed in Table 8.1.

Table 8.1: Linda operators
    OUT(u): generates a tuple u and inserts it into the Tuple Space (TS) without blocking the calling process.
    IN(e): finds a tuple in TS whose format matches the template e and returns that tuple, deleting it from TS. If no tuple matches the template, the calling process is blocked until another process generates a matching tuple.
    READ(e): same as IN() except that the tuple remains in TS.
    INP(e): same as IN() except that if no matching tuple is found, FALSE is returned and the process is not blocked (removed from Linda by 1989).
    READP(e): same as INP() except that the tuple remains in TS (removed from Linda by 1989).
    EVAL(u): used to create live tuples (added by 1989).

The tuple defined in an OUT() call can be thought of as a list of parameters that are being sent to another process. The templates in the IN(), READ(), INP(), and READP() calls are a list of actual and/or formal parameters. An actual parameter contains a value of a given type, and a formal parameter consists of a defined variable that can be assigned a value from a tuple. In order for a tuple to match the template, they must have the same structure (same number, type, and order of parameters), and any actual parameters in the template must match the values in the tuple. The result of a tuple matching a template is that the formal variables in the template are assigned the values in the tuple. The implementation of these operators and of the Tuple Space that handles them is what is required to implement Linda on a system. The EVAL() call can be thought of as the method of generating a new concurrent process.
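To make these semantics concrete, the sketch below emulates a single-slot tuple space with POSIX threads. This is not the Linda runtime; it models only one hypothetical ("sum", integer) tuple, but it shows the essential contract of the operators just listed: OUT() deposits a tuple and returns immediately, while IN() blocks until a matching tuple exists and then removes it.

/* Minimal sketch of Linda's OUT/IN semantics, emulated with pthreads.
   Assumes a single ("sum", int) tuple slot; real Linda matches arbitrary
   templates against a full tuple space.                                  */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  avail = PTHREAD_COND_INITIALIZER;
static int tuple_present = 0;
static int tuple_value;

void out_sum(int v)               /* OUT("sum", v): deposit, never block */
{
    pthread_mutex_lock(&lock);
    tuple_value = v;
    tuple_present = 1;
    pthread_cond_signal(&avail);
    pthread_mutex_unlock(&lock);
}

int in_sum(void)                  /* IN("sum", ?v): block until a match, then delete it */
{
    pthread_mutex_lock(&lock);
    while (!tuple_present)
        pthread_cond_wait(&avail, &lock);
    int v = tuple_value;
    tuple_present = 0;            /* IN removes the tuple from the space */
    pthread_mutex_unlock(&lock);
    return v;
}

static void *producer(void *arg) { out_sum(42); return NULL; }

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    printf("consumed %d\n", in_sum());   /* blocks until the producer's OUT */
    pthread_join(t, NULL);
    return 0;
}

A real Linda implementation must, of course, match arbitrary templates against a distributed store of tuples, which is exactly the distributed-implementation problem discussed next.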
The Tuple Space, in general, is a shared data pool where tuples are stored, searched, retrieved, and deleted. Because all of the processes must be able to search the tuple space, retrieve a tuple, and delete the tuple as an atomic operation, implementing Linda on a shared memory multiprocessor system is simpler than on a distributed system. In this section, we focus on how to implement Linda in a distributed computing environment.

Linda uses the tuple space to generate concurrently operating processes by generating a "live tuple" [4]. N. Carriero and D. Gelernter [4] describe a live tuple as a tuple that "carries out some specified computation on its own, independent of the process that generated it, and then turns into an ordinary, data object tuple". The OUT() or EVAL() call can manipulate a live tuple and start a concurrent process on another node. The advantage of Linda is that the application programmer only needs to write the processes that need to be run and to generate, in main, the live tuples that need to be started. When the data tuples a live tuple needs are available, it will be executed and will generate a data tuple as its output. Figure 8.1 shows an example of the dining philosophers problem in C-Linda as written up by N. Carriero and D. Gelernter [4].

phil (i)
int i;
{
    while (1) {
        think ();
        in ("room ticket");
        in ("chopstick", i);
        in ("chopstick", (i+1) % Num);
        eat ();
        out ("chopstick", i);
        out ("chopstick", (i+1) % Num);
        out ("room ticket");
    }
}

initialize ()
{
    int i;

    for (i = 0; i < Num; i++) {
        out ("chopstick", i);
        eval (phil (i));
        if (i < (Num - 1))
            out ("room ticket");
    }
}

Figure 8.1: C-Linda solution to the dining philosophers problem.

8.2.2 ISIS

ISIS is a message passing tool developed by K. Birman and others at Cornell University to provide a fault-tolerant, simplified programming paradigm for distributed systems [6, 4]. In this reference, an application uses a client-server paradigm in which the servers run on all of the workstations and the client dispatches jobs to the servers and accumulates results. Because of ISIS's fault tolerance, as long as the client stays alive and at least one server is running at any time, the run will continue. ISIS automatically keeps track of all of the processes currently running under it. If a node crashes, the client is notified and the lost process is restarted. ISIS supports a virtually synchronous paradigm that allows a process to infer the state and actions of remote processes using local state information [6]. ISIS was designed to be a fault tolerant system and, due to its fault tolerance, was not optimized for performance. This lack of performance in the initial release was the main area of concentration in improving the system.

8.2.3 Express

Express is a message passing tool that was developed by Parasoft. Express consists of compilers, libraries, and development tools. The development tools include a parallel debugging environment, a performance analysis system, and an automatic parallelization tool (ASPAR) [7]. One of the main goals of Express [8] was that the host operating system would not change and that it could be used with code already available.

      PROGRAM EX1
C
C-- Start up Express
C
      CALL KXINIT
C
      WRITE (6,*) 'Hello World'
      STOP
      END

Figure 8.2: Express 'Hello world' program.
For example, if Express is loaded onto a UNIX system, it does not change the system when running single-processor programs or commands, unlike a parallel operating system. As for using code that is already available, Express builds on top of languages such as FORTRAN and C rather than trying to replace them. Some of the more useful features of Express are the different operating modes and the ability to run both Host-Node applications and Cubix applications.

The operating modes supported in Express simplify the process of allowing a single program to be run on all of the nodes. An example of using single mode is the first example program in the users guide [8], a simple 'Hello world' program (illustrated in Figure 8.2). This is the same code that could be written and run on a serial system, and when run under Express the output looks the same as the serial result (even if run on multiple processors). This works because Express automatically synchronizes the processes and only prints to the screen once when in single mode (KSINGL(UNIT), the default start-up mode). Single mode is designed to allow the application to request data once for all of the nodes, get the data, and send the data to all of the nodes. In multiple mode (KMULTI(UNIT)), each processor can make its own requests; however, the requests are buffered until a KFLUSH command is reached. The KFLUSH command becomes a synchronization barrier for the application. Once all of the nodes have reached the KFLUSH command, the outputs (and/or inputs) are performed in order from processor 0 upward. The use of multiple mode can be seen in example 2 in the users guide [8] (Figure 8.3). Express also supports asynchronous I/O (KASYNC(UNIT)), which allows outputs/inputs to occur as soon as they are produced by a processor.

      PROGRAM EX2
C
      INTEGER ENV(4)
C
C-- Start up EXPRESS
      CALL KXINIT
C
C-- Get runtime parameters
      CALL KXPARA(ENV)
C-- Read a value
      WRITE(6,*) 'Enter a value'
      READ (5,*) IVAL
C-- Now have each processor execute independently
      CALL KMULTI(6)
      WRITE(6,10) ENV(1), ENV(1), IVAL, ENV(1)*IVAL
 10   FORMAT(1X,'I am node',I4,' and ',I4,' times',I4,' equals ',I4)
      CALL KFLUSH(6)
C
      STOP
      END

Figure 8.3: Express simple multimode program.

By starting out in single mode, the program can get all of the 'common' runtime inputs. Then, by switching into multiple mode and using the ENV variable (which is initialized by Express when it starts the programs), it is possible to split up the calculations across all of the processors working on the job. This can be done without requiring any message passing by the application program. The ability to switch modes automatically simplifies the synchronization barriers and thus simplifies parallel/distributed programming. Furthermore, Express supports two types of programming models: the Host-Node and Cubix models. In a Host-Node application, there is a host program that usually gets the user inputs, distributes the inputs to all of the node processes, gathers the results from the node processes, and outputs the final results. This style of parallel/distributed programming is also called client-server. In a Cubix application, only one program is written and it is executed on all of the processors working on the job.
This does not mean that all the processors execute the same instruction stream; they might branch differently depending on the processor number assigned at start-up. It is in the Cubix style of programming that the modes become powerful. Without the ability to distinguish modes, an application in Cubix mode would end up running like a host-node application with all of the code compiled into one program. The ability to switch between styles is powerful because, when starting to write parallel programs (especially when using RPC), it is usually easier to put programs into a host-node style. On the other hand, the Cubix style can be very useful for porting serial code into Express, because it allows the code to request the input parameters just as the serial program does.

8.2.4 Parallel Virtual Machine (PVM)

PVM is a message passing tool that supports the development of parallel/distributed applications for a 'collection of heterogeneous computing elements' [9]. To enhance the PVM programming environment, HeNCE (Heterogeneous Network Computing Environment) has been developed [11]. HeNCE is an X-window based application that allows the user to describe a parallel program by a task graph. In what follows, we briefly describe how PVM selects a computing element to execute a process, how inter-process communication is done in PVM, and how HeNCE helps the application programmer write a parallel application.

PVM selects a computing element on which to run a process by using a component description file. The term 'component' describes a part of a program that can be executed remotely. The component description file is a list of component names, locations, object file locations, and architectures (an example component description file is shown in Figure 8.4, copied from [9]). When an application requests that a component be executed, this table is used to look up the type of computer on which it can run and where to find the executable.

    Name      Location   Object file              Architecture
    factor    iPSC       /u0/host/factor1         ipsc
    factor    msrsun     /usr/alg/math/factor     sun3
    factor    msrsun     /usr/alg4/math/factor    sun4
    chol      csvax2     /usr/matrix/chol         vax
    chol      vmsvax     JOE:CHOL.OBJ             vms
    tool      msrsun     /usr/view/graph/obj      sun3
    factor2   iPSC       /u0/host/factor1         ipsc

Figure 8.4: PVM component description file.

A point of interest is the way the component 'factor' is represented by three different executables for three different architectures. This allows PVM to select the best computer on which to start the process, based on what computers are available, what architectures the component can execute on, and a load factor for the machines that are available and in a designated architecture. Another point about the component description file is that a specific architecture can be selected by creating a new component with the same executable. For example, factor2 (in Figure 8.4) is the same executable as factor for the ipsc architecture; however, if an instance of factor2 is requested, it must execute on a computer defined to be in the ipsc architecture. The calling component can then communicate with the new process by using the component name and an instance number (the instance number is returned when the call is made to start a component as a new process).
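The selection step described above can be sketched as a table lookup. The structure fields mirror the columns of Figure 8.4, but the type and function names (component_entry, host_load, pick_host) and the load values are illustrative assumptions, not part of the PVM interface; PVM's own selection logic is internal to the system.

#include <stdio.h>
#include <string.h>

/* One row of a component description file, as in Figure 8.4 (illustrative). */
struct component_entry {
    const char *name;        /* component name, e.g. "factor"       */
    const char *location;    /* host on which the object file lives */
    const char *objfile;     /* path of the executable              */
    const char *arch;        /* architecture tag, e.g. "sun3"       */
};

/* Hypothetical per-host load figures; a real system would measure these. */
struct host_load {
    const char *location;
    double      load;
};

/* Pick the lowest-load host that has an executable for the component.
   If arch is non-NULL, only entries with that architecture tag qualify,
   mirroring the way "factor2" pins the ipsc architecture.              */
static const struct component_entry *
pick_host(const struct component_entry *tab, int ntab,
          const struct host_load *loads, int nloads,
          const char *name, const char *arch)
{
    const struct component_entry *best = NULL;
    double best_load = 1e9;

    for (int i = 0; i < ntab; i++) {
        if (strcmp(tab[i].name, name) != 0)
            continue;
        if (arch && strcmp(tab[i].arch, arch) != 0)
            continue;
        for (int j = 0; j < nloads; j++) {
            if (strcmp(loads[j].location, tab[i].location) == 0 &&
                loads[j].load < best_load) {
                best = &tab[i];
                best_load = loads[j].load;
            }
        }
    }
    return best;
}

int main(void)
{
    struct component_entry tab[] = {
        { "factor", "iPSC",   "/u0/host/factor1",      "ipsc" },
        { "factor", "msrsun", "/usr/alg/math/factor",  "sun3" },
        { "factor", "msrsun", "/usr/alg4/math/factor", "sun4" },
    };
    struct host_load loads[] = { { "iPSC", 0.8 }, { "msrsun", 0.2 } };

    const struct component_entry *e = pick_host(tab, 3, loads, 2, "factor", NULL);
    if (e)
        printf("run %s on %s (%s)\n", e->objfile, e->location, e->arch);
    return 0;
}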
PVM also allows an application program to 'register' a component name by using the entercomp() function, which has four parameters: the component name, the object filename, the object file location, and an architecture tag (the same information that is stored in the component description file). A simple example of how an application component starts a new process was presented by V. Sunderam in [9] and is reproduced in Figure 8.5. In this example, ten copies of the component 'factor' are started and the instance numbers are stored in an array. The initiate() call asynchronously starts a process of the component type; if the new process may only be started after another process has ended, the initiateP() function will do this.

enroll ("startup");
for (i = 0; i < 10; i++)
    instance[i] = initiate ("factor");

Figure 8.5: How to initiate multiple component instances in PVM.

PVM communication is based either on a simple message passing mechanism using the PVM send() and recv() functions or on shared memory constructs using shmget, shmat, shmdt, and shmfree. In what follows, we first discuss the PVM message passing mechanisms and then the shared memory constructs. The send() function requires three parameters: a component name, the instance number, and a type. The recv() function requires one parameter: a type. The type parameter is used to order messages received by the receiving process. The component name and instance number are translated automatically by PVM into the processor that is running that process. An example of the use of the PVM send and receive functions was given by V. Sunderam [9] and is reproduced in Figure 8.6.

One important difference between PVM and most heterogeneous communication systems (such as SUN XDR) is that PVM does not translate into a machine-independent format when sending and translate back to the machine-dependent format on receiving. According to G. Geist and V. Sunderam [10], PVM chooses a 'standard' format based on the format that is common to the majority of the computers in the pool, and all communication is done in this format. This is based on the theory that in most environments there is one architecture to which most of the computers adhere, and this method allows them to communicate with no translation being done. PVM also contains two variations on the recv() function: recv1() and recv2(). If recv() is used, the process is blocked until a message of that type is received. If recv1() is used, the process is blocked until either a message of the type is received or a maximum number of messages of other types have been received. Finally, if recv2() is used, the process is blocked until either a message of the type is received or a time-out has been reached. PVM also provides a broadcast primitive that sends a message to all instances of a specified component.

Along with the message passing functions there is a set of shared memory functions. The use of the shared memory functions in PVM must be fully thought out before using them to develop parallel/distributed applications [9], because on distributed computing systems the shared memory must be emulated by PVM, and this leads to a degradation in performance. The PVM shared memory functions are similar to the UNIX shared memory inter-process functions. First, the shared memory segment is allocated by using the function shmget() with two parameters: a name and a size. Next the memory segment is attached to using a shmat function.
Here PVM provides three different versions: shmat(), shmatfloat(), and shmatint(). The shmat() function attaches an untyped block of memory, shmatfloat() attaches a block of memory that holds floating point values, and shmatint() attaches an array of integers. An example of using PVM shared memory communication is shown in Figure 8.7 [9]. In this example, the shmatfloat() function is used because it is storing an array of floating point numbers.

/* Sending Process */
initsend();                          /* Initialize send buffer   */
putstring("The square root of ");    /* Store values in          */
putint(2);                           /*   machine independent    */
putstring("is ");                    /*   form                   */
putfloat(1.414);
send("receiver", 4, 99);             /* Instance 4; type 99      */

/* Receiving Process */
char msg1[32], msg2[4];
int num;
float sqnum;

recv(99);                            /* Receive msg of type 99   */
getstring(msg1);                     /* Extract values in        */
getint(&num);                        /*   a machine specific     */
getstring(msg2);                     /*   manner                 */
getfloat(&sqnum);

Figure 8.6: PVM user data transfer.

/* Process A */
if (shmget("matrx", 1024))
    error();                                  /* Allocation failure        */
while (shmatfloat("matrx", fp, "RW", 5));     /* Try to lock & map segment */
for (i = 0; i < 256; i++)
    *fp++ = a[i];                             /* Fill in shmem segment     */
shmdtfloat("matrx");                          /* Unlock & unmap region     */

/* Process B */
while (shmatfloat("matrx", fp, "RU", 5));     /* Lock; note: reader may    */
                                              /*   lock before writer      */
for (i = 0; i < 256; i++)
    a[i] = *fp++;                             /* Read out values           */
shmdtfloat("matrx");                          /* Unlock & unmap region     */
shmfree("matrx");                             /* Deallocate mem segment    */

Figure 8.7: PVM shared memory communication.

After the shared memory segment is attached, it is accessed using the pointer supplied in the shmat call. Once the process is done accessing the shared memory, it must perform a detach operation to release the memory and allow another process to access it. The detach function used is the counterpart of the attach function; for example, if shmatint() is used, shmdtint() should be used. Finally, the last process to access the shared memory segment should free the memory by calling the shmfree() function. PVM also has lock facilities that work in a similar manner. In general, the PVM communication functions are similar to the standard UNIX library functions for inter-process communication, and therefore should simplify porting multiprocess single-workstation programs into a multiprocessor environment.

Finally, HeNCE is available to assist the application programmer in developing an application for a group of heterogeneous computers on a network. HeNCE is based on having the programmer explicitly specify the parallelism of the application. HeNCE uses a graphical interface and a graph of how the functions are interrelated to define the parallelism in the application. The HeNCE graphs have constructs for loops, fans, pipes, and conditional execution. An application consists of a graph of the functions and the source code for the functions. HeNCE then obtains the parameters of the functions using HeNCE library routines (which use PVM to initiate processes and for communication). HeNCE creates 'wrappers' for each function and compiles the wrappers into the final executable. The wrappers perform all of the process initiation and communication.
HeNCE achieves fault tolerance through checkpointing [11]. HeNCE is an interesting tool, and its proposed future enhancements will allow writing hierarchical graphs (graphs containing graphs).

8.2.5 Discussion

It is interesting to see how diverse the programming environments are. Linda tries to make the distributed system look like a shared memory system. ISIS concentrated on reliability, detecting node failures, and recovering from them. Express and PVM concentrated on getting the best performance possible and on migrating code already written, but even they went about that objective differently: Express uses its modes to perform synchronization transparently, while PVM uses functions that have much in common with the UNIX inter-process communication functions. Because of these diverse goals, it is difficult to judge any one environment to be best for all applications.

When comparing any of the four to writing a parallel program using RPC calls, all four provide valuable resources that make writing the application easier. When writing an application using RPC calls, the application programmer must figure out a way to start up the remote processes (possibly using 'rsh' calls) and needs to handle starting communication processes if asynchronous communication is to be used. Another complication that all four of these environments help with is the problem of ending the processes cleanly in an orderly fashion. This is complicated when using RPC because all communication is synchronous; therefore, without writing a significant amount of code, it is sometimes difficult to distinguish between lost packets, processes that died prematurely, and processes that ended normally.

8.3 Distributed Shared Memory

A Distributed Shared Memory (DSM) system is a mechanism that provides a virtual address space shared among processes running on loosely coupled computing systems. There are two kinds of parallel computers: tightly coupled shared-memory multiprocessors and loosely coupled distributed-memory multiprocessors. A shared-memory system simplifies the programming task, since it is a natural extension of a single-CPU system. However, this type of multiprocessor has a serious bottleneck: main memory is usually accessed via a common bus, a serialization point that limits system size to a few processors. Distributed-memory systems, on the other hand, scale up very smoothly provided the designers choose the network topology carefully. However, the programming model is limited to the message-passing paradigm, since the processors communicate with each other by exchanging messages via the communication network. This implies that the application programmers must take care of information exchange among the processes by using the communication primitives explicitly, as described in the previous section. The distributed shared memory scheme tries to combine the advantages of both systems by providing a virtual address space shared among processes running on a distributed-memory system. The shared-memory abstraction gives these systems the illusion of physically shared memory and allows programmers to use the shared-memory paradigm. In this section, we highlight the fundamental issues in designing and implementing a distributed shared memory system and identify the parameters influencing the performance of DSM systems. After discussing the key issues and algorithms for DSM systems, a few existing DSM systems are described briefly.
Traditionally, communication among processes in a distributed system is based on the data-passing model. Message-passing systems or systems that support remote procedure calls adhere to this model. The data-passing model logically extends the underlying communication mechanism of the system; port or mailbox abstractions along with primitives such as Send and Receive are used for interprocess communication. This functionality can also be hidden in language-level constructs, as with RPC mechanisms. In either case, distributed processes pass shared information by value.

In contrast to the data-passing model, the shared memory model provides processes in a system with a shared address space. Application programs can use this space in the same way they use normal local memory; that is, data in the shared space is accessed through Read and Write operations. As a result, applications can pass shared information by reference. The shared memory model is natural for distributed computations running on tightly coupled systems. For loosely coupled systems, no physically shared memory is available to support such a model (see Figure 8.8). However, a layer of software or a modification of the hardware can provide a shared memory abstraction to the applications. The shared memory model applied to loosely coupled systems is referred to as Distributed Shared Memory (DSM).

The advantages offered by DSM include the ease of programming and portability achieved through the shared-memory programming paradigm, the low cost of distributed-memory machines, and the scalability resulting from the absence of a hardware bottleneck. These advantages have made distributed shared memory the focus of recent study and have prompted the development of various algorithms for implementing the shared data model. However, to be successful as a programming model on loosely coupled systems, the performance of this scheme must be at least comparable to that of the message-passing paradigm.

DSM systems have goals similar to those of CPU cache memories in shared memory multiprocessors, local memories in shared memory multiprocessors with nonuniform memory access (NUMA) times, distributed caching in network file systems, and distributed databases. In particular, they all attempt to minimize the access time to potentially shared data that is to be kept consistent. Consequently, many of the algorithmic issues that must be addressed in these systems are similar. Although these systems therefore often use algorithms that appear similar from a distance, their details and implementations can vary significantly because of differences in the cost parameters and the way they are used. For example, in NUMA multiprocessors, the memories are physically shared and the time differential between accesses of local and remote memory is lower than in distributed systems, as is the cost of transferring a block of data between the local memories of two processors.

Distributed shared memory has been implemented on top of physically non-shared memory architectures. The Distributed Shared Memory described here is a layer of software providing a logically shared memory. As Figure 8.8 shows, distributed shared memory (DSM) provides a virtual address space shared among processes in a loosely coupled distributed multiprocessor system.

8.3.1 DSM Advantages

The advantages offered by DSM are the ease and portability of programming.
It is a simple abstraction which provides the application programmer with a single virtual address space. Hence a program written sequentially can be transferred to a distributed system without drastic changes. DSM hides the details of the remote communication mechanism from processes, so that programmers do not have to worry about data movement or about marshalling and unmarshalling procedures in the program. This simplifies the programming task substantially. Another important advantage is that complex data structures containing pointers can be passed by reference in DSM; in the data-passing model, by contrast, complex data structures have to be transformed to fit the format of the communication primitives. One can view DSM as a step toward enhancing the transparency provided by a distributed system.

Figure 8.8: Distributed shared memory system: processors with their local memories, connected by a network, present a single shared virtual address space.

8.3.2 DSM Design Issues

A survey of current DSM systems conducted by Nitzberg et al. [13] identified the important issues in designing and implementing DSM. In this subsection, we focus on the issues that must be addressed by the designers of DSM systems. A DSM designer must make choices regarding memory structure, granularity, access, coherence semantics, scalability, and heterogeneity.

Structure and granularity: The structure and granularity of a DSM system are closely related. Structure refers to the layout of the shared data in memory. Most DSM systems do not structure memory, but some structure the data as objects, language types, or even an associative memory. Granularity refers to the size of the unit of sharing: byte, word, page, or complex structure. In systems implemented using the virtual memory hardware of the underlying architecture, it is convenient to choose a multiple of the hardware page size as the unit of sharing, which normally ranges from 256 bytes to 8K bytes. Ivy [14], one of the first transparent DSM systems, chose a memory granularity of 1K bytes. Hardware implementations of DSM typically support smaller grain sizes; for example, Dash used 16 bytes as the unit of sharing. Because shared-memory programs exhibit locality of reference, a process is likely to access a large region of its shared address space in a small amount of time. Therefore, large page sizes reduce paging overhead. However, sharing may also cause contention, and the larger the page size, the greater the likelihood that more than one process will require access to a page. A smaller page reduces the possibility of false sharing, which occurs when two unrelated variables are placed in the same page. To avoid thrashing, the result of false sharing, some DSM systems structure the memory: Munin [17] and the Shared Data-Object Model structure the memory as variables in the source languages, while Emerald, Choice, and Clouds [18] structure the memory as objects.

Scalability: A theoretical benefit of DSM systems is that they scale better because they are based on loosely coupled machines. However, the limits of scalability can be greatly reduced by two factors: central bottlenecks and operations that require global common knowledge and storage. To avoid these factors, the shared memory and the related information should be distributed among the processors as evenly as possible.
A central memory manager is therefore not preferred in terms of scalability.

Heterogeneity: Sharing memory between two different types of machines may seem infeasible because of the overhead of translating the internal data representations. It is somewhat easier if the DSM system is structured as variables or objects in the source language. In Agora, memory is structured as objects shared among heterogeneous machines. Mermaid explores another approach: memory is shared in pages, and a page can only contain one type of data. Although this works fairly well, requiring each page to contain only one type of data is rather rigid. Generally speaking, the overhead of conversion seems to outweigh the benefits gained from more computing power.

In distributed shared memory, memory mapping managers, sometimes called distributed shared memory controllers (DSMCs), implement the mapping between the local memories and the shared virtual memory address space. A DSMC must automatically transform shared memory accesses into interprocess communication. Other than mapping, their chief responsibility is to keep the address space coherent at all times; that is, the value returned by a read operation is always the same as the value written by the most recent write operation to the same address. A shared virtual memory address space is partitioned into pages. Pages that are marked read-only can have copies residing in the physical memories of many processors at the same time, but a page marked write can reside in only one physical memory. The memory mapping manager views its local memory as a large cache of the shared virtual memory address space for its associated processor. Like traditional virtual memory, the shared memory itself exists only virtually. A memory reference causes a page fault when the page containing the memory location is not in the processor's current physical memory. When this happens, the memory mapping manager retrieves the page either from disk or from the memory of another processor. If the page of the faulting memory reference has copies on other processors, then the memory mapping manager must do some work to keep the memory coherent before continuing the faulting instruction. Another key goal of distributed shared memory is to allow processes to execute on different processors in parallel. In order to do so, the appropriate process management must be integrated with the DSMC.

Coherence Semantics

For programmers to write correct programs on a shared memory machine, they must understand how parallel memory updates are propagated throughout the system. The most intuitive semantics for memory coherence is strict consistency. In a system with strict consistency, a read operation returns the most recently written value. However, strict consistency implies that accesses to the same memory location must be serialized, which becomes a bottleneck for the whole system. To improve performance, a few systems provide only a reduced form of memory coherence. Relaxed coherence semantics allow more efficient shared access because they require less synchronization and less data movement.
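The practical difference between these coherence levels can be seen in a small program. The sketch below uses C11 atomics on a single machine as a stand-in for DSM coherence: with sequentially consistent operations, a reader that observes the flag is guaranteed to observe the data as well, while under a relaxed ordering (the analogue of a weaker DSM coherence) that guarantee disappears and explicit synchronization operators would be needed. The variable names are illustrative.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

/* Shared data plus a "ready" flag: the classic pattern that programmers
   writing for strict or sequential consistency take for granted.        */
static atomic_int data  = 0;
static atomic_int ready = 0;

static void *writer(void *arg)
{
    atomic_store(&data, 42);          /* sequentially consistent store     */
    atomic_store(&ready, 1);          /* publish: data is now valid        */
    /* Under a weaker model, release/acquire operators or a lock would be
       required before the reader could rely on seeing data == 42.        */
    return NULL;
}

static void *reader(void *arg)
{
    while (!atomic_load(&ready))      /* spin until the flag becomes visible */
        ;
    /* With sequential consistency this prints 42.  With relaxed ordering
       on both sides, the value of data would not be guaranteed, which is
       exactly the pitfall of assuming a coherence level the system does
       not provide.                                                        */
    printf("data = %d\n", atomic_load(&data));
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    pthread_create(&r, NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}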
However, programs that depend on a stronger form of consistency may not perform correctly if executed on a system that supports only a weaker coherence. Figure 8.9 gives intuitive definitions of the various forms of coherence semantics. A single-CPU multitasking machine supports strict consistency, and a shared memory system can only support sequential consistency if data accesses are buffered. As an illustration, consider the diagram shown in Figure 8.10 [17]. R1 through R3 and W0 through W4 represent successive reads and writes, respectively, of the same data, and A, B, and C are threads attempting to access that data. Strict coherence requires that thread C at R1 read the value written by thread B at W2, and that thread C at R2 and R3 read the value written by thread B at W4. Sequential consistency, on the other hand, requires only that thread C at R1 and R2 read values written at any of W0 through W4 such that the value read at R2 does not logically precede the value read at R1, and that thread C at R3 read either the value written by thread A at W3 or by thread B at W4.

Figure 8.9: Memory coherence semantics.
    Strict consistency: a read returns the most recently written value.
    Sequential consistency: the result of any execution appears as some interleaving of the operations of the individual nodes when executed on a multithreaded sequential machine.
    Processor consistency: writes issued by each individual node are never seen out of order, but the order of writes from two different nodes can be observed differently.
    Weak consistency: the programmer enforces consistency using synchronization operators guaranteed to be sequentially consistent.
    Release consistency: weak consistency with two types of synchronization operators, acquire and release; each type of operator is guaranteed to be processor consistent.

Figure 8.10: A timing diagram for multiple memory requests to shared data.

DSM Algorithms

Among the factors that affect the performance of a DSM algorithm, migration and replication are the two most important parameters. The properties of these two factors were investigated by Zhou et al. [12], who used combinations of the two factors to develop four kinds of DSM algorithms: central-server, migration, read-replication, and full-replication. These algorithms all support strict consistency. Table 8.2 shows the characteristics of these four algorithms in terms of migration and replication.

Table 8.2: Four distributed shared memory algorithms.
                     Nonreplicated     Replicated
    Non-migrating    Central server    Full replication
    Migrating        Migration         Read replication

Central-Server

The central-server strategy uses a central server that is responsible for servicing all accesses to shared data. Both read and write operations involve sending a request message to the data server by the process executing the operation, as depicted in Figure 8.11. The server receives the request and responds either with the data, in the case of a read request, or with an acknowledgment, in the case of a write request. This simplest model can be realized with "request and respond" communication. For reliability, a request is retransmitted after a timeout period. The server has to keep a sequence number for each request so that it can detect duplicates and acknowledge them correctly. A failure can be raised if there is no response after several time-out periods. The most important drawback of this strategy is that the server itself becomes a bottleneck.
The central server has to handle requests sequentially. Even when referencing data locally, a process still needs to contact the server, which increases communication activity. To distribute the server load, the shared data can be distributed over several data servers. The clients can use multicast to locate the right data server, but in this case the servers still have to deal with the same number of requests. A better solution is to partition the data by address and distribute the partitions to several hosts. Clients then need only a simple mapping to find the correct server.

Figure 8.11: Central server. (1) Client A requests data from the central server; (2) the central server sends the data to client A.

Migration

In this technique, the data is always migrated to the site where it is accessed, regardless of the type of operation (read or write); see Figure 8.12. This is a single-reader/single-writer (SRSW) protocol, and data is usually migrated in blocks. In Zhou et al. [12], a manager is assigned statically to each block and is responsible for recording the current location of that block. A client queries the manager of the block both to determine the current location of the data and to inform the manager that it will be the new owner of that block. If an application exhibits high locality of reference, the cost of block migration is amortized over multiple accesses. Another advantage of this scheme is that it can be integrated with the virtual memory system of the host.

Figure 8.12: Migration. (1) Client A determines the location of the desired data and sends a migration request; (2) client C migrates the data to client A as requested.

Read Replication

Neither of the previous approaches takes advantage of parallel processing. Replication can be added to the migration algorithm by allowing either a single site to hold a read/write copy of a block or multiple sites to hold read-only copies. This type of replication is referred to as "multiple readers/single writer" (MRSW) replication. Replication can reduce the average cost of read operations, since it allows read operations to be executed simultaneously and locally at multiple hosts. However, some of the write operations may become more expensive, since many replicas may have to be invalidated or updated to preserve data consistency. It is worthwhile if the ratio of read operations to writes is large. For a read operation on data that is not local, it is necessary to acquire a readable copy of the block containing that data, and to inform the site holding the writable copy to change it to read-only status, before the read operation can execute. For a write fault, caused either by not having the block or by not having the write access right, all copies of the same block must be invalidated before the write can proceed. Figure 8.13 shows the write-fault operation of this algorithm.

Figure 8.13: Read replication, write operation. (1) Client A determines the location of the desired data and sends a request for that page; (2) client C sends the requested data; (3) client A multicasts an invalidation for that block.

The main advantage of this method is that it reduces the average cost of read operations, but the overhead of invalidation for write faults is not justified if the ratio of reads over writes is small.
This strategy resembles the write-invalidate algorithm for cache consistency implemented in hardware in some multiprocessors.

Full Replication

The full-replication algorithm allows data blocks to be replicated even while they are being written, in other words "multiple readers/multiple writers" (MRMW). One possible way to keep the replicated data consistent is to globally sequence the write operations. In one implementation, when a process attempts to write, its intention is sent to a sequencer. The sequencer assigns the next sequence number to the request and multicasts the request to all of the replicated sites, and each site processes the write operations in sequence-number order. If the sequence number of an incoming request is not the one expected, a retransmission is requested from the sender; a negative acknowledgment and a log of previous requests are used for this purpose. Figure 8.14 shows the write operation of this method.

Figure 8.14: Full replication, write operation. (1) Client A wants to write and sends the data to the sequencer; (2) the sequencer orders the write request and multicasts the write data; (3) the local memory of all clients is updated.

Discussion

The performance of a parallel program on a distributed shared memory depends primarily on two factors: the number of parallel processes in the program and how often the shared data is updated. The central-server algorithm is the simplest implementation and may be sufficient for infrequent accesses to shared data, especially if the read/write ratio is low (that is, a high percentage of accesses are writes). This is often the case with locks, as will be discussed further below. In fact, locality of reference and a high block-hit ratio are present in a wide range of applications, making block migration and replication favorable. Although block migration is advantageous in this general case, the cost of simple migration is high. A potentially serious performance problem with these algorithms is block thrashing. For migration, thrashing takes the form of moving data back and forth in quick succession when interleaved data accesses are made by two or more sites; it thus does not exploit the merits of parallel processing. In contrast, the read-replication algorithm offers a good compromise for many applications. However, if the read and write accesses interleave frequently, thrashing takes the form of blocks with read-only permissions being repeatedly invalidated soon after they are replicated; read replication thus does not exploit locality to its full extent. The full-replication algorithm is suitable for small-scale replication and infrequent updates. Such situations indicate poor (site) locality in references. For many applications, shared data can be allocated and the computation partitioned such that thrashing is minimized. Application-controlled locks can also be used to suppress thrashing. In either case, the complete transparency of the distributed shared memory is compromised somewhat. In Zhou et al. [12], a series of theoretical comparative analyses among the above four algorithms is conducted under some assumptions about the environment. Below is a summary of this comparative study of these algorithms:

• Migration vs.
Read Replication: Typically, read replication effectively reduces the block fault rate because, in contrast to the migration algorithm, interleaved read accesses to the same block no longer cause faults. Therefore, one can expect read replication to outperform migration for the vast majority of applications.

• Read Replication vs. Full Replication: The relative performance of these two algorithms depends on a number of factors, including the degree of replication, the read/write ratio, and the degree of locality achievable in real applications. Generally speaking, full replication performs poorly for large systems and when the update frequency is high.

• Central Server vs. Full Replication: For a small number of sites and a fairly high read/write ratio, full replication performs better; but as the number of sites increases, the update costs of full replication catch up, and the preference shifts to the central server.

8.3.3 DSM Implementation Issues

Having been studied extensively since the early 1980s, DSM systems have been implemented using three basic approaches (some systems use more than one approach):

1. hardware implementations that extend traditional caching techniques to scalable architectures;

2. operating system and library implementations that achieve sharing and coherence through virtual memory-management mechanisms; and

3. compiler implementations where shared accesses are automatically converted into synchronization and coherence primitives.

For example, the Dash system, developed at Stanford University, is a hardware implementation based on a two-level mesh structure, and Memnet, developed at the University of Delaware, was built on a 200-Mbps token ring network. Plus, developed at Carnegie Mellon University, is a mixture of hardware and software implementation of DSM. Most distributed systems under development or already developed, such as Amoeba, Clouds, Mirage, and the V system, use operating system and library implementations. Some, like Mermaid and Linda, are representative of combinations of approaches (2) and (3).

Heterogeneity

Some designers are quite ambitious in trying to integrate several heterogeneous machines in a distributed system. The major hindrance here is that every machine may use a different representation of the basic data types. It is easier to handle this at the DSM compiler level, which can convert between the different representations used on different machines. However, the overhead of conversion seems to outweigh the benefits.

Dynamic centralized manager algorithm

The simplest way to locate data is to have a centralized server that keeps track of all shared data. In this scheme, a page does not have a fixed owner, and only the manager or centralized server knows who the owner is. The centralized manager resides on one node and maintains an information table with one entry per page, each entry having three fields:

1. Owner field: indicates which node owns this page.

2. Copy List field: lists all nodes that have copies of this page.

3. Lock field: indicates whether the page is currently being accessed or not.

As long as a read copy exists, the page is not writable without an invalidation operation. It is beneficial for a node that successfully writes to the page to also receive ownership of the page. When a node finishes a read or write request, a confirmation message is sent to the manager to indicate the completion of the request.
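A minimal sketch of the manager's information table in C, following the three fields just listed; the array sizes, the use of a simple flag for the lock field, and the omission of the actual message exchange are simplifying assumptions made for illustration.

#include <stdio.h>

#define MAX_NODES 64
#define NUM_PAGES 1024

/* One entry of the centralized manager's information table, with the
   three fields described above (owner, copy list, lock).             */
struct info_entry {
    int owner;                   /* node that most recently had write access  */
    int copy_list[MAX_NODES];    /* copy_list[n] != 0: node n holds a copy    */
    int locked;                  /* nonzero while a request is being serviced */
};

static struct info_entry info[NUM_PAGES];

/* Outline of the manager's handling of a read fault from `requester`:
   mark the entry busy, record the new copy holder, and (in a real
   system) ask the current owner to transfer the page; the message
   exchange and the requester's confirmation are omitted here.        */
void handle_read_fault(int page, int requester)
{
    info[page].locked = 1;
    info[page].copy_list[requester] = 1;
    /* ... request a copy of `page` from info[page].owner on behalf of
       `requester`, then wait for the requester's confirmation ...    */
    info[page].locked = 0;
}

int main(void)
{
    info[3].owner = 2;                 /* node 2 currently owns page 3 */
    handle_read_fault(3, 5);           /* node 5 takes a read fault    */
    printf("node 5 has a copy of page 3: %d\n", info[3].copy_list[5]);
    return 0;
}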
The centralized method suffers from two drawbacks: (1) the server serializes location queries, reducing parallelism, and (2) the server may become heavily loaded and slow the entire system.

Fixed distributed manager algorithm

For large N, there might be a bottleneck at the manager processor because it must respond to every page fault. Instead of using a centralized manager, a system can distribute the manager load evenly over several nodes in a fixed manner:

H(p) = (p / s) mod N    (8.1)

where p is an integer page number, N is the number of processors, and s is the number of pages per segment. With this approach, when a fault occurs on page p, the faulting processor asks processor H(p) where the page owner is, and then proceeds as in the centralized manager algorithm.

Broadcast distributed manager algorithm

With this strategy, each processor manages the pages that it owns. In this scheme, when a read fault occurs, the faulting processor p sends a broadcast read request, and the true owner responds by adding p to its copy list and sending a copy of the page to p. Similarly, when a write fault occurs, the faulting processor sends a broadcast write request, and the owner gives up ownership and migrates the page and the copy list to the requester. When the faulting processor receives ownership, it invalidates all the copies.

Dynamic distributed manager algorithm

The heart of the dynamic distributed manager algorithm is keeping track of the ownership of all pages in each processor's local page table. The owner field of each entry is replaced by a probable-owner field. Initially, this field is set to the initial owner of the page. In this algorithm, when a processor has a page fault, it sends a request to the processor indicated by the probable-owner field. If that processor is not the true owner, it forwards the request to the processor indicated by its own probable-owner field. This procedure continues until the true owner is found.

Coherence Protocol

All systems support some level of coherence. From the programmer's point of view, it would be best for the system to support strict consistency. However, that would require all accesses to the same shared data to be serialized, which would degrade the performance of the system. Moreover, a parallel program specifies only a partial order, not a linear order, on the events within the program, so a relaxed coherence is sufficient for parallel applications. For example, Munin realizes only weak consistency, and Dash supports release consistency. To further increase parallelism, virtually all DSM systems replicate data. Two types of protocols for handling write faults, write-invalidate and write-update, can be used to enforce coherence. The write-invalidate method is the same as that of read replication, and write-update is the same as that of full replication. Most DSM systems use write-invalidate coherence protocols. The main reason, as suggested by Li [14], is the lack of hardware support and the inefficiency caused by network latency. However, a hardware implementation of DSM can use write-update freely (e.g., Plus). Munin uses type-specific coherence, a scheme tailored to the different access patterns of data; for example, Munin uses write-update to keep coherent data that is read much more frequently than it is written.
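The two write-fault protocols can be contrasted with a small sketch over a copy list. The data structures and function names below are illustrative, and the messages a real DSM system would send to the other nodes are reduced to comments.

#include <stdio.h>

#define MAX_NODES 8

/* Per-block bookkeeping: which nodes hold a copy, plus a per-node local
   value standing in for the block's contents (illustrative).            */
static int has_copy[MAX_NODES];
static int value[MAX_NODES];

/* Write-invalidate: before the write proceeds, every other copy of the
   block is invalidated, leaving the writer as the single valid holder.  */
void write_invalidate(int writer, int new_value)
{
    for (int n = 0; n < MAX_NODES; n++)
        if (n != writer)
            has_copy[n] = 0;          /* send an invalidation message to n */
    has_copy[writer] = 1;
    value[writer] = new_value;
}

/* Write-update: the new value is multicast to every node that holds a
   copy, so all replicas stay valid and readable.                        */
void write_update(int writer, int new_value)
{
    has_copy[writer] = 1;
    for (int n = 0; n < MAX_NODES; n++)
        if (has_copy[n])
            value[n] = new_value;     /* send an update message to n */
}

int main(void)
{
    has_copy[0] = has_copy[1] = has_copy[2] = 1;
    write_update(0, 7);               /* nodes 0, 1, and 2 all see 7     */
    write_invalidate(1, 9);           /* only node 1 keeps a valid copy  */
    printf("node 2 copy valid after invalidate: %d\n", has_copy[2]);
    return 0;
}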
Page Replacement Policy

In systems that allow data to migrate around the system, two problems arise when the available space for "caching" shared data fills up: which data should be replaced to free space, and where should it go? In choosing the data item to be replaced, a DSM system works much like the caching system of a shared-memory multiprocessor. However, unlike most caching systems, which use a simple least-recently-used or random replacement strategy, most DSM systems differentiate the status of data items and prioritize them. For example, priority is given to shared items over exclusively owned items, because the latter have to be transferred over the network, while simply deleting a read-only shared copy of a data item is possible because no data is lost. Once a piece of data is to be replaced, the system must make sure it is not lost. In the caching system of a multiprocessor, the item would simply be placed in main memory. Some DSM systems, such as Memnet, use an equivalent scheme: the system transfers the data item to a "home node" that has statically allocated space to store a copy of the item when it is not needed elsewhere in the system. This method is simple to implement, but it wastes a lot of memory. An improvement is to have the node that wants to delete the item simply page it out onto the disk. Although this does not waste any memory space, it is time consuming. Because it may be faster to transfer something over the network than to transfer it to a disk, a better solution is to keep track of free memory in the system and simply page the item out to a node with space available for it.

Thrashing

DSM systems are particularly prone to thrashing. For example, if two nodes compete for write access to a single data item, it may be transferred back and forth at such a high rate that no real work can get done (a ping-pong effect). To avoid thrashing with two competing writers, a programmer could specify the type as write-many and the system would use a delayed write policy. When nodes compete for access to the same page, one can stop the ping-pong effect by adding a dynamically tunable parameter to the coherence protocol. This parameter determines the minimum amount of time a page will remain available at one node.

Related algorithms

To support a DSM system, synchronization operations and memory management must be specially tuned. Semaphores, for example, are typically implemented on shared memory systems by using spin locks. In a DSM system, a spin lock can easily cause thrashing, because multiple nodes may heavily access the shared data. For better performance, some systems provide specialized synchronization primitives along with DSM. Clouds provides semaphore operations by grouping semaphores into centrally managed segments. Munin supports a synchronization memory type with distributed locks. Plus supplies a variety of synchronization instructions and supports delayed execution, in which a synchronization operation can be initiated and then later tested for successful completion. Memory management can also be restructured for DSM. A typical memory allocation scheme (as in the C library malloc()) allocates memory out of a common pool, in which searching all of shared memory can be expensive. A better approach is to partition the available memory into private buffers on each node and to allocate memory from the global buffer space only when the private buffer is full.
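That last suggestion can be sketched as a two-level allocator: each node first carves allocations out of a private buffer and touches the shared, lock-protected global pool only when the private buffer is exhausted. The buffer sizes, the names, and the use of a mutex to model the globally shared pool are assumptions made for illustration; alignment handling and freeing are omitted.

#include <stddef.h>
#include <stdio.h>
#include <pthread.h>

#define PRIVATE_SIZE (64 * 1024)
#define GLOBAL_SIZE  (1024 * 1024)

/* Per-node private buffer: no synchronization needed to carve from it. */
static char   private_buf[PRIVATE_SIZE];
static size_t private_used;

/* Global pool shared by all nodes: every allocation from it must be
   synchronized, which is the expensive case the text recommends avoiding. */
static char            global_buf[GLOBAL_SIZE];
static size_t          global_used;
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

void *dsm_alloc(size_t n)
{
    if (private_used + n <= PRIVATE_SIZE) {      /* fast, local path      */
        void *p = private_buf + private_used;
        private_used += n;
        return p;
    }
    pthread_mutex_lock(&global_lock);            /* slow, shared path     */
    void *p = NULL;
    if (global_used + n <= GLOBAL_SIZE) {
        p = global_buf + global_used;
        global_used += n;
    }
    pthread_mutex_unlock(&global_lock);
    return p;                                    /* NULL if pool is full  */
}

int main(void)
{
    int *a = dsm_alloc(100 * sizeof(int));       /* served by the private buffer */
    printf("allocated at %p\n", (void *)a);
    return 0;
}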
The implementation issues discussed in this section are by no means complete. A good survey of the issues is given by Nitzberg et al. [13], which shows the various options for the design parameters. Table 8.3 summarizes the design parameters and the options adopted in several DSM projects.

Table 8.3: Survey of DSM design parameters (current implementation; structure and granularity; coherence semantics; coherence protocol; sources of improved performance; support for synchronization; heterogeneous support).

    Dash. Implementation: hardware, modified Silicon Graphics Iris 4D/340 workstations, mesh. Granularity: 16 bytes. Semantics: release. Protocol: write-invalidate. Performance: relaxed coherence, prefetching. Synchronization: queued locks, atomic incrementation and decrementation. Heterogeneous: no.
    Ivy. Implementation: software, Apollo workstations, Apollo ring, modified Aegis. Granularity: 1-Kbyte pages. Semantics: strict. Protocol: write-invalidate. Performance: pointer chain collapse, selective broadcast. Synchronization: synchronized pages, semaphores, event counts. Heterogeneous: no.
    Linda. Implementation: software, variety of environments. Granularity: tuples. Semantics: no mutable data. Protocol: varied. Performance: hashing.
    Memnet. Implementation: hardware, token ring. Granularity: 32 bytes. Semantics: strict. Protocol: write-invalidate. Performance: vectored interrupt support of control flow. Synchronization: ?. Heterogeneous: no.
    Mermaid. Implementation: software, Sun workstations, DEC Firefly multiprocessors, Mermaid/native operating system. Granularity: 8 Kbytes (Sun), 1 Kbyte (Firefly). Semantics: strict. Protocol: write-invalidate. Synchronization: messages for semaphores and signal/wait. Heterogeneous: yes.
    Mirage. Implementation: software, VAX 11/750, Ethernet, Locus distributed operating system, Unix System V interface. Granularity: 512-byte pages. Semantics: strict. Protocol: write-invalidate. Performance: kernel-level implementation, time window coherence protocol. Synchronization: Unix System V semaphores. Heterogeneous: no.
    Munin. Implementation: software, Sun workstations, Ethernet, Unix System V kernel and Presto parallel programming environment. Granularity: objects. Semantics: weak. Protocol: type-specific (delayed write-update for read-mostly protocol). Performance: delayed update queue. Synchronization: synchronized objects. Heterogeneous: no.
    Plus. Implementation: hardware and software, Motorola 88000, Caltech mesh, Plus kernel. Granularity: page for sharing, word for coherence. Semantics: processor. Protocol: nondemand write-update. Performance: delayed operations. Synchronization: complex synchronization instructions. Heterogeneous: no.
    Shiva. Implementation: software, Intel iPSC/2, hypercube, Shiva/native operating system. Granularity: 4-Kbyte pages. Semantics: strict. Protocol: write-invalidate. Performance: data structure compaction, memory as backing store. Synchronization: messages for semaphores and signal/wait. Heterogeneous: no.

8.4 Existing DSM Systems

The design and implementation techniques discussed in the previous section are common to all DSM systems, but different systems implement them in a variety of ways. For instance, Ivy is a software implementation of a DSM system. Another system, Clouds, uses concepts such as passive objects and threads of execution in its object-oriented approach. Hardware implementations of DSM also exist, as in the MemNet system. These systems are discussed briefly in the next subsections.

8.4.1 Ivy System

The Ivy system is one of the first DSM implementations; it provides strict consistency and a write-invalidate protocol. Li [14] presented two classes of algorithms, centralized and distributed, for solving the coherence problem. All the algorithms use replication to enhance performance. A prototype based on these algorithms was implemented on an Apollo ring. Many existing DSM systems are modifications of Ivy. Ivy supports strict consistency and a write-invalidate (MRSW) coherence protocol; the shared memory is partitioned into pages of 1K bytes. The memory mapping managers implement the mapping between the local memories and the shared virtual memory address space.
Other than mapping, their chief responsibility is to keep the address space coherent at all times. Li classified the algorithms by two factors: page synchronization and page ownership. The approaches to page synchronization are write-invalidate and write-update. As mentioned before, the authors argued that write-update is not feasible, since it requires special hardware support and network latency is high. The ownership of a page can be fixed or dynamic. The fixed strategy corresponds to algorithms that do not migrate the data; the dynamic methods are those that do migrate the data. The authors argued that the non-migration method is an expensive solution for existing loosely coupled multiprocessors and that it constrains desired modes of parallel computation. Thus they consider only algorithms with dynamic ownership and write invalidation, which correspond to the class of read-replication algorithms discussed previously. The authors further categorized the read-replication algorithms into three sets: centralized manager, fixed distributed manager, and dynamic distributed manager. All of the above algorithms are described in terms of fault handlers, their servers, and the data structure on which they operate. This data structure, referred to as the page table, is common to all the algorithms and contains three items for each page:

1. access: indicates the accessibility of the page,
2. copy set: contains the numbers of the processors that have read copies of the page, and
3. lock: synchronizes multiple page faults by different processes on the same processor and synchronizes remote page requests.

Centralized Manager

The centralized manager is similar to a monitor. The manager resides on a single processor and maintains a table called Info, which has one entry for each page; each entry has three fields:

1. The owner field contains the single processor that owns the page, namely, the most recent processor to have had write access to it.
2. The copy set field lists all processors that have copies of the page.
3. The lock field is used for synchronizing requests to the page.

Each processor keeps only two fields, access and lock, for each page. For a read fault, the processor asks the manager for read access to the page and requests a copy of it. The manager is responsible for asking the page's owner to send a copy to the requesting node. Before the manager is ready to process the next request, it must receive a confirmation message from the current requesting node. The confirmation message indicates the completion of the request, so that the manager can give the page to someone else. Write faults are processed in the same manner, except that the manager also needs to invalidate the read copies of the page. This algorithm can be improved by moving the synchronization of page ownership to the owners, thus eliminating the confirmation message to the manager. The copy set is then maintained by the current owner, and the locking mechanism on each processor deals not only with multiple local requests but also with remote requests.

Fixed Distributed Managers

In this method, every processor is given a predetermined subset of the pages to manage. The primary difficulty in such a scheme is choosing an appropriate mapping from pages to processors. The most straightforward approach is to distribute the pages evenly in a fixed manner to all processors. For example, given M pages and N processors, the hashing function H(p) = p mod N, where 0 < p <= M, can be used to distribute the pages, as illustrated in the sketch below.
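The following short sketch shows how a fault could be routed to a fixed manager chosen with H(p) = p mod N; the message format and the send primitive are illustrative assumptions, not part of Li's algorithm description.

# Hypothetical sketch: routing a page fault to its fixed distributed manager.
NUM_PROCESSORS = 8   # N, assumed for illustration

def manager_of(page_id, num_processors=NUM_PROCESSORS):
    """Return the processor that permanently manages this page, H(p) = p mod N."""
    return page_id % num_processors

def handle_fault(page_id, faulting_processor, want_write, send):
    """On a fault, ask the page's fixed manager to locate the current owner.

    `send(dest, message)` is an assumed messaging primitive of the system.
    """
    send(manager_of(page_id), {
        "type": "page_request",
        "page": page_id,
        "from": faulting_processor,
        "write": want_write,
    })

Because the page-to-manager mapping never changes, any processor can compute the manager of any page locally, without consulting a central table.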
To realize this method, some input from the programmer or the compiler might be needed to indicate the properties of a suitable mapping function.

Dynamic Distributed Managers

The heart of a dynamic distributed manager algorithm is keeping track of the ownership of all pages in each processor's local page table. To do this, the owner field is replaced with another field, probOwner, whose value can be either the current owner or an old owner of the page. As explained in the section on implementation issues, when a processor has a page fault, it sends a request to the processor indicated by the probOwner field for that page. If the receiver is the current owner, it replies to the requesting node with a copy of the page, as well as the copy set if it is a write request; otherwise, it forwards the request to the processor indicated by its own probOwner field for that page.

8.4.2 Mirage

Mirage was implemented in the kernel of an early version of Locus compatible with the UNIX System V interface specifications [16]. The approach used in Mirage differs from Ivy in the following ways:

1. The model is based on paged segmentation; the page size is 512 bytes.
2. The unit of sharing is a segment; the unit of coherence is a page.
3. A tuning parameter, delta, is provided to avoid thrashing.
4. It adopts the fixed distributed managers algorithm by assigning the creator of a segment as its manager or, in Mirage's terminology, the library site.
5. As mentioned above, it was implemented in the kernel of the operating system to improve performance.
6. The environment consists of three VAX 11/750s networked together using Ethernet, which is smaller than Ivy's environment.

8.4.3 Clouds

Clouds is an object-oriented DSM system that may seem unconventional in comparison to most other software or hardware implementations. It employs the concepts of threads and passive objects. Threads are flows of control that execute on Clouds objects. At any given time during its existence, a Clouds thread executes within the object it most recently invoked. Objects are logically cohesive groupings of data composed of variable-size segments. The Clouds system supports the mobility of segments, and therefore indirectly supports object mobility. When a thread on one machine invokes an object on another machine, one of two things can occur, depending on the specific implementation. One possibility is that the thread is reconstructed at the host containing the desired object. The reconstructed thread is then executed within the desired object, and the result is finally passed back to the calling thread (see Figure 8.15). The second possibility involves the migration of the invoked object to the host where the calling thread resides, after which the invocation is executed locally (see Figure 8.16).

[Figure 8.15: Object invocation through thread reconstruction. (1) Thread P's information is sent from host A to host B; (2) host B reconstructs thread P; (3) the reconstructed thread executes within the desired object; (4) the results are sent back to host A.]

[Figure 8.16: Object invocation through passing objects. (1) The request for the object is sent from host A to host B; (2) host B sends the object back to host A; (3) thread P is executed locally within the object.]
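A compact way to contrast the two invocation strategies is sketched below; the function and method names (run_thread_at, fetch_object) are hypothetical and do not correspond to actual Clouds interfaces.

# Illustrative sketch of the two Clouds-style invocation strategies described above.
def invoke(obj_ref, thread_state, strategy, transport):
    if strategy == "move_thread":
        # Approach 1: ship the thread's state to the host holding the object,
        # execute it there, and return only the result.
        return transport.run_thread_at(obj_ref.home_host, thread_state, obj_ref)
    if strategy == "move_object":
        # Approach 2: migrate the object's segments to the calling host and
        # perform the invocation locally.
        local_copy = transport.fetch_object(obj_ref)
        return local_copy.invoke(thread_state)
    raise ValueError("unknown invocation strategy: " + strategy)

The choice between the two is essentially a trade-off between moving control state (the thread) and moving data (the object's segments).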
Clouds supports variable object granularity, since objects are composed of variable-size segments; often, the page sizes of the host machines place a lower limit on segment size. The cache coherence protocol of Clouds applies to segments. It dictates that a segment must be disposed of when access to it has been completed. This eliminates the need for invalidations when a segment is written to, but it forces the re-fetching of the object when it is invoked again.

8.4.4 MemNet

[Figure 8.17: MemNet architecture. Hosts attach to a token ring through a MemNet device consisting of a bus interface and a MemNet cache.]

The goal of the designers of the MemNet system was to improve the poor data/overhead ratio of the interprocess communication (IPC) that is common in software implementations. They chose a token-ring-based hardware implementation. The computers in the MemNet system are not connected to the token ring directly but instead are connected via a "MemNet device" interface (Figure 8.17). When a shared memory reference to a specific "chunk" (32-byte block) of memory is made, it is passed to the MemNet device. If the device indicates that the request can be satisfied locally, then no network activity is required. If the chunk is currently resident elsewhere, a request message for the chunk is sent. What happens next is determined by the type of the request.

On a read request, a message requesting the desired chunk specifically for reading is composed and sent. It travels around the token ring and is inspected by each MemNet device in turn. When the request reaches the MemNet device that actually has the desired chunk, the request message is converted to a data message and the chunk data is copied into it. The message then travels back to the faulting host; the remaining devices on the ring ignore the message because they recognize that the data has already been filled in.

When a particular host wants to write to a non-resident chunk, it not only must receive the data from the current owner, but is also responsible for invalidating any other copies of the chunk that may exist at other MemNet devices. A write request is sent onto the network, and each MemNet device examines the request as it passes by. In a "snoopy cache" fashion, the ID of the desired chunk is checked in the write request message. If a copy of the chunk exists locally, the MemNet device invalidates it. The message then proceeds to the next host. Note that, by the time the message returns, even after the real owner has filled the data into the message, any MemNet device between the owner and the faulting host that had a copy of the chunk has invalidated its copy.

The final case is an invalidation message. This type of message is sent when a faulting MemNet device already contains the chunk it is about to write to; the message is sent strictly to invalidate duplicates that may exist in the system.
Its behavior is very similar to that of write requests, except that the data field of the message is never filled in by remote devices. The replacement policy for MemNet is random. Each MemNet device has a large amount of memory that is used to store replaced chunks, in much the same way that main memory is used to store replaced cache lines in a caching system.

8.4.5 System Comparison

The most distinctive of the three systems is Clouds, whose object-oriented approach leads to differences in the implementation of DSM basics. For instance, the memory structures of both IVY and MemNet are quite regular: a flat address space composed of fixed-size sections of memory (pages, chunks). The memory structure of the Clouds system, on the other hand, is quite irregular, consisting of variable-size segments. Keeping track of Clouds objects is therefore more difficult, in that regular "page table" type constructs cannot be used. The Clouds system also differs from the other two with regard to its page location and invalidation requirements. Clouds objects are basically always associated with their owners, so no real page location mechanism is necessary. Furthermore, since each object is relinquished to its owner when it is discarded, the problem of invalidation is also eliminated. However, the Clouds system cannot take advantage of locality of reference, as each remote object must be migrated every time it is referenced.

With its 32-byte chunks, MemNet exhibits the highest degree of parallelism of the three systems. Furthermore, because the hardware-based MemNet IPC can be orders of magnitude faster than the other schemes' software-based IPC, the expected overhead of keeping track of the myriad of chunks in the system is minimized. This speedup is further supported by MemNet's efficient technique of quickly converting request messages into data messages. All three systems use single-writer, multiple-reader protocols in which the most recent value written to a page is returned. Clouds, however, has a weak-read consistency option that returns the value of the object at the time of the read, with no guarantee of the atomicity of the operation. This gives the system the potential for more concurrency, but at the same time it places some of the burden of data consistency on the programmer.

Bibliography

[1] S. Ahuja, N. Carriero, and D. Gelernter, "Linda and Friends", IEEE Computer, vol. 19, August 1986, pp. 26-34.
[2] N. Carriero and D. Gelernter, "The S/Net's Linda Kernel", Proceedings of the Symposium on Operating System Principles, December 1985.
[3] L. Dorrmann and M. Herdieckerhoff, "Parallel Processing Performance in a Linda System", 1989 International Conference on Parallel Processing, pp. 151-158.
[4] N. Carriero and D. Gelernter, "Linda in Context", Communications of the ACM, vol. 32, no. 4, April 1989, pp. 444-458.
[5] R. Finch and S. Kao, "Coarse-Grain Parallel Computing Using the ISIS Toolkit", Journal of Computing in Civil Engineering, vol. 6, no. 2, April 1992, pp. 233-244.
[6] K. Birman and R. Cooper, "The ISIS Project: Real Experience with a Fault Tolerant Programming System", Operating Systems Review, ACM, vol. 25, no. 2, April 1991, pp. 103-107.
[7] J. Flower, A. Kolawa, and S. Bharadwaj, "The Express Way to Distributed Processing", Supercomputing Review, May 1991, pp. 54-55.
[8] Express Fortran User's Guide, Version 3.0, ParaSoft Corporation, 1990.
Sunderam, "PVM: A Framework for Parallel Distributed Computing", Concurrency: Practice and Experience, December 1990, pages 315-339 [10] G. Geist and V. Sunderam, "Experiences with network based concurrent computing on the PVM system", Technical Report ORNL/TM-11760, Oak Ridge National Laboratory, January 1991. 381 382 BIBLIOGRAPHY [11] A. Beguelin, J. Dongarra, 0. Geist, R. Manchek, and V. Sunderam, "Graphical Development Tools for Network-Based Concurrent Super-computing", Proceedings of Supercomputing 1991, pages 435-444. [12] M. Stumm and 5. Zhou, "Algorithms Implememting Distributed Shared Memory", Computer; Vol.23, No. 5, May 1990, pp. 54-64. [13] B. Nitzberg and V. Lo, "Distributed Shared Memory: A Survey of Issues and Algorithms", Computer, Aug. 1991, pp. 52-60. [14] K. Li and P. Hudak, "Memory Coherence in Shared Virtual Memory Systems", ACM Trans. Computer Systems, Vol.7, No. 4, Nov. 1989, pp. 321-359. [15] K. Li and R. Schaefer, "A Hypercube Shared Virtual Memory System", 1989 Inter. Conf. on Parallel Processing, pp. 125-132. [16] B. Fleisch and G. Popek, "Mirage : A Coherent Distributed Shared Memory Design", Proc. 14th ACM Symp. Operating System Principles, ACM ,NY 1989, pp. 211-223. [17] J. Bennet, J. Carter, and W. Zwaenepoel, "Munin: Distributed Shared Memory Based on Type-Specic Memory Coherence", Porc. 1990 Conf. Principles and Practice of Parallel Programming, ACM Press, New York, NY 1990, pp. 168176. [18] U. Ramachandran and M. Y. A. Khalidi, "An Implementation of Distributed Shared Memory", First Workshop Experiences with building Distributed and Multiprocessor Systems, Usenix Assoc., Berkeley, Calif., 1989, pp. 21-38. [19] M. Dubois, C. Scheurich, and F. A. Briggs, "Synchronization, Coherence, and Event Ordering in Multiprocessors", Computer, Vol. 21, No. 2, Feb. 1998, pp. 9-21. [20] J. K. Bennet, "The Design and Implementation of Distributed Smalltalk", Proc. of the Second ACM conf. on Object-Oriented Programming Systems, Languages and Applications, Oct. 1987, pp. 318-330. [21] R. Katz, 5. Eggers, D. Wood, C. L. Perkins, and R. Sheldon, "Implementing a Cache Consistency Protocol", Proc. of the 12th Annu. Inter. Symp. on Computer Architecture, June 1985, pp. 276-283. [22] K. Li and P. Hudak, "Memor,' Coherence in Shared Virtual Memorv Systems," ACM Trans. Computer Systems, Vol. 7, No.4, Nov. 1989, pp.321-359 Draft: v.1, April 21, 1994 BIBLIOGRAPHY 383 [23] P. Dasgupta, R. J. LeBlane, M. Ahamad, and U Ramachandran, "The Clouds Distributed Operating System," IEEE Computer, 1991, pp.34-44 [24] B. Fleich and G. Popek, "Mirage: A Coherence Distributed Shared Memory Design," Proc. 14th ACM Symp. Operating System Principles, ACM, New York, 1989,pp.21 1-223. [25] D. Lenoskietal, "The Directoiy-Based Cache Coherence Pro to col for the Dash Multiprocessor, "Proc. 17th Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 2047, 1990, pp. 148-159. [26] R. Bisiani and M. Ravishankar, "Plus: A Distributed Shared-Memoiy System," Proc. 17th Int'l Symp. Computer Architecture, WEE CS Press, Los Alamitos,Calif., Order No. 2047,1990, pp.115-124. [27] J. Bennett, J. Carter, And W. Zwaenepoel. "Munin: Distributed Shared Memory based on Type-Specic Memoiy Coherence, "Proc. 1990 Conf Principles and Practice of ParallelProgramming, ACM Press, New York, N.Y., 1990, pp.168176. [28] D. R. Cheriton, "Problem-oriented shared memoiy : a decentralized aproach to distributed systems design ",Proceedings of the 6th Internation Conference on Distributed Computing Systems. 
[29] J. M. Bernabeu Auban, P. W. Hutto, M. Y. A. Khalidi, M. Ahamad, W. F. Appelbe, P. Dasgupta, R. J. LeBlanc, and U. Ramachandran, "Clouds: A Distributed, Object-Based Operating System, Architecture and Kernel Implementation", European UNIX Systems User Group Autumn Conference, EUUG, October 1988, pp. 25-38.
[30] F. Armand, F. Herrmann, M. Gien, and M. Rozier, "Chorus, a New Technology for Building UNIX Systems", European UNIX Systems User Group Autumn Conference, EUUG, October 1988, pp. 1-18.
[31] G. Delp, "The Architecture and Implementation of Memnet: A High-Speed Shared Memory Computer Communication Network", doctoral dissertation, University of Delaware, Newark, Del., 1988.
[32] Zhou et al., "A Heterogeneous Distributed Shared Memory", to be published in IEEE Transactions on Parallel and Distributed Systems.

Chapter 9
Load Balancing in Distributed Systems

9.1 Introduction

A general purpose distributed computer system can be characterized as a networked group of heterogeneous, autonomous components that communicate through message passing. While these systems have the potential to deliver enormous amounts of computing capacity, much of this capacity is untapped because of the inability to share computing resources efficiently. For example, in experiments carried out on a cluster of workstations in a campus environment, it was found that the average utilization is as low as 10% [Kreug90]. A high degree of reliability and overall performance can be achieved if computers share the network load in a manner that better utilizes the available resources. Efficient resource allocation strategies should be incorporated into any distributed system so that a transparent and fair distribution of the load is achieved. These strategies can be implemented either at the user level, with the support of a high-level library interface, or at the microkernel level; this decision is left to the designers of the system.

Livny and Melman [13] showed that as the size of a distributed system grows, the probability of having at least one idle processor also grows. In addition, due to the random submission of tasks to the hosts and the random distribution of the service times associated with these tasks, it is often the case that some hosts are extremely loaded while others are lightly loaded. This load imbalance causes severe delays for the tasks scheduled on the busy hosts and therefore degrades the overall performance of the system. Load balancing strategies seek to rectify this problem by migrating tasks from heavily loaded machines to less loaded ones. Although the objectives of load balancing schemes seem fairly simple, the implementation of efficient schemes is not. Load balancing involves migrating a process (a task) to a node, with the objective of minimizing its response time as well as optimizing overall system performance and utilization. It also refers to cooperation among a number of nodes in processing the units of one meta-job, where the set of nodes is chosen in a manner that results in a better response time. An example of a situation where load balancing is obviously a necessity is in running one of today's software tools, such as a CAD package with sophisticated graphics and windowing requirements [?]. These tools are CPU intensive and have high memory requirements.
In a CAD application, a user needs fast responses when he or she modifies the design. It is evident that the CPU-intensive calculations would be better performed on a remote idle or lightly loaded workstation, while the full power of the user's workstation is dedicated to the graphics and display functions. Load balancing is also obviously needed when performing parallel computations on a distributed system.

In this chapter we present some general classifications and definitions that are often encountered when dealing with load balancing. We discuss static, adaptive, dynamic, and probabilistic load balancing and present some case studies, followed by a discussion of the important issues and properties of load balancing systems. We finally present some related work.

9.2 Classifications and Definitions

In this section we mainly follow [21] to clarify some of the terms and definitions often used in the load balancing literature and to present a taxonomy of the different schemes employed (see Figure 9.1).

[Figure 9.1: Casavant's classification of load balancing algorithms. Load balancing divides into static approaches (optimal, or sub-optimal via approximate or heuristic methods) and dynamic approaches (centralized, or distributed with cooperative or non-cooperative nodes).]

Load balancing falls under the general problem of scheduling, or resource allocation and management. Scheduling formulates the problem from the consumers' (users', tasks') point of view, whereas resource allocation views the problem from the resources' (hosts', computers') side. It is a matter of terminology; each is a different side of the same coin. In addition, the term global scheduling is often used to distinguish scheduling from local scheduling, which is concerned with activities within one node or processor. Our focus here is on global scheduling, which we will refer to simply as scheduling. Often in the literature we encounter the term load sharing in reference to the scheduling problem. In load sharing, the system load is shared by redistributing it among the hosts in the hope of achieving better overall system performance (in terms of response time, throughput, or accessibility to remote services). In performing load sharing, it is intuitive to make better use of idle, less busy, and more powerful hosts. In doing so, the end result is better utilization of the resources, faster response times, and less load imbalance. Therefore, one can think of load balancing as a subset of load sharing schemes. We will be concerned with scheduling strategies that are designed with the goal of reducing load imbalance to achieve better performance.

A survey of the literature uncovers a multitude of algorithms proposed as solutions to the load balancing problem. A first-level classification includes static and dynamic solutions. Static solutions require complete knowledge of the behavior of the application and of the state of the system (nodes, network, tasks), which is not always available. Therefore, a static solution designed for a specific application might ignore the needs of other application types and the state of the environment, which may result in unpredictable performance.
It is clear that dynamic load balancing policies, which assume very little a priori knowledge about the system and applications and which respond to the continuously changing environment and to the diverse requirements of applications, have a better chance of making correct decisions and of providing better performance than static solutions. While early work focused on static placement techniques, recent work has evolved toward adaptive load balancing. However, static solutions may still be exploited for dynamic load balancing, especially when the system is assumed to be in steady state or when the system has the capabilities or tools to determine ahead of time the effectiveness of a certain static solution on a given architecture [Manish's dissertation]. Given long-term load conditions and the network state, static strategies can be applied, while dynamic policies are used to react to short-term changes in the state of the system.

In performing either static or dynamic load balancing, optimal solutions, such as graph theoretic, mathematical programming, or space enumeration and search methods, are possible but usually computationally infeasible. In general, achieving optimal schedules is an NP-complete problem and is tractable only when certain restrictions are imposed on the program behavior or the system, such as restricting the weights of all tasks to be the same and the number of nodes (processors) to be two [40], or restricting the node weights to be mutually commensurable (i.e., all node weights are integer multiples of a weight t). Alternatively, a suboptimal algorithm can be devised to allocate the resources according to the observed state of the system. Such an algorithm attempts to approach optimality at a fraction of the cost and with less knowledge than is required by an optimal algorithm. Most algorithms proposed in the literature are of this kind, as it has been found that for practical purposes suboptimal algorithms offer satisfactory performance. Suboptimal solutions, which include either approximate or heuristic solutions, are less time consuming. Approximate solutions use the same models as the optimal solutions but require a metric that evaluates the solution space and stops when a good solution (not necessarily the optimal one) is reached. Heuristic solutions, on the other hand, use rules of thumb to make realistic decisions using whatever information is available about the state of the system.

Specific to dynamic load balancing is the issue of the responsibility for making and carrying out the decisions. In a distributed load balancing policy, this work is physically distributed among the nodes, whereas in a non-distributed policy the responsibility rests on the shoulders of a single processor. This brings up the issue of the level of cooperation and the degree of autonomy in distributed dynamic scheduling. Nodes can either fully cooperate to achieve a system-wide goal, or they can perform load balancing independently of each other. In addition, a distinction can be drawn between decentralized and distributed scheduling policies, where one (decentralized) is concerned with the authority and the other (distributed) with the responsibility for making and carrying out the decisions. Any system that has decentralized authority must also have distributed responsibility, but the opposite is not true. In centralized scheduling, on the other hand, a single processor holds the authority of control.
Especially in large-scale systems, both centralized and decentralized control can be used. Clusters are a common way to organize computer systems, and each cluster can be managed by a single centralized node. This creates a hierarchy of centralized managers across the whole system, where control is centralized within each cluster but decentralized across the system as a whole.

Another distinguishing characteristic of a scheduling system is how adaptive it is. In an adaptive system, the scheduling policy and the parameters used in modeling the decision-making process are modified depending on the previous and current responses of the system to the load balancing decisions. In contrast, a non-adaptive system does not change its control mechanism on the basis of the history of the system's behavior. Dynamic systems, on the other hand, simply consider the current state of the system while making decisions. If a dynamic system modifies the scheduling policy itself, then it is also adaptive. In general, any adaptive system is automatically dynamic, but the reverse is not necessarily true.

9.3 Load Balancing

In the next sections we discuss the different ways of performing load balancing: statically, dynamically, adaptively, and probabilistically. The sections on static load balancing and dynamic/adaptive load balancing are based on a survey conducted by [Harget and Johnson].

9.3.1 Static Load Balancing

Static load balancing is concerned with finding an allocation of processes to processors that minimizes the execution cost and the communication cost incurred by the load balancing strategy. General optimal solutions are NP-complete, so heuristic approaches are often used. It is assumed that the program or job consists of a number of modules and that the cost of executing a module on a processor and the volume of data flowing between modules are known, which is not necessarily the case; accurate estimation of task execution times and communication delays is difficult. The problem formulation involves assigning modules to processors in an optimal manner within given cost constraints. However, the static assignment does not consider the current state of the system when making the placement decisions. The solution methods include the graph theoretic approach, the 0-1 integer programming approach, and the heuristic approach.

The Graph Theoretic Approach: In this method, modules are represented as nodes in a directed graph. Edges denote data flow between modules, and their weights denote the cost of sending that data. Intra-processor communication costs are assumed to be zero. [Stone (1977)] showed that this problem is similar to that of commodity flow networks, where a commodity flows from a source to a destination and the edge weights represent the maximum flows. A feasible flow through the network has the property that, at every intermediate node,

    sum of flows in = sum of flows out,

where flows into sinks and out of sources are non-negative and flows do not exceed the edge capacities. What is required is to find the flow with the maximum value among all feasible flows; the min-cut algorithm achieves this (see the accompanying figure), where the letters denote the program modules and the weights of the edges between modules and processors (S1 and S2) denote the cost of running those modules on each processor. An optimal assignment can be found by calculating the minimum-weight cutset. However, for an n-processor network we need to determine n cut sets, which is computationally expensive.
An efficient implementation can be achieved if the program has a tree-structured call graph. An algorithm that finds a least-costly path through the assignment graph can execute in time O(mn^2), where m is the number of modules and n is the number of processors [Bokhari, 1981].

[Figure: Example assignment graph for the min-cut approach. Modules A through F are connected to each other and to the two processors S1 and S2 by weighted edges; the minimum-weight cutset yields the optimal assignment.]

The 0-1 Integer Programming Approach: For this approach, the following quantities are defined:

- C_ij: coupling factor, the number of data units transferred from module i to module j.
- d_kl: inter-processor distance, the cost of transferring one data unit from processor k to processor l.
- q_ik: execution cost, the cost of processing module i on processor k.

If modules i and j are resident on processors k and l respectively, then their total communication cost can be expressed as C_ij * d_kl. In addition to these quantities, the assignment variable is defined as X_ik = 1 if module i is assigned to processor k, and X_ik = 0 otherwise. Using the above notation, the total cost of processing a number of user modules is given as:

    total cost = sum over i,k of (q_ik * X_ik) + sum over i,j,k,l of (C_ij * d_kl * X_ik * X_jl)

In this scheme, constraints can easily be added to the problem. For example, memory constraints can be expressed as:

    sum over i of (M_i * X_ik) <= S_k   for each processor k,

where M_i is the memory requirement of module i and S_k is the memory capacity of processor k. Non-linear programming techniques or branch-and-bound techniques can be used to solve this problem. The complexity of such algorithms is NP-complete, and the main disadvantage lies in the need to specify the values of a large number of parameters.

The Heuristic Approach: In the previous approaches, finding an optimal solution is computationally expensive; therefore, heuristic methods are used to perform load balancing. Heuristics are rules of thumb: decisions are based on parameters that have an indirect correlation with system performance, and these parameters should be simple to calculate and to monitor. An example of such a heuristic is to identify a cluster of modules that pass a large volume of data among themselves and to place them on the same processor. This heuristic can be formulated as follows:

1. Find the module pair with the most intermodule communication and assign the pair to one processor if the constraints are satisfied. Repeat this for all possible pairs.
2. Define a distance function between modules that measures the communication between modules i and j relative to the communication between i and all other modules and between j and all other modules. This function is then used to cluster the modules with the highest-valued distance function.

Additional constraints can be added so that execution and communication costs are also considered.

9.3.2 Adaptive / Dynamic Load Balancing

Computer systems and software are developing at a high rate, creating a very diverse community of users and applications with varying requirements. Optimal fixed solutions are no longer feasible in many areas; instead, adaptive and flexible solutions are required. We feel that it is not possible to find, for a wide range of applications, an optimal architecture, an optimal operating system, or an optimal communication protocol. In general, any adaptive or dynamic load balancing algorithm is composed of the following components: the processor load measurement, the system load information exchange policy, the transfer policy, and the cooperation and location policy; an illustrative skeleton of these components is sketched below.
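The skeleton below simply names the four components as methods of a single class; the class and method names are illustrative, and the bodies are placeholders rather than a prescription of any particular algorithm.

# Illustrative skeleton of the four components of a dynamic load balancing algorithm.
class DynamicLoadBalancer:
    def measure_load(self):
        """Processor load measurement, e.g. the run-queue length."""
        raise NotImplementedError

    def exchange_information(self):
        """System load information exchange policy: when and what to tell peers."""
        raise NotImplementedError

    def should_transfer(self, task, local_load):
        """Transfer policy: decide whether this task is worth migrating."""
        raise NotImplementedError

    def select_destination(self, task):
        """Cooperation and location policy: pick the node that receives the task."""
        raise NotImplementedError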
These algorithms differ in the strategy used to implement each of the above components; some require more information than others, or consume fewer computation cycles in the process of making a decision. A distributed environment is highly complex and many factors affect its performance; however, three factors are of utmost importance to load balancing activities:

1. System load: in making load balancing decisions, a node requires knowledge of its local state (load) as well as of the global state of the different processors on the network.

2. Network traffic conditions: the underlying network is a key member of the environment. The state of the network can either be measured by performing experiments or predicted by estimating its state from partial information. Network traffic conditions determine whether more cooperation can be carried out among the different nodes in making the load balancing decisions (which requires more message passing or communication) or whether the load balancing activities should rely more on stochastic, prediction, and learning techniques (which are discussed in Section 9.3.3).

3. Characteristics of the tasks: these characteristics involve the size of the task, which is summarized by its execution and migration times. Tasks can also be CPU bound, I/O bound, or a combination of the two. The task description includes both quantitative parameters (e.g., the number of processors required for a parallel application) and qualitative parameters (e.g., the required level of precision of the result, or some hardware or software requirement). Estimating the task execution time is helpful in making the load balancing decision. This estimate can be supplied by the users based on their knowledge of the task. Profile-based estimates generated by monitoring previous runs of similar applications are also possible. In addition, the run-time program behavior can be approximated at run time by executing those instructions that affect the execution of the program the most and gathering statistics about the behavior. Probabilistic estimates have been used as well. Once the task characteristics are determined, and given knowledge of the state of the other nodes and of the network, the effect of running the task locally or on any other node can be determined.

The components of dynamic load balancing algorithms are discussed below to give some insight into what is required of solutions to the problem at hand.

The processor load measurement: a reasonable indicator of the processor's load is needed. It should reflect our qualitative view of the load, it should be stable (high-frequency fluctuations discarded), it should be computed efficiently since it will be calculated frequently, and it should have a direct correlation with the performance measures used. Many methods designed to estimate the state of the processor have been suggested (e.g., [Bryan81]). The question to be answered is how to measure a processor's load. Some measurements readily available in a UNIX environment include the following:

1. The number of tasks in the run queue.
2. The rate of system calls.
3. The rate of CPU context switching.
4. The amount of free CPU time.
5. The size of the free available memory.
6. The 1-minute load average.

In addition, keyboard activity can be an indication of whether someone is physically logged on to the machine. The question is which of the above is the most appropriate measure of the load; a sketch of collecting some of these indicators follows.
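The sketch below gathers two of the indicators listed above. os.getloadavg() is a standard Python interface to the Unix load averages; the free-memory reading via /proc/meminfo is Linux-specific and shown only as an illustration of where such a value could come from.

# Sketch of collecting a few load indicators on a Unix host.
import os

def sample_load_indicators():
    one_min_load, _, _ = os.getloadavg()       # the 1-minute load average
    free_kb = None
    try:
        with open("/proc/meminfo") as f:        # Linux-only source of free memory
            for line in f:
                if line.startswith("MemFree:"):
                    free_kb = int(line.split()[1])
                    break
    except OSError:
        pass                                    # indicator unavailable on this system
    return {"load_avg_1min": one_min_load, "free_memory_kb": free_kb}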
A weighted function of these measures can be calculated. An obvious deciding factor is the type of tasks residing at a node. For example, if the average execution time of the tasks is much less than one minute, then the one-minute load average is of little importance. Or, if the tasks currently scheduled on a machine cause a lot of memory paging activity, then the size of the free available memory might be a good indication of the load. Ferrari (1985) proposed a linear combination of all the main resource queue lengths as a measure of load. This technique determines the response time of a Unix command in terms of resource queue lengths; the analysis assumes a steady-state system and certain queueing disciplines for the resources. Studies have shown, however, that the number of tasks in the run queue is the best single measure of a node's load, better even than a combination of all the available measures [Kunz91]. Still, as discussed above, one can conceive of situations where other measures make more sense. Bryant and Finkel (1981) used the remaining service time to estimate the response time of a process arriving at a processor (which is then used as an indication of the processor's load) as follows:

    RE(t) = t, where t is the CPU time the process has already received
    R = RE(t_K)
    for all J in J(P) do
        if RE(t_J) < RE(t_K) then R = R + RE(t_J)
        else R = R + RE(t_K)
    RSPE(K, J(P)) = R

where K is the newly arriving process and J(P) is the set of jobs resident on processor P.

Workload characterization is essential for this component. In addition, stability needs to be maintained, so that the cost of load balancing does not outweigh its benefits relative to a system using no load balancing. A means of establishing the age of the information is also essential, so that stale information, which can cause instability, is avoided. A load measure with reduced fluctuations is obtained if the load value is averaged over a period at least as long as the time to migrate an average process [Kru84]. Additional stability is introduced by using the idea of a virtual load, where the virtual load is the sum of the actual load plus the processes that are being migrated to that processor.

The system information exchange policy: this defines the periodicity and the manner of collecting information about the state of the system (e.g., short intervals, long intervals, or when drastic changes occur in the system state); that is, it answers the question of when a processor should communicate its state to the rest of its community. Some heuristics are required to determine the periodicity of information collection, which is a function of the network state and the variance of the load measurement. In situations of heavy traffic, the nodes should refrain from information exchange activity and resort to estimating the system state using local information. So one aspect of this policy involves estimating locally the state of the traffic on the network to determine the level of communication activity that would be reasonable. This can be done by performing experiments, such as sending a probe around the network and measuring the network delay, or by predicting the state of the network using previous information about the traffic conditions. Another aspect is stability, which can be maintained by recognizing transient information, keeping variance information, and averaging over meaningful periods.
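The following is a hedged sketch of the averaging and virtual-load ideas mentioned above; the window length and data structures are illustrative assumptions rather than values suggested by the cited work.

# Sketch of a smoothed load estimate with a virtual-load component.
from collections import deque

class SmoothedLoad:
    def __init__(self, window=10):
        self.samples = deque(maxlen=window)   # average over a "meaningful period"
        self.inbound_migrations = 0           # processes already committed to this node

    def record(self, raw_load):
        self.samples.append(raw_load)

    def virtual_load(self):
        # Virtual load = smoothed actual load + processes in flight toward this node,
        # which damps oscillation when several senders pick the same target.
        avg = sum(self.samples) / len(self.samples) if self.samples else 0.0
        return avg + self.inbound_migrations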
Different types of information can be exchanged. One type is status information, such as busy, idle, or light, normal, and heavy load [Ni85]. This is high-level information from which it is hard to estimate the future state of a node. On the other hand, data representing the instantaneous low-level or short-term load can be exchanged, but this type of information varies at a high rate. Extracted long-term trends, which usually do not fluctuate at a high frequency, are good candidates. Different approaches have been adopted for the state information policy; we summarize some of them:

• The limited approach: when a processor is overloaded, a number of random processors are probed to find a processor to which processes can be offloaded. Simulation results have shown that this approach improves performance for a simple environment of independent, short-lived processes [Eager et al. 1986].

• The pairing approach: in this approach [Bryant and Finkel, 1981], each processor cyclically sends load information to each of its neighbors in order to pair with a processor whose load differs greatly from its own. The load information consists of a list of all local jobs, together with the jobs that can be migrated to the sender of the load message.

• The load vector approach: in this approach, a load vector is maintained that gives the most recently received load value for a limited number of processors. The load balancing decisions are based on the relative difference between a processor's own load and the loads held in the load vector. The load of a processor can be in one of three states: light (L), normal (N), or heavy (H). The load vector of a processor's neighbors is maintained and updated when a state transition occurs: L to N, N to L, N to H, or H to N [Ni, 1985]. To reduce the number of messages sent, on an N-to-L transition the light-load message is sent only if the previous state was heavy. Broadcasting only the N-to-H transitions, and notifying neighbors of an H-to-N transition only when a process migration is negotiated, further reduces the number of messages sent.

• The broadcast approach: in the Maitre d' system [Bershad 1985], a daemon process examines the Unix five-minute load average. If the processor can handle more processes, it broadcasts this availability, and every time the processor's state changes, this information is broadcast. This method improves performance if the number of processors is small. Another alternative is to broadcast a message when the processor becomes idle. This approach works efficiently if the network uses a broadcast communication medium.

• The global system load approach: processors calculate the load on the whole system and adjust their own load relative to this global value. When a processor's load differs significantly from the average load, load balancing is activated; the difference should be neither too small nor too large. In the gradient model algorithm [Lin and Keller, 1987], the global load is viewed in terms of a collection of distances from lightly loaded processors. The proximity W_i of processor i is its minimum distance from a lightly loaded processor:

    W_i = min over K of { d_iK : g_K = 0 }, if some processor K is lightly loaded (g_K = 0)
    W_i = W_max, otherwise

where W_max = D(N) + 1 and D(N) = max { d_i,J : i, J in N } is the diameter of the network. The global load is then represented by a gradient surface G_s = (W_1, W_2, ..., W_n). This method gives a route to a lightly loaded processor with minimum cost.
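A hedged sketch of the proximity calculation described above follows; the graph representation and the relaxation loop are illustrative assumptions, not the distributed update actually used by the gradient model, which propagates values between neighbors incrementally.

# Sketch: computing the gradient surface W_i (distance to the nearest light node).
def gradient_surface(neighbors, lightly_loaded, w_max):
    # neighbors: dict mapping each node to a list of its neighbor nodes
    # lightly_loaded: set of nodes currently reporting a light load (g_K = 0)
    # w_max: saturation value, D(N) + 1 in the text
    w = {node: (0 if node in lightly_loaded else w_max) for node in neighbors}
    changed = True
    while changed:                      # relax until every proximity stabilizes
        changed = False
        for node, nbrs in neighbors.items():
            if node in lightly_loaded:
                continue
            best = min((w.get(m, w_max) + 1 for m in nbrs), default=w_max)
            best = min(best, w_max)     # never exceed the saturation value
            if best < w[node]:
                w[node] = best
                changed = True
    return w

Each iteration only lowers proximity values, so the loop terminates, and the result is the hop distance to the nearest lightly loaded node, capped at w_max.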
The transfer policy: this component deals with deciding under what conditions the migration of a process should be considered, and which processes (large, small, or newly arriving) are the best candidates for migration. Static threshold values have been used, where processes are offloaded to other processors when the load goes beyond a threshold chosen experimentally. On the other hand, an underloaded processor could seek to accept processes from its peers either when it becomes idle or when a normal-to-light load transition occurs. It is common to consider only newly arriving processes; however, other processes might benefit more from migration. [Kreuger and Finkel (1984)] proposed the following considerations:

1. Migration of a blocked process may not prove useful, since it may not affect the local processor load.
2. Extra overhead will be incurred by migrating the currently scheduled process.
3. The process with the best current response ratio can best afford the cost of migration.
4. Smaller processes put less load on the communication network.
5. The process with the highest remaining service time will benefit most in the long term from migration.
6. Processes that communicate frequently with the intended destination processor will reduce the communication load if they migrate.
7. Migrating the most locally demanding process will be of greatest benefit to local load reduction.

Some of the above considerations conflict; for example, a small process (in terms of size and computation time) puts less load on the communication network, while at the same time a large process benefits most in the long term from migration. Resolving these conflicts requires developing rules that incorporate the transfer policy criteria. Maintaining stability is also an important issue under this policy: one problematic scenario involves a task being migrated from one node to another many times. Limiting the number of migrations to a predefined value prevents this, although it can be a conservative approach if a task can in fact afford the additional migrations.

The cooperation and location policy: this involves choosing methods of cooperation between processors to find the best location to which a process should be migrated. The level of cooperation can be adjusted to the network conditions. A reasonable approach is to rely on cooperation and controlled state information exchange in situations of low and medium traffic and load. Under high load and traffic conditions, on the other hand, nodes should refrain from information exchange and rely more on making as many decisions as possible locally, by applying methods of prediction and inference using uncertain or out-of-date information [Pasqu86]. Methods of cooperation include the sender-initiated approach (an overloaded processor initiates the search for a suitable underloaded processor, among its neighbours or across the whole network), the receiver-initiated approach (the reverse of the previous method), and the symmetric approach (which has both sender-initiated and receiver-initiated components). Many studies have been carried out to compare these different schemes. For example, it was found that under low to moderate loading conditions the sender-initiated approach performed better, whereas the receiver-initiated approach performed better under heavy load [Eage85].
One major disadvantage of receiver-initiated algorithms is that they require preemptive task transfers, which are expensive because they usually require saving and communicating the current state of the process. One might note that under low load conditions the difference in performance between the various strategies is not important. In addition, low load conditions allow the freedom to attempt some interesting strategies, for instance running the same job on many idle machines and simply waiting for the fastest response.

1. Sender-initiated approaches: initiating load balancing from an overloaded processor has been widely studied. [Eager (1986)] studied three simple algorithms in which the transfer policy is a simple static threshold:

(a) Choose a destination processor at random for a process migrating from a heavily loaded processor (the number of transfers is limited to one).
(b) Choose a processor at random and then probe its load. If it exceeds a static threshold, another processor is probed, and so on, until one is found within a given number of probes; otherwise, the process is executed locally.
(c) Poll a fixed number of processors, requesting their current queue lengths, and select the one with the shortest queue.

The performance of these algorithms was evaluated using K independent M/M/1 queues to model the no-load-balancing case and an M/M/K queue to model the ideal load balancing case. All of the algorithms provided an improvement in performance; the threshold and shortest-queue policies provided extra improvement when the system load rose beyond 0.5. This study showed that simple policies are adequate for dynamic load balancing.

[Stankovic (1984)] proposed three algorithms based on the relative difference between processor loads, where information is exchanged by periodically broadcasting local load values:

(a) Choose the least loaded processor if the load difference is larger than a given bias: if the difference exceeds bias1, migrate one process; if it exceeds bias2, migrate two processes.
(b) Similar to (a), except that no further migration to that processor is allowed for a given period 2t. Estimating the parameters (bias1, bias2, and 2t) is a difficult problem.

[Kreuger and Finkel, 1984] used a globally agreed average load value: when a processor becomes overloaded, it broadcasts this fact. An underloaded processor responds, indicates the number of processes it can accept, and adjusts its own load by the number of processes it believes it will receive. If the overloaded processor receives no response, it assumes that the average value is too low, increases this global value, and broadcasts it. This algorithm adapts quickly to fluctuations in the system load.

In the gradient model load balancing algorithm [Lin and Keller, 1987], each processor calculates its own local load. If the local load is light, it propagates a pressure of 0 to its neighbors; if the load is moderate, the propagated pressure is set to one greater than the smallest value received from its neighbors. In this algorithm, global load balancing is achieved by a series of local migration decisions: migration occurs only at heavy loads, and processes migrate toward lightly loaded processors.

2. Receiver-initiated approaches: [Eager et al., 1985] studied the following variations:

(a) When the load on a processor falls below some static threshold T, it polls random processors to find one whose load would still exceed T if one of its processes were migrated away.
(b) Similar to (a), but instead of migrating a currently running process, a reservation is made to migrate the next newly arriving process, provided there are no other reservations. Simulation results showed that this does not perform as well as (a).

For broadcast networks, [Livny and Melman, 1982] proposed two receiver-initiated policies:

(a) A node broadcasts a status message when it becomes idle, and the receivers of this message carry out the following actions (where n_i denotes the number of processes executing on processor i):
    i. If n_i > 1, continue to step ii; otherwise terminate the algorithm.
    ii. Wait D/n_i time units, where D is a parameter that depends on the speed of the communication subsystem; by making this wait dependent on the processor's load, more heavily loaded processors respond more quickly.
    iii. Broadcast a reservation message if no other processor has already done so (if one has, terminate the algorithm).
    iv. Wait for a reply.
    v. If the reply is positive and n_i > 1, migrate a process to the idle processor.

(b) The broadcast method above might overload the communication medium, so a second method replaces broadcasting with polling when idle. The following steps are taken when a processor's queue length reaches zero:
    i. Select a random set of R processors (a_1, ..., a_R) and set a counter j = 1.
    ii. Send a message to processor a_j and wait for a reply.
    iii. The reply from a_j will be either a migrating process or an indication that it has no processes to spare.
    iv. If the processor is still idle and j < R, increment j and go to step ii; otherwise stop polling.

3. Symmetrically initiated approach: at low loads, if receiver-initiated policies are used, the load balancing activity is delayed, since the policy is not initiated as soon as a node becomes a sender. Under heavy loads, the sender-initiated policy is ineffective, since resources are wasted trying to find an underloaded processor. A symmetrically initiated approach takes advantage of both policies at their corresponding peak performance and therefore performs well under all load conditions. Unstable conditions might still result under the location policy. Under active load fluctuations, stable adaptive symmetric algorithms are needed, in which system information is used to the fullest to adapt swiftly to the changing conditions and to prevent instability. In addition, a situation might arise where a number of nodes make the same choice of where to send their jobs; this is an example of locally optimal decisions becoming non-optimal when measured on a global scale. It can be avoided in many ways; one is to identify several choices of comparable quality and then pick one of them at random to avoid conflicts.

From the above discussion of the four load balancing components, we can determine some required characteristics of possible solutions. These include the ability to make inferences and decisions under uncertainty; the ability to hypothesize about the state of the system and to perform tests and experiments to verify a hypothesis; the ability to resolve conflicting solutions and judge what is possibly the best solution; the ability to reason about time and to check the lifetime of information; the ability to detect trends and learn from past experience; and the ability to make complex decisions using available and predicted information under stringent time limits.
This leads us to the next section where we discuss probabilistic load balancing. 9.3.3 Probabilistic Load Balancing In performing load balancing, a node makes local decisions which depend on its own local state and the states of the other nodes. In the distributed environment considered, the state information is shared by message passing which cannot be achieved instantaneously. In addition, The state of the nodes are constantly changing. Therefore, acquiring up-to-date state information about all components of the network (global system state) can be expensive in terms of communication overhead and of maintaining stability on the network. A reasonable approach is to rely on cooperation and controlled state information exchange in situations of low and medium trac and load conditions. On the other hand, nodes should refrain from information exchange in conditions of high loads and trac and rely more on making as many decisions as possible locally, by applying methods of prediction and inferencing using uncertain or not up-to-date information. In general, for a typical local area network or cluster of nodes, a large amount of data reecting the state of the network and its components is available. In such systems, procedural programming becomes too cumbersome and too complex since the programmer must foresee every possible combination of inputs and data values in order to program code for these dierent states. The overhead of performing the task of load balancing can be tolerated and the quality of service requirements of the applications can be met, only if the process of decision making and data monitoring is performed at a high rate. Therefore, the decision making process is fairly complex mainly due to the uncertainty about the global, and to the large amounts of data available and the diverse state scenarios that can arise, which require dierent possible load balancing schemes. Analytically examining all the possible solutions is a fairly time consuming task. Probabilistic scheduling where a number of dierent schedules are generated probabilistically have been used Team decision theory can be applied to the load balancing problem, where it is possible to cooperate and to make decisions concerning the best schedules, in collaboration with other similar entities. Uncertainty handling and quantication is at the heart of decentralized decision theory as applied to the load balancing problem. Bayesian probability [Stank81], where a priori probabilities and costs associated with each possible course of action are used to calculate a risk function or a utility function. Minimizing the risk function or maximizing the utility function, results in what is often referred to as a likelihood ratio test where a random variable is compared to a threshold and a decision is made. In Bayesian methods, point probabilities are applied qualify information with condence measures and to determine the most likely state of the system. From the inferred Draft: v.1, April 26, 1994 404 CHAPTER 9. LOAD BALANCING IN DISTRIBUTED SYSTEMS state, a scheduling decision can be made. This approach has its merits, since decision theory and utility theory are based on point probabilities. In addition the probability functions of the random variables needed in the decision making process, can be built by observing the system in real time. 
Other possibilities for representing uncertainty include the intervals of uncertainty used in the Dempster-Shafer theory of belief functions and the linguistic truth values used in fuzzy logic. Once the uncertainty about the system state information is quantified and qualified, parametric models can be built to represent the decision making process. However, there is no single model that can represent such a complex system in all cases. Instead, multiple models need to be constructed, each applied in specific settings. The parameters of those models should adapt to the system state and can be learned by applying neural network techniques. In addition, some information is unstructured and cannot be generalized in such models; heuristics and special-case rules can be applied to express it. These facts have led to research in the use of knowledge-based techniques, including expert systems in which case-specific knowledge is represented in the form of rules. A rule-based approach supports modularity, which adds flexibility when modifying or replacing rules. Another advantage of rule-based programming is that the knowledge and the control strategy used to solve the problem can be explicitly separated in the rules, in contrast to procedural programming, where the control is buried in the code. Some of the characteristics of knowledge-based systems include the following [NASA91]:

1. Continuous operation even when a problem arises. Diagnosis is done in conjunction with continued monitoring and analysis.
2. The ability to make inferences, to form hypotheses from simple knowledge, and to attach degrees of belief to them by testing them.
3. Context focusing, where, depending on the situation at hand, a context is defined to expedite decision making, so that only certain rules and data are considered.
4. Predictability, especially for complex situations where it is difficult to predict the amount of processing required to reach a conclusion. Expert systems therefore provide heuristics to ensure an acceptable, not necessarily optimal, conclusion in a reasonable time.
5. Temporal reasoning, which is the ability to reason with a distinction between the past and the present.
6. Uncertainty handling, where the expert system can validate data and reason with uncertainty.
7. Learning capabilities, where the expert system incorporates mechanisms for learning about the system from historical data and for making decisions based on trends.
8. The ability to cooperate and make decisions in collaboration with other similar entities (team decision theory in distributed artificial intelligence).

9.4 Special Issues/Properties

Load balancing utilities must possess the following properties:

Efficiency: the overhead of decision making, information exchange, and job migration should not exceed the benefits of load balancing.
Stability: as discussed in previous sections, maintaining stability is an important consideration in all of the load balancing components.
Scalability: the overhead of increasing the pool of available processors should be tolerable and should not affect the efficiency of the system.
Configurability: a system administrator should be able to reconfigure the system to the current needs of users and applications.
General purpose: the system should be unrestricted, so that it can serve a diverse community of users and applications.
In addition to the above properties, we will discuss certain issues that are of importance to load balancing.

Heterogeneity: as discussed in [Utopia], heterogeneity takes many forms. One is configurational heterogeneity, where processors differ in processing power and in disk and memory capacity; another is architectural heterogeneity, which prevents code from being executed on a different kind of machine; a third is operating system heterogeneity. Across a computer network, all three types of heterogeneity might exist. A load balancing system should be able to deal with this property and to take advantage of it. Taking advantage of configurational heterogeneity is at the heart of the load balancing activity. Architectural heterogeneity should be made transparent to the user, who should be able to use all the different applications whether or not they are available on her/his machine. Finally, the load balancing system should have a uniform interface with the operating system, irrespective of operating system heterogeneity.

Migration: migration is defined as the relocation of an already executing job (preemptive transfer, as opposed to non-preemptive transfer, where only newly arriving processes are considered for transfer). There is an overhead involved in this process: saving the state of the process (open files, virtual memory contents, machine state, buffer contents, current working directory, process ID, etc.), communicating this state to the new processor, and migrating the actual process and initiating it on the new processor. Many techniques are applied, including entire virtual memory transfers, pre-copying, and lazy copying (copy on reference). So when is migration feasible? This question has been addressed in the literature [Krueger and Livny 1988], and the answer depends largely on the kind of file system involved (shared file system, local secondary storage, and/or replicated files). In general, however, in a loosely coupled system where communication is through message passing, the process state has to be transferred over the network, and in most cases this is a very expensive task.

Decentralized vs. Centralized: an important aspect of the load balancing problem is the choice between centralized and decentralized solutions. Both approaches have been studied in the literature. Decentralized solutions are more popular for the following reasons:

1. An inherent drawback of any centralized scheme is reliability, since a single central agent represents a critical point of failure in the network. With decentralized resource managers, if one fails, only the resources on that specific machine become unavailable. Also, centralized but replicated servers are more expensive to maintain if they are to survive individual crashes.
2. The load balancing problem itself may be a computationally complex task. A centralized approach to optimization ignores the distributed computational power inherent in the network itself and instead utilizes only the computing power of a single central agent.
3. The information required at each step of an application may itself be distributed throughout the system. With a high rate of load balancing activity, a centralized agent will become a bottleneck.

From the above, it is clear that in order to avoid the problems endemic to centralized approaches, we should focus our design efforts on solutions that are as decentralized as possible.
Although decentralized solutions are usually more complex, this complexity can be overcome. A reasonable approach, which makes use of the natural clusters that exist in computer networks, is to build semi-decentralized systems in which one node in each cluster acts as a central manager. This offloads the nodes in the cluster from the tasks of monitoring, information collection, and distribution, while at the same time achieving a relatively high degree of decentralization across the whole network (many clusters).

Predictability: in a distributed system, the task of load balancing is more complicated than in other environments because of the high degree of predictability required in the quality of service provided to users. For one thing, the process of harnessing all the computing power in a distributed environment should be transparent to the user, to whom the system would appear as a single time-shared machine. In addition, the user should not be expected to make many changes, if any, to his/her code in order to use the load balancing facilities. The CPU is considered the most contended resource. Primary memory is another resource to be considered. I/O configurations differ from one system to another, and it is therefore difficult to capture I/O requirements in a load balancing model. A node owner (whoever is working on the console of the node) should not suffer from less access to the resources (CPU and memory) than was available from the autonomous node before load balancing was performed. Conservative approaches can preserve predictability of service by considering a node available for foreign processes (those executing at a node but not initiated by the owner of that node) only if no one has been logged on at the console for a period of time. Under such a policy, any foreign process that is executing when an owner begins using a node is automatically preempted and transferred from the node to a holding area, from which it is transferred to an idle node once one is located. This conservative approach is costly and is not efficient in making use of the computing capacity of the distributed system. The goal here is to allocate local resources so that processes belonging to the owner of the node get whatever resources they need, and foreign processes get whatever is left. Therefore, a more flexible approach to preserving the predictability of service is to prioritize resource allocation locally on the owner's node by performing local scheduling for the CPU and possibly the main memory. Such schemes have been studied in [The Stealth] and were shown to be very effective at insulating owner processes from the effects of foreign processes; at the same time, they make better use of nearly all unused capacity and reduce overhead by avoiding unnecessary preemptive transfers. Local scheduling is necessary for the efficient preservation of the predictability of service and for better use of unused capacity. Very little work has been done on local scheduling in which foreign and local processes are distinguished from one another. The local scheduling techniques developed for time-sharing uniprocessor systems can be applied in this area. Mechanisms for controlling the priority of a process dynamically will depend on factors such as whether it is executing on a local or a remote site or whether it is part of a parallel synchronous application.
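A minimal sketch of such local scheduling on a Unix-like host is shown below; it simply pushes foreign processes to the least favourable priority whenever the owner is active. The hooks foreign_pids and owner_is_active are hypothetical interfaces into the load balancing runtime, not part of any real system, and the nice values are illustrative.

import os

LOW_PRIORITY = 19     # least favourable nice value on most Unix systems

def reprioritize(foreign_pids, owner_is_active):
    """Insulate owner processes: foreign work only consumes leftover CPU cycles."""
    nice_value = LOW_PRIORITY if owner_is_active() else 5
    for pid in foreign_pids():
        try:
            os.setpriority(os.PRIO_PROCESS, pid, nice_value)
        except ProcessLookupError:
            pass          # the foreign process finished or was migrated away

Calling reprioritize periodically (or on console activity events) approximates the prioritized local scheduling described above without resorting to full preemptive eviction.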
9.4.1 Knowledge-based Load Balancing

In the previous section we discussed the characteristics of knowledge-based systems, or expert systems, which offer powerful solutions to the difficulties encountered in performing the load balancing tasks. In this subsection, we discuss some of our research ideas. We are investigating rule-based techniques for developing an expert system environment in which heuristic knowledge and real-time decision making are used to manage complex and dynamic systems in a decentralized fashion. An expert system manager integrates, in a knowledge base, all the different schemes available for implementing the different load balancing components. By using the state information contained in the data base (operating systems are full of readily available statistics about the state of the system's resources, queues, etc.), decisions about the best possible schemes can be inferred in an adaptive and dynamic manner. The selected schemes become the building blocks of the current load balancing strategy. In general, we would like our environment to support the following: (1) monitoring the state of the system, (2) constructing load balancing schemes dynamically from already existing parts, (3) mapping the load balancing schemes onto the most efficient and available architectures, and (4) performing diagnosis on these schemes and initiating any necessary corrective measures.

Having determined the features required of our environment, it is now possible to design an architecture with the following main components (see Figure):

• The data base, which contains information about the distributed system in the form of facts.
• The knowledge base, which holds the rules that implement the four components of dynamic load balancing schemes.
• The load balancer, which contains the system control.
• The learner, which observes past trends and formulates new facts and hypotheses about the system; it can also perform experiments to test them.
• The predictor, which uses state information contained in the data base and estimates current states using analytical models of the system. In addition, neural networks can be used to perform predictions. The model or neural network parameters can be determined experimentally and can be changed dynamically to adapt to varying conditions.
• The critic, which is developed to improve the collected knowledge and to doubt, monitor, and repair expert judgment about the system.
• The explanation facility, which explains the reasoning of the system, and the knowledge acquisition facility, which provides an automatic way to enter knowledge into the system. These facilities are accessed, for example by a system administrator, via the external interface.

The main modes of functionality of our environment are load balancing scheme composition and execution. In the composition phase, the system state parameters are obtained from the expert system data base. In addition, an application description is passed to the expert manager for each application or job. This description includes both quantitative and qualitative parameters. Some applications might require a specific algorithm that is requested by an established name; our system should be able to synthesize this algorithm using libraries that describe it.
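To make the rule-based composition idea concrete, the following toy sketch (hypothetical rule, scheme, and state names; not the actual implementation of the environment described here) fires the first matching rule for each component against the current system state to pick one building block per component.

RULES = {
    # component: list of (condition on system state, scheme to select)
    "info_exchange": [
        (lambda s: s["network_util"] > 0.8, "on_demand_polling"),
        (lambda s: True,                    "periodic_broadcast"),
    ],
    "transfer": [
        (lambda s: s["local_queue"] > s["avg_queue"] + 2, "sender_initiated"),
        (lambda s: True,                                  "receiver_initiated"),
    ],
    "location": [
        (lambda s: s["cluster_size"] <= 8, "poll_neighbours"),
        (lambda s: True,                   "central_cluster_manager"),
    ],
}

def compose_strategy(state):
    """The selected schemes are the building blocks of the current strategy."""
    return {component: next(scheme for cond, scheme in rules if cond(state))
            for component, rules in RULES.items()}

print(compose_strategy({"network_util": 0.3, "local_queue": 6,
                        "avg_queue": 2.5, "cluster_size": 16}))

In a full environment the conditions would themselves be rules in the knowledge base and the state would come from the data base and the predictor, but the principle is the same: the control strategy lives in the rules, not in procedural code.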
Using the application characterization and the local and global system state descriptions, the different load balancing components are reconfigured by establishing their respective thresholds and the levels of their functionality. Our goal is to devise methods that select dynamically, at run time, the best scheme for implementing each component according to the application requirements and the system state information. Each selected scheme becomes a building block from which a final load balancing strategy that suits the application and the network state is composed (see Figure). As described before, these policies include the method of measuring the load (how), the periodicity of information collection (when), the type of processes or applications to migrate (which), and how to cooperate and choose the execution locations (where). The policies adapt to the environment by firing the appropriate rules for each of the above components. Some of these rules are as simple as initiating a state broadcast to other nodes if the node decides it needs to and the network conditions allow it. Other rules are more complex and are designed to resolve conflict by attaching weights or priorities to the conflicting solutions, where the weights reflect the dynamic situation on the network.

In the execution phase, the resulting composed or requested scheme is executed. In addition, a job manager process is started with each job or metajob to monitor the execution and report the results to the application initiator. This process ends with the completion of the application.

Figure: the main components of the knowledge-based load balancing environment — data base, knowledge base, load optimizer/dispatcher, learner, predictor, critic, explanation facility, knowledge acquisition facility, and external interface — together with the node-level load measurement, transfer, location, and information-exchange policies.

Figure: composition of successive load balancing schemes over time from building blocks S_1, S_2, S_3 of each component (load measurement, information exchange policy, transfer policy, and cooperation policy).

9.5 Related Work

9.5.1 J. Pasqual's Model (theoretical state space model)

In this work a decentralized control system is modeled as a directed graph, with nodes representing agents and links representing inter-agent influences. The model of an agent Ai is given by an 8-tuple consisting of Ai's state, Ai's generated work and its distribution, Ai's transferred work, Ai's global state information, the influence of the other agents on Ai (the work transferred from them to Ai and their contribution to Ai's global state information), Ai's decision space, Ai's decision rules, and Ai's means of establishing the next state (the state transition). These components will be defined more rigorously as needed in our discussion. For a load balancing problem, the decision space D = {d1, d2, ..., dK} (assuming for simplicity that each agent has the same decision space) contains such decisions as transfer a job or not, communicate information or not, request information or not, and so on.
In general, it is desirable to make decisions that minimize a general stepwise loss function L(t), which consists of loss due to decision quality degradation (due to aging of information, for instance), loss due to communication overhead, loss due to time spent evaluating the decision rule, and loss due to random effects arising from the stochastic nature of the system. The low-level state information Xi is much too large, its level of detail is too cumbersome to deal with (store, communicate, etc.), and that detail is unnecessary. Therefore, an agent uses an indicator I(Xi), which is a readily accessible portion of the low-level state, such as the value of a single memory location in a computer (like the CPU queue length), or a small set of instructions that computes a value. Hence the abstract state Yi can be defined as a function of this indicator (an average queue length, for instance). Indicator values that change rapidly would not work well. However, a common scenario is that the indicator values fluctuate rapidly about a more slowly changing fundamental variation. If a sampled sequence of the indicator values is treated as a time series, the high-frequency components can be filtered out, leaving only the fundamental components, by using moving-average or autoregressive techniques (short-term averaging). These sampled, filtered, and rounded values are called Bi (a discrete version of Yi). Bi is bounded above by Bmax because of the finiteness of the memory of real machines. Homogeneous machines are assumed, so they all have the same bound. It is also assumed that the load does not change in a continuous fashion; rather, it remains constant at some load level for an unpredictable interval of time, after which it changes to a new load level. By taking a moving average of Bi over a certain number of periods, a long-term average, or load level Li, is obtained. A measure of the degree of variability Vi of Bi about Li is calculated by taking the moving average of the absolute difference between the two over a certain number of past load values.

The model has the Markovian property: the probability distribution of the next state, given the past states and the current state, depends only on the value of the current state, not on those of past states. A first-order Markov model can be conveniently represented as a matrix of one-step transition probabilities P, with the n-step transition matrix given by P^n. The model should allow an agent Ai to predict the possible states of agent Aj based on Ai's most recent reception of information about Aj. Agents exchange their load levels and degrees of variability, which are more valuable than simply the value of Bi, as that value may change very quickly. The state transition model is given by

p(Bj(n) = β | Lj(n - k) = λ, Vj(n - k) = v) = [Pv^k]λβ

where [Pv^k]λβ is the element in row λ and column β of the k-step state transition probability matrix Pv^k = (Pv)^k. Pv is the one-step transition matrix of size Bmax × Bmax, in which the probability of remaining in the same state after one transition is αv (except for states 0 and Bmax, for which this probability is (1 + αv)/2); αv is a decreasing function of Vj (hence the subscript v) and is determined experimentally. This model is called the steady-state model, in that in making any predictions it assumes that the remote agent's load has remained at the same load level since the reception of the load level information.
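A minimal sketch of the filtering and prediction steps just described is given below: short-term averaging of an indicator to obtain B, longer-term averaging to obtain the load level L and the variability V, and a k-step prediction using a matrix power of an assumed one-step transition matrix. The matrix values are illustrative only (note that the boundary states stay put with probability (1 + 0.6)/2 = 0.8, consistent with the rule above); in the model the matrix would be built from αv.

import numpy as np

def filter_indicator(samples, short_win=4, long_win=16):
    s = np.asarray(samples, dtype=float)
    B = np.round(np.convolve(s, np.ones(short_win) / short_win, mode="valid"))  # discrete B
    L = B[-long_win:].mean()                                                    # load level L
    V = np.abs(B[-long_win:] - L).mean()                                        # variability V
    return B, L, V

def predict_state_distribution(P_v, reported_level, k):
    """Distribution of B_j(n) given the level reported k steps ago (steady-state model)."""
    Pk = np.linalg.matrix_power(P_v, k)
    return Pk[reported_level]          # row of the k-step transition matrix

# Example: 4 load states; interior states stay with probability 0.6, boundaries with 0.8.
P = np.array([[0.8, 0.2, 0.0, 0.0],
              [0.2, 0.6, 0.2, 0.0],
              [0.0, 0.2, 0.6, 0.2],
              [0.0, 0.0, 0.2, 0.8]])
print(predict_state_distribution(P, reported_level=1, k=3))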
The model was validated using simulation. The parameters used were determined experimentally by observing a distributed system. The results show improvements in job response time of up to 67%.

9.5.2 Dynamic Load Balancing Algorithms

The dynamic load balancing algorithm investigated in this subsection was proposed by Anna Hac and Theodore J. Johnson [32]. The algorithm operates in a distributed system consisting of several hosts connected by a local area network, whose file system is modeled on the LOCUS distributed system. This file system allows replication of files on many hosts. A multiple-reader, single-writer synchronization policy with a centralized synchronization site is provided to update remote copies of files. The algorithm uses information collected in the system for optimal process and read-site placement. The model of a host and of a distributed system with N hosts is shown in Figure 9.2. The system is modeled as an open queueing network consisting of several interconnected queueing servers. Each CPU uses a round-robin scheduling algorithm, and the disks and the network service tasks as they arrive. The service time distribution for the CPU and the disks is uniform. The algorithm bases its process placement decisions on information about the distribution of work in the system. Periodically, a token is circulated through the system, collecting workload measurements on every host and distributing these measurements to every host. The measurements it collects are the queue length of each server, the percentage of utilization, and the number of jobs using each resource. When a request for execution of a job arrives in the distributed system, the execution site is selected to balance the load on all hosts. If a remote host is selected as the execution site, a message is sent to the remote host telling it to start the job; otherwise, the job is started on the local host.

The Workload Model

For the simulation used to determine the effectiveness of the load balancing algorithm, three types of workloads were used to cover the range of system service requirements of a real system. The three workload types simulated were I/O bound, CPU bound, and a combination of CPU and I/O bound. There were eight process types in the simulation. Each process utilized some portion of the CPU, disk, and network. Some of the processes were, like the workload models, CPU bound, I/O bound, or some combination of both. The system was simulated by determining the sequence of servers that each job type would visit and specifying the servers needed to accommodate those chains. The algorithm that chooses the execution site and the read site uses vectors of workloads and host characteristics for each host. For each possible selection site, a vector is constructed; the vectors are constructed such that the longer the vector, the worse the selection choice, and the host with the shortest vector is chosen. All hosts are included in the selection of an execution site. When choosing a selection site for the process, the load balancing algorithm considers the workload characterization and seeks to minimize system work. The workload characterization is determined from the system data collected by the token as it is passed through the system. For process placement, the load balancing algorithm uses CPU utilization information from each processor.
There are three considerations for process placement used by the load balancing algorithm: "is the host being considered for job placement the host that requested the job?", "is the job accessing a file, and is the file stored at the host being considered?", and "is the job interactive, and is the terminal at the host being considered?". All these considerations go towards determining the optimum process placement. A weight is associated with each parameter (CPU utilization and each of the three process placement considerations) when calculating the optimal site placement; some characteristics weigh more heavily than others in determining the optimal process placement [32].

Figure 9.2: The model of a host (a) and of a distributed system (b).

When choosing the read-site placement, the load balancing algorithm likewise considers the workload characterization and seeks to minimize system work, in a manner similar to the selection of the execution site. The workload characteristics for read-site placement are the disk queue length, the disk utilization, and the number of jobs accessing a file on the disk. The only work-minimization characteristic used is "is the needed file stored locally?". The dimensions of the vectors correspond to factors that indicate the optimality of that site for selection. These dimensions are scaled with weights used to tune the algorithm, in order to reflect the relative importance of the dimensions and to allow for differences in the ranges of the measurements. The tunable vector dimensions make it possible to explore the factors and their significance in the process placement decision. Given all the information collected about the system and the availability and utilization of local resources at each processor, the host selection algorithm is as follows (with each w(h) initialized to zero) [32]:

1. for every host h being considered as a placement choice:
2.   for every workload characteristic being considered:
3.     w(h) = w(h) + (weight for workload characteristic) × (workload characteristic)^2
4.   for every work minimization characteristic being considered:
5.     if the host being considered does not meet the work minimization condition, then
6.       w(h) = w(h) + (weight for work minimization condition)^2
7. choose the host k such that w(k) = min over all h of w(h)

Given the tunable nature of this algorithm, the authors experimented heavily to determine the best parameters for a balanced load.

An Adaptive Distributed Load Balancing Model

A multicomputer is represented by n processor nodes Ni, 0 ≤ i < n, interconnected by a network characterized by a distance matrix D = [dij], where dij is the number of hops between nodes Ni and Nj. It is assumed that dii = 0 and dij = dji for all i and j. The immediate neighborhood of node Ni is defined by the subset Γi = {Nj | dij = 1}. An adaptive model of distributed load balancing is shown in Figure 9.3. The host processor is connected to all nodes. The load index li of Ni is passed from each node Ni to the host, and the system load distribution L = {li | 0 ≤ i < n} is broadcast to all nodes on a periodic basis.
All nodes maintain their own load balancing operations independently and report their load indices to the host on a regular basis. At each node Ni, a sender-initiated load balancing method is used, in which heavily loaded nodes initiate process migration. The sender-initiated method has the advantage of faster process migration: migration starts as soon as the load index of a processor node exceeds a certain threshold, which is updated periodically according to the variation of the system load distribution. The distributed load balancing at each node Ni is represented by the queueing model shown in Figure 9.4.

Figure 9.3: An adaptive load balancing model for a multicomputer with n processor nodes and a host processor.

Figure 9.4: A queueing model for the dynamic load balancing at each distributed node.

Figure 9.5: An example of the open network load balancing model (D: decision maker, S: server).

Heuristic Process Migration Methods

Four heuristic methods for migrating processes are introduced below. These heuristics are used to invoke process migration, to update the threshold, and to choose the destination nodes for process migration. The methods are based on the load distribution Lt, which varies over time. Two attributes (decision range and process migration heuristic) are used to distinguish the four process migration methods; a sketch of the destination-selection step of all four methods follows the list.

1. Localized Round-Robin (LRR) Method. Each node Ni uses the average load among its immediate neighboring nodes to update the threshold and only migrates processes to the neighboring nodes. A round-robin discipline is used to select a candidate node for process migration.
2. Global Round-Robin (GRR) Method. Each node Ni uses a globally determined threshold and migrates processes to any appropriate node in the system. The selection from the candidate list for process migration is based on a round-robin discipline. After receiving the load distribution Lt from the host, the global threshold is set to the system average load over all the nodes.
3. Localized Minimum Load (LML) Method. The way to determine the threshold and to set up migration ports is the same as in LRR. The difference between LML and LRR is in the policy for selecting a destination node. At node Ni there is a load table storing the load indices of its immediate neighbors, and the node with the minimum load index in the load table is chosen as the destination node. After a process is migrated to the selected node, its load index in the load table is incremented accordingly.
4. Global Minimum Load (GML) Method. The setup of the threshold and migration ports is the same as in GRR, but the destination node is selected according to the LML method; that is, the node with the minimum load index in the global load table is chosen as the destination node.

These heuristic load balancing methods perform well compared with no load balancing. The relative merits of the four methods are discussed below.
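As promised above, here is a minimal sketch of the destination-selection step of the four heuristics, using hypothetical data structures: loads maps node identifiers to load indices, and neighbours corresponds to Γi. (After a migration, LML/GML would also increment the chosen node's entry in the load table.)

from itertools import cycle

def make_selector(method, self_id, neighbours, all_nodes):
    local = [n for n in neighbours if n != self_id]
    remote = [n for n in all_nodes if n != self_id]
    rr_local, rr_global = cycle(local), cycle(remote)

    def select(loads):
        if method == "LRR":       # round-robin over immediate neighbours
            return next(rr_local)
        if method == "GRR":       # round-robin over all other nodes
            return next(rr_global)
        if method == "LML":       # minimum load among immediate neighbours
            return min(local, key=lambda n: loads[n])
        if method == "GML":       # minimum load over the whole system
            return min(remote, key=lambda n: loads[n])
        raise ValueError(method)
    return select

select = make_selector("LML", self_id=0, neighbours=[1, 2], all_nodes=range(4))
print(select({1: 3, 2: 1, 3: 0}))   # -> 2, the least-loaded immediate neighbour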
The LRR and LML methods are based on locality and have a short migration distance, since migration is confined to immediate neighbors. The GRR and GML methods are based on global state and may experience much longer migration distances. GRR and GML give better performance when the mean service time is long, say around one second, while LRR and LML are better when the mean service time is small. The communication and migration overhead does not affect performance much when the mean service time is large; it does, however, when the service time is small. Thus, a system with a long mean service time can use the global methods, while a system with short service-time jobs should use the local methods.

9.6 Existing Load Balancing Systems

9.6.1 Utopia or Load Sharing Facility (LSF) (experimental/practical system)

LSF is a general purpose, dynamic, efficient, and scalable load sharing system in which heterogeneous hosts are harnessed together into a single system that makes the best use of the available resources. The main properties of this system include its transparency to the user: the user does not suffer from the fact that his/her resources are being shared, and does not have to change existing programs to be able to use LSF or to use a resource that is not available on the user's own machine.

Hierarchies of clusters are managed by master load information managers, each of which collects information and makes it available to the hosts in its cluster. In addition, a load information manager (LIM) and a remote execution server (RES) reside on each machine and provide the user with a uniform, machine-independent interface to the operating system. LSF supports running sequential or parallel jobs, which may be either interactive or batch jobs. The load sharing tool (lstool) allows users to write their own load sharing applications as shell scripts. In addition, LSF contains an application load sharing library (LSLIB) that users can use in developing their compiled programs and applications. LSF allows new distributed applications to be developed in which users find the best machines for their jobs using the comprehensive load and resource information made available through LSF, or leave it to LSF to perform the load balancing automatically and transparently for them. In addition, configuration parameters are available to system administrators to control the LSF environment. The system has been used commercially and has proven to be very efficient and useful.

V-system

The V system uses a state-change-driven information policy. Each node broadcasts its state whenever its state changes significantly. State information consists of expected CPU and memory utilization and particulars about the machine itself, such as its processor type. The broadcast state information is cached by all the nodes. The V system's selection policy selects only newly arrived tasks for transfer. Its relative transfer policy defines a node as a receiver if it is one of the M most lightly loaded nodes in the system, and as a sender if it is not. The decentralized location policy locates receivers as follows: when a task arrives, the node consults the local cache and constructs the set containing the M most lightly loaded machines that can satisfy the task's requirements. If the local machine is one of these M machines, the task is scheduled locally. Otherwise a machine is chosen randomly from the set and is polled to verify the correctness of the cached data. This random selection reduces the chance that multiple machines will select the same remote machine for task execution.
If the cached data matches the machine's state (within a degree of accuracy), the polled machine is selected for executing the task. Otherwise, the entry for the polled machine is updated and the selection procedure is repeated. In practice, the cache entries are quite accurate, and more than three polls are seldom required. The V system's load index is the CPU utilization at a node. To measure CPU utilization, a background process that periodically increments a counter is run at the lowest possible priority; the counter is then polled to see what proportion of the CPU has been idle.

Sprite

The Sprite system is targeted towards a workstation environment. Sprite uses a centralized, state-change-driven information policy: each workstation, on becoming a receiver, notifies a central coordinator process. The location policy is also centralized: to locate a receiver, a workstation contacts the central coordinator process. Sprite's selection policy is primarily manual. Tasks must be chosen by users for remote execution, and the workstation on which these tasks reside is identified as a sender. Since the Sprite system is targeted towards an environment in which workstations are individually owned, it must guarantee the availability of the workstation's resources to the workstation owner. Consequently, it evicts foreign tasks from a workstation whenever the owner wishes to use the workstation. During eviction, the selection policy is automatic, and Sprite selects only foreign tasks for eviction. The evicted tasks are returned to their home workstations. In keeping with its selection policy, the transfer policy used in Sprite is not completely automated; it is as follows:

1. A workstation is automatically identified as a sender only when foreign tasks executing at that workstation must be evicted; for normal transfers, a node is identified as a sender manually and implicitly when the transfer is requested.
2. Workstations are identified as receivers only for transfers of tasks chosen by the users. A threshold policy decides that a workstation is a receiver when the workstation has had no keyboard or mouse input for at least thirty seconds and the number of active tasks is less than the number of processors at the workstation.

To promote fair allocation of computing resources, Sprite can evict a foreign process from a workstation to allow the workstation to be allocated to another foreign process under the following conditions: if the central coordinator cannot find an idle workstation for a remote execution request and it finds that a user has been allocated more than his fair share of workstations, then one of that heavy user's processes is evicted from a workstation. The evicted process may be transferred elsewhere if an idle workstation becomes available. For a parallelized version of the UNIX "make" utility, Sprite's designers have observed a speedup factor of five for a system containing 12 workstations.

Condor

Condor is concerned only with scheduling long-running, CPU-intensive (background) tasks. Condor is designed for a workstation environment in which the total availability of a workstation's resources is guaranteed to the user logged in at the workstation console (the owner). Condor's selection and transfer policies are similar to Sprite's in that most transfers are manually initiated by users.
Unlike Sprite, however, Condor is centralized, with one workstation designated as the controller. To transfer a task, a user links it with a special system-call library and places it in a local queue of background jobs. The controller's duty is to find idle workstations for these tasks. To accomplish this, Condor uses a periodic information policy: the controller polls each workstation at two-minute intervals. A workstation is considered idle only when the owner has not been active for at least 12.5 minutes. The controller queues information about background tasks, and if it finds an idle workstation it transfers a background task to that workstation. If a foreign background task is being served at a workstation, a local scheduler at that workstation checks for local activity from the owner every 30 seconds; if the owner has been active since the previous check, the local scheduler preempts the foreign task and saves its state. The task may be transferred later to an idle workstation if one is located by the controller. Condor's scheduling scheme provides fair access to computing resources for both heavy and light users. Fair allocation is managed by the "up-down" algorithm, under which the controller maintains an index for each workstation. Initially the indices are set to zero. They are updated periodically in the following manner: whenever a task submitted by a workstation is assigned to an idle workstation, the index of the submitting workstation is increased; if, on the other hand, the task is not assigned to an idle workstation, the index is decreased. The controller periodically checks to see whether any new foreign task is waiting for an idle workstation. If a task is waiting, but no idle workstation is available and some foreign task from the lowest-priority workstation is running, then that foreign task is preempted and the freed workstation is assigned to the new foreign task. The preempted foreign task is transferred back to the workstation at which it originated.

Bibliography

[1] Barak A. and Shiloh A. A distributed load balancing policy for a multicomputer. Software-Practice and Experience, 15(09):901-913, Sep 1985.
[2] Tantawi A.N. and Towsley D. Optimal static load balancing on distributed computing systems. IEEE Transactions on Software Engineering, 17(2):133-140, Feb 1991.
[3] Rommel C.G. The probability of load balancing success in a homogeneous network. IEEE Transactions on Software Engineering, 17(09):922-933, Sep 1991.
[4] Eager D.L., Lazowska E.D., and Zahorjan J. Adaptive load sharing in homogeneous distributed systems. IEEE Transactions on Software Engineering, 12(05):662-675, May 1986.
[5] Chao-Wei Ou, Sanjay Ranka, and Geoffrey Fox. Fast mapping and remapping algorithm for irregular and adaptive problems. Technical report, July 1993.
[6] A. Pothen, H. Simon, and K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal. Appl., 11(3):430-452, July 1990.
[7] B. Hendrickson and R. Leland. An Improved Spectral Graph Partitioning Algorithm for Mapping Parallel Computations. Sandia National Labs, SAND92-1460, 1992.
[8] Nashat Mansour. Physical Optimization Algorithms for Mapping Data to Distributed-Memory Multiprocessors. PhD Thesis.
[9] Mansour, N. and Fox, G.C. A hybrid genetic algorithm for task allocation. Proc. Int. Conf. Genetic Algorithms, pages 466-473, July 1991.
[10] R. D. Williams. Performance of dynamic load balancing algorithms for unstructured mesh calculations. Concurrency: Practice and Experience, 3(5):457-481, 1991.
[11] Rahul Bhargava, Virinder Singh, and Sanjay Ranka. A Modified Mean Field Annealing Algorithm for Task Graph Partitioning. Technical Report, under preparation.
[12] Douglis F. and Ousterhout J. Transparent process migration: design alternatives and the Sprite implementation. Software-Practice and Experience, 21(08):757-785, Aug 1991.
[13] Livny M. and Melman M. Load balancing in homogeneous broadcast distributed systems. Proceedings of the ACM Computer Network Performance Symposium, 11(01):47-55, Apr 1982.
[14] Kreimien O. and Kramer J. Methodical analysis of adaptive load sharing algorithms. IEEE Transactions on Parallel and Distributed Systems, 3(06):747-760, Nov 1992.
[15] Krueger P. and Livny M. The diverse objectives of distributed scheduling policies. Proc. of Seventh Int'l Conf. on Distributed Computing Systems, pages 242-249, 1987.
[16] Krueger P. and Finkel R. An adaptive load balancing algorithm for a multicomputer. CS Dept., University of Wisconsin-Madison, Technical Report 539, Apr 1984.
[17] Zhou S. and Ferrari D. A trace-driven simulation study of dynamic load balancing. IEEE Transactions on Software Engineering, 14(09):1327-1341, Sep 1988.
[18] Bokhari S.H. Dual processor scheduling with dynamic reassignment. IEEE Transactions on Software Engineering, 05(07):341-349, July 1979.
[19] Krueger P., Shivratri N., and Singhal M. Load distributing for locally distributed systems. IEEE Computer, pages 33-44, Dec 1992.
[20] Kunz T. The influence of different workload descriptions on a heuristic load balancing scheme. IEEE Transactions on Software Engineering, 17(07):725-730, July 1991.
[21] Casavant T.L. and Kuhl J.G. Effects of response and stability on scheduling in distributed computing systems. IEEE Transactions on Software Engineering, 14(11):1578-1587, Nov 1988.
[22] Casavant T.L. and Kuhl J.G. A taxonomy of scheduling in general-purpose distributed computing systems. IEEE Transactions on Software Engineering, 14(02):141-154, Feb 1988.
[23] C. Gary Rommel, "The Probability of Load Balancing Success in a Homogeneous Network", IEEE Transactions on Software Engineering, vol. 17, pp. 922-933.
[24] K.W. Ross and D.D. Yao, "Optimal Load Balancing and Scheduling in a Distributed Computer System", Journal of the Association for Computing Machinery, vol. 38, pp. 676.
[25] R. K. Boel and J.H. Van Schuppen, "Distributed Routing for Load Balancing", Proceedings of the IEEE, vol. 77, pp. 212.
[26] A. Hac and T. J. Johnson, "Dynamic Load Balancing Through Process and Read-Site Placement in a Distributed System", AT&T Technical Journal, Sept/Oct 1988, pp. 72.
[27] M. J. Berger and S. A. Bokhari, "A Partitioning Strategy for Non-uniform Problems on Multiprocessors", IEEE Transactions on Computers, C-36, pp. 570-580, 1987.
[28] D. P. Bertsekas and J. N. Tsitsiklis, "Parallel and Distributed Algorithms", Prentice-Hall, Englewood Cliffs, NJ, 1989.
[29] Timothy Chou and Jacob Abraham, "Distributed Control of Computer Systems", IEEE Transactions on Computers, pp. 564-567, June 1986.
[30] George Cybenko, "Dynamic Load Balancing for Distributed Memory Multiprocessors", Journal of Parallel and Distributed Computing, pp. 279-301, October 1989.
[31] Kemal Efe and Bojan Groslej, "Minimizing Control Overheads in Adaptive Load Sharing", IEEE International Conference on Distributed Computing Systems, pp. 307-315, 1989.
[32] Anna Hac and Theodore Johnson, "A Study of Dynamic Load Balancing in a Distributed System", ACM, pp. 348-356, February 1986.
[33] A. J. Harget and I. D. Johnson, "Load Balancing Algorithms in Loosely-Coupled Distributed Systems: a Survey", Distributed Computer Systems, pp. 85-107.
[34] F. Lin and R. Keller, "Gradient Model: A Demand-driven Load Balancing Algorithm", IEEE Proceedings of the 6th Conference on Distributed Computing, pp. 329-336, August 1986.
[35] Lionel Ni and Kai Hwang, "Optimal Load Balancing in a Multiple Processor System with Many Job Classes", IEEE Transactions on Software Engineering, pp. 491-496, May 1985.
[36] E. de Souza e Silva and M. Gerla, "Load Balancing in Distributed Systems with Multiple Classes and Site Constraints", Proc. Performance, pp. 17-34, 1984.
[37] Yang-Terng Wang and Robert Morris, "Load Sharing in Distributed Systems", IEEE Transactions on Computers, pp. 204-217, March 1985.
[38] A. N. Tantawi and D. Towsley, "Optimal Static Load Balancing in Distributed Computer Systems", Journal of the ACM, 32(2):445-465.
[39] Jian Xu and Kai Hwang, "Heuristic Methods for Dynamic Load Balancing in a Message-Passing Supercomputer", IEEE, pp. 888-897, May 1990.
[40] J. Xu and K. Hwang, "Heuristic Methods for Dynamic Load Balancing in a Message-Passing Supercomputer".
[41] G. Skinner, J.M. Wrabetz, and L. Schreier, "Resource Management in a Distributed Internetwork Environment".
[42] A. Hac and T. J. Johnson, "A Study of Dynamic Load Balancing in a Distributed System".
[43] T.L. Casavant and J. G. Kuhl, "A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems".

Chapter 6 Distributed File Systems

Chapter Objectives

A file system is a subsystem of an operating system whose purpose is to organize, retrieve, store, and allow sharing of data files. A distributed file system is a distributed implementation of the classical time-sharing model of a file system, in which multiple users who are geographically dispersed share files and storage resources. Accordingly, the file service activity in a distributed system has to be carried out across the network, and instead of a single centralized data repository there are multiple and independent storage devices. The objectives of this chapter are to study the design issues and the different implementations of distributed file systems. In addition, we give an overview of the architecture and implementation techniques of some well known distributed file systems such as the SUN Network File System (NFS), the Andrew File System (AFS), and Coda.

Keywords: NFS, AFS, CACHE, FILE CACHING, TRANSPARENCY, CONCURRENCY CONTROL, LOCUS, ETC.

6.1 Introduction

The file system is the part of an operating system that provides the computing system with the ability to permanently store, retrieve, share, and manipulate stored data. In addition, the file system might provide other important features such as automatic backup and recovery, user mobility, and support for diskless workstations. The file system can be viewed as a system that provides users (clients) with a set of services. A service is a software entity running on a single machine [levy, 1990]. A server is the machine that runs the service. Consequently, the file system service is accessed by clients or users through a well-defined set of file operations (e.g., create, delete, read, and write). The server is the computer system, together with its storage devices (disks and tapes), on which files are stored and from which they are retrieved according to client requests. The UNIX time-sharing file system is an example of a conventional centralized file system.
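A minimal sketch of such a well-defined set of file operations is shown below as an abstract interface (illustrative names only, not the API of any particular system). A distributed file system exposes the same interface to its clients; only the implementation behind it is spread over networked servers.

from abc import ABC, abstractmethod

class FileService(ABC):
    @abstractmethod
    def create(self, name: str) -> str: ...                               # returns a file identifier
    @abstractmethod
    def delete(self, file_id: str) -> None: ...
    @abstractmethod
    def read(self, file_id: str, offset: int, count: int) -> bytes: ...
    @abstractmethod
    def write(self, file_id: str, offset: int, data: bytes) -> int: ...   # returns bytes written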
A Distributed File System (DFS) is a distributed implementation of the traditional time-sharing model of a file system that enables users to store and access remote files in much the same way as local files. Consequently, the clients, servers, and storage devices of a distributed file system are geographically dispersed among the machines of a distributed system. File system design issues have experienced changes similar to those observed in operating system design, mainly with respect to the number of processes and users that can be supported by the system. Based on the number of processes and users, file systems can be classified into four types [Mullender, 1990]: 1) single-user/single-process file systems; 2) single-user/multiple-processes file systems; 3) multiple-users/multiple-processes centralized time-sharing file systems; and 4) multiple-users/multiple-processes geographically distributed file systems. The design issues in a single-user/single-process file system include how to name files, how to allocate files to physical storage, how to perform file operations, and how to maintain file system consistency against hardware and software failures. When we move to a single-user/multiple-processes file system, we also need to address concurrency control and how to detect and avoid deadlock situations that result from sharing resources. These issues become even more involved when we move to a multiple-users/multiple-processes file system. In this case, we need to address all the issues related to multiple concurrent processes as well as those related to protecting and securing user processes (security); the main security issues include user identification and authentication. In the most general type (a multiple-users/multiple-processes geographically distributed file system), the file system is implemented using a set of geographically dispersed file servers. The design issues here are more complex and challenging because the servers, clients, and the network(s) that connect them are typically heterogeneous and operate asynchronously. In this type, which we refer to as a distributed file system, the file services need to be access and name transparent, fault tolerant, highly available, secure, and of high performance. The design of a distributed file system that supports all of these features in an efficient and cost-effective way is a challenging research problem.

6.2 File System Characteristics and Requirements

Client applications and their file system requirements vary from one application to another. Some applications run on only one type of computer; others run on a cluster of computers. Each application type places different requirements on the file system. One can characterize the application requirements for a file system in terms of the file system role, file access granularity, file access type, protection, fault-tolerance, and recovery [svobodova].

• File System Role. The file system role can be viewed in terms of two extremes: 1) a storing device and 2) a full-scale filing system. The storing device appears to the users as a virtual disk that is mainly concerned with storage allocation, maintenance of data objects on the storage medium, and data transfer between the network and the medium.
The full-scale filing system provides all the services offered by the storing device plus additional functions such as controlling concurrent file accesses, protecting user files, enforcing the required access control, and a directory service that maps textual file names into the file identifiers recognized by the system software.

• File Access Granularity. There are three main granularities for accessing data in the file system: 1) file-level storage and retrieval, 2) page (block)-level access, and 3) byte-level access. The file access granularity significantly affects the latency and the sustainable file access rate. The appropriate access granularity depends on the type of applications and on the system requirements; some applications need the file system to support bulk transfer of whole files, in which case the appropriate access mode is file-level access, while other applications require efficient random access to small parts within a file, in which case the appropriate access mode could be byte-level.

• File Access Type. The file access mode can be broadly defined in terms of two access models: the upload/download model and the remote access model. In the upload/download model, when a client reads a file, the entire file is moved to the client's host before the read operation is carried out. When the client is done with the file, it is sent back to the server for storage. The advantage of this approach is that once the file is loaded in the client's memory, all file accesses are performed locally without the need to access the network. However, the disadvantages of this approach are twofold: 1) the load on the network increases because entire files are downloaded and uploaded, and 2) the client computer might not have enough memory to hold large files. Consequently, this approach limits the size of the files that can be accessed. Furthermore, experimental results showed that the majority of file accesses read a few bytes and then close the file; the life cycle of most files is within a few seconds [zip parallel file system]. In the remote access model, each file access is carried out by a request sent through the network to the appropriate server. The advantages of this approach are that 1) the users do not need large local storage in order to access the required files, and 2) the messages are small and can be handled efficiently by the network.

• Transparency. Ideally, a distributed file system (DFS) should look to its clients like a conventional, centralized file system. That is, the multiplicity and dispersion of servers and storage devices should be transparent to the clients. Transparency measures the system's ability to hide the geographic separation and heterogeneity of resources from the user and the application programmer, so that the system is perceived as a whole rather than as a collection of independent resources. The cost of implementing full transparency is prohibitive. Instead, several weaker types of transparency have been introduced, such as network transparency and mobile transparency.

• Network Transparency: Network transparency allows clients/users to access remote files using the same set of operations used to access local files. That means that remote and local file accesses become indistinguishable to users and applications. However, the time it takes to access remote files is longer because of the network delay.
• Mobile Transparency: This transparency refers to the ability of the distributed file system to allow users to log in to any machine available in the system, regardless of the users' locations; that is, the system does not force users to log in to specific machines. This transparency facilitates user mobility by bringing the users' environment (e.g., the home directory) to wherever they log in.

• Performance. In a centralized file system, the time it takes to access a file depends on the disk access time and the CPU processing time. In a distributed file system, a remote file access has two additional factors to be considered: the time it takes to transfer and process the file request at the remote server, and the time it takes to deliver the requested data from the server to the client/user. Furthermore, there is also the overhead associated with running the communication protocol on the client and server computers. The performance of a distributed file system can be interpreted as another dimension of its transparency; the performance of remote file access should be comparable to that of local file access [levy, 1990].

• Fault-tolerance. A distributed file system is considered fault-tolerant if it can continue providing its services, possibly in a degraded mode, when one or more of its components experience failures. The failure could be due to communication faults, machine failures, storage device crashes, or decay of storage media. The degradation can be in performance, functionality, or both. The fault-tolerance property is achieved by using redundant resources and complex data structures (transactions). In addition to redundancy, atomic operations and immutable files (files that can only be read but not written) have been used to guarantee the integrity of the file system and facilitate fault recovery.

• Scalability. Scalability measures the ability of the system to adapt to increased load and/or to the addition of heterogeneous resources. The performance of a distributed file system should degrade gracefully (moderately) as the system and network traffic increase. In addition, the addition of new resources should be smooth, with little overhead (e.g., adding new machines should not clog the network or increase file access time).

It is important to emphasize that it is the distribution property of a distributed file system that makes the system fault-tolerant and scalable, because of the inherent redundancy of resources in a distributed system. Furthermore, the geographic dispersion of the system's resources and activities must be hidden from the users and made transparent. Because of these characteristics, the design of a distributed file system is more complicated than that of a file system for a single-processor system. Consequently, the main issues emphasized in this chapter are related to transparency, fault-tolerance, and scalability.

6.3 File Model and Organization

The file model addresses the issues related to how a file should be represented and what types of operations can be performed on it. The types of files range from unstructured byte-stream files to highly structured files. For example, a file can be structured as a sequence of records or simply as a sequence of bytes. Consequently, different file systems have different models and different operations that can be performed on their files. Some file systems provide a single file model, such as the byte stream of the UNIX file system. Other file systems provide several file types (e.g., Indexed Sequential Access Method (ISAM) and record files in the VMS file system).
6.3 File Model And Organization
The file model addresses how a file should be represented and what types of operations can be performed on it. File types range from unstructured byte-stream files to highly structured files. For example, a file can be structured as a sequence of records or simply as a stream of bytes. Consequently, different file systems have different models and different operations that can be performed on their files. Some file systems provide a single file model, such as the byte stream in the UNIX file system. Other file systems provide several file types (e.g., Indexed Sequential Access Method (ISAM) and record files in the VMS file system). Hence, a bitmap image would be stored as a sequence of bytes in the UNIX file system, while it might be stored as a one- or two-record file in the VMS file system.
The organization of a file system can be described in terms of three modules or services (see Figure 6.1): 1) directory service, 2) file service, and 3) block service. These services can be implemented as independent co-operating components or integrated into one software component. In what follows, we review the design issues related to each of these three modules.
Figure 6.1. File service architecture: application programs on the client computer access the directory service and flat file service on the server computer through a client module.
6.3.1 Directory Service
The naming and mapping of files are provided by the directory service, which is an abstraction used for mapping between text names and file identifiers. The structure of the directory module is system dependent. Some systems combine the directory and file services into a single server that handles all directory operations and file calls. Others keep them separate; in that case, opening a file requires going to the directory server to map its symbolic name onto its binary name and then passing the binary name to the file server to actually read or write the file. In addition to naming files, the directory service controls file access using two techniques: capability-based and identity-based access control.
1. Capability-based approach: This approach uses a reference or name that acts as a token or key to access each file. A process is allowed to access a file when it possesses the appropriate capability. For example, a Unique File Identifier (UFID) can be used as a key, or capability, to protect against unauthorized access.
2. Identity-based approach: This approach requires each file to have an associated list that shows the users and the operations they are entitled to perform on that file. The file server checks the identity of each entity requesting a file access to determine whether the requested file operation can be granted based on the user's access rights.
6.3.2 File Service
The file service is concerned with issues related to file operations, access modes, file state and how it is maintained during file operations, file caching techniques, file sharing, and file replication.
• File Operation: The file operations include open, close, read, write, delete, and so on. The properties of these operations define and characterize the type of file service. Certain applications might require all file operations to be atomic; an atomic operation either completes successfully in its entirety or is aborted without any side effect on the system state. Other applications require files to be immutable; that is, files cannot be modified once they are created, so the set of allowed operations includes read and delete but not write. Immutable files are typically used to simplify fault recovery and consistency algorithms.
• File State: This issue concerns whether or not the file, directory, and other servers should maintain state information about their clients. Based on this issue, the file service can be classified into two types: stateful and stateless file service.
1. Stateful File Service: A file server is stateful when it keeps information about its clients' state and uses this information to process client file requests.
Such a service can be characterized by the existence of a virtual circuit between the client and the server during a file access session. The advantage of a stateful service is performance; file information is cached in main memory and can be accessed easily, thereby saving disk accesses. The disadvantage is that the state information is lost when the server crashes, which complicates the fault recovery of the file server.
2. Stateless File Service: A server is stateless when it does not maintain any information about a client once it has finished processing the client's file requests. A stateless server avoids state information by making each request self-contained; that is, each request identifies the file and the position of the read/write operation, so there is no need for the Open/Close operations required by a stateful file service.
The distinction between stateful and stateless service becomes evident when considering the effects of a crash during a service activity. When the server crashes, a stateful server usually restores its state by following an appropriate recovery protocol. A stateless server avoids this problem altogether by making every file request self-contained, so that it can be processed by any newly reincarnated server. In the other direction, when a client fails, a stateful server needs to become aware of the failure in order to reclaim the space allocated to record the state of the crashed client; in contrast, no obsolete state needs to be cleaned up on the server side in a stateless file service. The penalty for using a stateless file service is longer request messages and slower processing of file requests, since there is no in-core information to speed up their handling. A small sketch contrasting the two designs follows.
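The sketch below is a minimal illustration of the stateful/stateless distinction, assuming hypothetical request formats and an in-memory file table; it is not modeled on any particular file system. The stateless server requires each request to name the file and the position explicitly, so any restarted server instance can serve it.

# Stateful vs. stateless request handling (hypothetical request formats).

FILES = {"log.dat": b"0123456789abcdef"}

class StatefulServer:
    def __init__(self):
        self.open_table = {}            # per-client state: handle -> [file, offset]
        self.next_handle = 0

    def open(self, name):
        self.next_handle += 1
        self.open_table[self.next_handle] = [name, 0]
        return self.next_handle

    def read(self, handle, count):      # relies on in-core state; lost on a crash
        name, offset = self.open_table[handle]
        data = FILES[name][offset:offset + count]
        self.open_table[handle][1] += len(data)
        return data

class StatelessServer:
    def read(self, name, offset, count):   # self-contained request, no Open/Close needed
        return FILES[name][offset:offset + count]

s = StatefulServer(); h = s.open("log.dat")
print(s.read(h, 4), s.read(h, 4))              # the server remembers the file position
print(StatelessServer().read("log.dat", 8, 4)) # the client supplies the position itself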
The file service issues related to caching, replication, sharing, and fault tolerance will be discussed in further detail later in this chapter.
6.3.3 Block Service
The block service addresses the issues related to disk block operations and block allocation techniques. The block operations can be implemented either as a software module embedded within the file service or as a separate service. In some systems, a network disk server (e.g., in the Sun UNIX operating system) provides access to remote disk blocks for swapping and paging by diskless workstations. Separating the block service from the file service offers two advantages: 1) it separates the implementation of the file service from disk-specific optimizations and other hardware concerns, which allows the file service to use a variety of disks and other storage media; and 2) it allows several different file systems to be implemented on top of the same underlying block service.
6.4 Naming and Transparency
In a DFS, a user refers to a file by a textual name. Naming is the means of mapping textual (logical) names to physical devices. There is a multilevel mapping from this name to the actual blocks on a disk at some location, which hides from the user the details of how and where the file is located in the network. Furthermore, to improve file system availability and fault-tolerance, files can be stored on multiple file servers. In this case, the mapping between a logical file name and the actual physical name returns multiple physical locations, each of which contains a replica of the file. The mapping task in a distributed file system is more complicated than in a centralized file system because of the geographic dispersion of file servers and storage devices.
The most naive approach to naming files is to append the local file name to the name of the host at which the file is stored, as is done in the VMS operating system. This scheme guarantees that all files have unique names even without consulting other file servers. However, its main disadvantage is that files cannot be migrated from one system to another; if a file must be migrated, its name has to be changed and, furthermore, all the file's users must be notified. In this subsection, we discuss transparency issues related to naming files, naming techniques, and implementation issues.
6.4.1 Transparency Support
Ideally, a distributed file system (DFS) should look to its clients like a conventional, centralized file system. That is, the multiplicity and dispersion of servers and storage devices should be transparent to the clients. Transparency measures the system's ability to hide the geographic separation and heterogeneity of resources from the user and the application programmer, so that the system is perceived as a whole rather than as a collection of independent resources. The cost of implementing full transparency is prohibitive and challenging. Instead, several weaker types of transparency have been introduced. We discuss transparency issues that attempt to hide location, network, and mobility. In addition, we address the transparency issues related to file names and how to interpret them in a distributed computing environment.
• Naming transparency: Naming is a mapping between logical and physical objects. Usually, a user refers to a file by a textual name. The latter is mapped to a lower-level numerical identifier, which in turn is mapped to disk blocks. This multilevel mapping provides users with an abstraction of a file that hides the details of how and where the file is actually stored on disk. In a transparent DFS, a new dimension is added to the abstraction: that of hiding where in the network the file is located. Naming transparency can be interpreted in terms of two notions:
1. Location transparency: the name of a file does not reveal the file's physical storage location.
2. Location independence: the name of a file need not be changed when the file's physical storage location changes.
A location-independent naming scheme is a dynamic mapping, since it can map the same file name to different locations at different instants of time. Therefore, location independence is a stronger property than location transparency. When referring to location independence, one implicitly assumes that the movement of files is totally transparent to users; that is, files are migrated by the system without the users being aware of it. In practice, most current file systems (e.g., Locus, Sprite) provide a static, location-transparent mapping for user-level names. Only Andrew and some experimental file systems support location independence.
• Network transparency: Clients need to access remote files using the same set of commands used to access local files; that is, there is no difference between the commands used to access local and remote files. However, the time it takes to access remote files will be longer because of the network delay. Network transparency hides the differences between accessing remote and local files so that they become indistinguishable to users and applications.
Mobile transparency: This transparency defines the ability of the distributed file system to allow users to log in to any machine available in the system, regardless of the users' locations; that is, the system does not force users to log in to specific machines. It facilitates user mobility by bringing the users' environment to wherever they log in.
File names should not reveal any information about the location of the files, and furthermore the names should not have to be changed when files are moved from one storage location to another. Consequently, we can define two types of naming transparency: location transparency and location independence.
• Location Transparency. The name of a file does not reveal any information about its physical storage location. With location transparency, the file name is statically mapped to a set of physical disk blocks, although this mapping is hidden from the users. It provides users with the ability to share remote files as if they were local. However, sharing the storage itself is complicated, because the file name is statically mapped to particular physical storage devices. Most current file systems (e.g., NFS, Locus, Sprite) provide location-transparent mapping for file names [Levy, 1990].
• Location Independence. The name of a file need not be changed when the file's physical location changes. A location-independent naming scheme can be viewed as a dynamic mapping, since it can map the same file name to different locations at different instants of time. Therefore, location independence is a stronger property than location transparency. When referring to location independence, one implicitly assumes that the movement of files is totally transparent to users; that is, files are migrated by the system, for example to balance the load on its file servers and improve the overall performance of the distributed file system, without the users being aware of it. Only a few distributed file systems support location independence (e.g., Andrew and some experimental file systems).
6.4.2 Implementation Mechanisms
We now review the main techniques used to implement naming schemes in distributed file systems: pathname translation, mounting, unique file identifiers, and hints.
• Pathname Translation: In this method, a logical file is identified by a path name (e.g., /level1/level2/filename), which is translated by recursively looking up the low-level identifier of each directory in the path, starting from the root (/). If an identifier indicates that the sub-directory (or the file) is located on another machine, the lookup procedure is forwarded to that machine. This continues until the machine that stores the requested file has been identified. That machine then returns to the requesting client the low-level identifier of the file (filename) in its local file system. In some DFSs, such as NFS and Sprite, the file name lookup request is passed from one server to another until the server that stores the requested file is found. In the Andrew file system, each step of the lookup procedure is performed by the client. This option is scalable because the server is relieved from performing the lookup procedure needed to translate client file access requests.
• Mount Mechanisms: This scheme provides the means to attach remote file systems (or directories) to a local name space via a mount mechanism, as in Sun's NFS. Once a remote directory is mounted, its files can be named independently of their location.
This approach enables each machine on the network to specify the part of the file name space (such as executables and the home directories of users) that it shares with other machines, while keeping machine-specific directories local. Consequently, each user can access local and remote files according to his or her naming tree. However, this tree might differ from one computer to another, and thus the way a file is accessed is not independent of the location from which the request is issued.
• Unique File Identifier. In this approach there is a single global name space that is visible to all machines and spans all the files in the system; consequently, all files are accessed through this single global name space. The approach assigns each file a Unique File Identifier (UFID) that is used to address the file regardless of its location. Each file is associated with a component unit, and all files in a component unit are located on the same storage site. A file name is translated into a UFID that has two fields: the first contains the component unit to which the file belongs, and the second is a low-level identifier within the file system of that component unit. At run time, a table is maintained indicating the physical location of each component unit. Note that this method is truly location independent, since files are associated with component units whose actual location is unspecified except at bind time.
There are a number of ways to ensure the uniqueness of the UFIDs associated with different files. [Needham and Herbert, 1982; Mullender, 1985; Leach, 1983] all emphasize using a relatively large, sparsely populated identifier space to generate UFIDs. Uniqueness can be achieved by concatenating a number of identifying fields, and a random number can be added for further security. This can be done by concatenating the host address of the server creating the file with a field representing the position of the UFID in the chronological sequence of UFIDs created by that server. An extra field containing a random number is embedded into each UFID in order to combat any attempt at counterfeiting. This keeps the distribution of valid UFIDs sparse, and the UFID is long enough to make unauthorized access practically impossible. Figure 6.2 shows the format of a UFID together with the file control block (FCB) table that maps each UFID to the file's location on disk, size, and other attributes.
Figure 6.2. The UFID with a long, sparse identification number to prevent unauthorized access; each UFID indexes a file control block (FCB) recording the location on disk, size, and other attributes.
In this format, the server identifier is taken to be an internet address, ensuring uniqueness across all registered internet-based systems. Access control to a file is based on the fact that a UFID constitutes a 'key', or capability, for accessing the file; access control is therefore a matter of denying UFIDs to unauthorized clients. When a file is shared within a group, the owner of the file holds all the rights on the file, i.e., he or she can perform all types of operations: read, write, delete, truncate, and so on. The other members of the group hold lesser rights on the file, e.g., they may only read the file and are not authorized to perform the other operations. The most refined form of access control is achieved by embedding a permission field in the UFID, which encodes the access rights that the UFID confers upon its possessor. The permission field must be combined with the random part of the UFID before the UFID is given to users.
The permission field must be handled carefully so that it is not easily accessible; otherwise the access rights could easily be changed, e.g., from read to write. Whenever a file is created, its UFID is returned to the owner (creator) of the file. When the owner gives access rights to other users, some rights must be removed to restrict the capabilities of those users; this is done by a function in the file server intended for restricting capabilities. The two ways suggested in [Coulouris and Dollimore, 1988] by which a file server may hide its permission field are as follows (a sketch of the second technique is given at the end of this subsection):
1. The permission field and the random part are encrypted with a secret key before the UFID is issued to clients. When clients present UFIDs for file access, the file server uses the secret key to decrypt them.
2. The file server applies a one-way function to the two fields to produce the UFIDs issued to clients. When clients present UFIDs for file access, the file server applies the one-way function to its own copy of the UFID and compares the result with the client's UFID.
• Hints: Hinting is a technique often used for quick translation of file names. A hint is a piece of information that directly gives the location of the requested file; it speeds up performance if it is correct but causes no semantically negative effects if it is incorrect. Since looking up path names is a time-consuming procedure, especially if multiple directory servers are involved, some systems attempt to improve performance by maintaining a cache of hints. When a file is opened, the cache is checked to see whether the path name is there; if so, the directory-by-directory lookup is skipped and the binary address is taken from the cache.
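The following minimal sketch illustrates the second technique under stated assumptions: a fixed field layout, SHA-256 keyed with a server secret standing in for the one-way function, and single-byte permission flags. None of these choices are taken from a particular file system.

# UFID as a capability with a hidden permission field (illustrative assumptions only).
import hashlib, secrets

SERVER_SECRET = secrets.token_bytes(16)          # known only to the file server

def issue_ufid(file_id, permissions):
    """file_id is an 8-byte identifier; the check field binds permissions + random part."""
    rand = secrets.token_bytes(8)
    check = hashlib.sha256(SERVER_SECRET + file_id + rand + bytes([permissions])).digest()[:8]
    return file_id + rand + bytes([permissions]) + check    # 25-byte UFID given to the client

def verify_ufid(ufid, requested_right):
    file_id, rand, permissions, check = ufid[:8], ufid[8:16], ufid[16], ufid[17:25]
    expected = hashlib.sha256(SERVER_SECRET + file_id + rand + bytes([permissions])).digest()[:8]
    return check == expected and (permissions & requested_right) != 0

READ, WRITE = 0x1, 0x2
cap = issue_ufid(b"file0042", READ)              # read-only capability
print(verify_ufid(cap, READ), verify_ufid(cap, WRITE))   # True False

Tampering with the permission byte invalidates the check field, so a read-only capability cannot be upgraded to a write capability by its holder.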
6.5 File Sharing Semantics
File sharing semantics are an important criterion for evaluating any file system that allows multiple clients to share files. When two or more users share the same file, it is necessary to define the semantics of reading and writing in order to avoid problems such as data inconsistency or deadlock. The most common types of sharing semantics are 1) UNIX semantics, 2) session semantics, 3) immutable shared file semantics, and 4) transaction semantics.
6.5.1 UNIX Semantics
Under these semantics, writes to an open file by a client are immediately visible to other (possibly remote) clients that have the same file open at the same time. When a READ operation follows a WRITE operation, the READ returns the value just written. Similarly, when two WRITEs happen in quick succession followed by a READ, the value read is the value stored by the last write. It is also possible for clients to share the pointer to the current file location, so that advancing the pointer by one client affects all sharing clients. The system enforces an absolute time ordering on all operations and always returns the most recent value. In a distributed system, UNIX semantics can be achieved easily as long as there is only one file server and clients do not cache files; all READs and WRITEs then go directly to the file server, which processes them strictly sequentially. The sharing of the location pointer is needed primarily for compatibility of the distributed UNIX system with conventional UNIX software. In practice, the performance of a distributed system in which all file requests must be processed by a single server is frequently poor. This problem is often solved by allowing clients to maintain local copies of heavily used files in their private caches.
6.5.2 Session Semantics
Under these semantics, writes to an opened file are visible immediately to local clients but invisible to remote clients that have the same file open simultaneously. Once a file is closed, the changes made to it are visible only in sessions starting later; they are not reflected in already opened instances of the file. Session semantics raise the question of what happens when two or more clients are simultaneously caching and modifying the same file. Since each file's value is sent back to the server when it is closed, the client that closes last will overwrite the previous writes, and the updates of the other clients will be lost. The yellow pages are a good analogy: every year the phone company produces one telephone book that lists business and customer numbers. It is a database that is updated once a year, so the granularity is annual. The yellow pages are not accurate during the year because they are updated only at the end of the session. Accuracy in this example is not a big issue, because most customers search for a type of business in an area rather than for a specific business; the application relies on simplicity at the price of accuracy.
6.5.3 Immutable Shared File Semantics
Under these semantics, a file can only be opened for reading; once a file is created and declared shared by its creator, it cannot be modified any more. Clients cannot open a file for writing. If a client has to modify a file, it creates an entirely new file and enters it into the directory under the name of the previously existing file, which then becomes inaccessible. Just as with session semantics, when two processes try to replace the same file at the same time, either the latest one or, nondeterministically, one of them is chosen as the new file. This approach makes the implementation quite simple, since sharing is in read-only mode.
6.5.4 Transaction Semantics
Under these semantics, the operations on a file or a group of files are performed indivisibly. The process declares the beginning of the transaction using some type of BEGIN TRANSACTION primitive, signalling that what follows must be executed indivisibly; when the work has been completed, an END TRANSACTION primitive is executed. The key property of these semantics is that the system guarantees that all the calls contained within a transaction are carried out in order, without any interference from other concurrent transactions. If two or more transactions start at the same time, the system ensures that the final result is the same as if they were all run in some (undefined) sequential order.
6.6 Fault Tolerance And Recovery
Fault tolerance is an important attribute of a distributed system, and one that can be supported because of the inherent multiplicity of resources. There are many methods for improving the fault tolerance of a DFS; improving availability and using redundant resources are two common ones.
6.6.1 Improving Availability
Before discussing the availability of a file, we first define two related properties. A file is recoverable if it is possible to revert it to an earlier, consistent state when an operation on the file fails or is aborted by the client. A file is robust if it is guaranteed to survive crashes of the storage device and decay of the storage medium. A file is then called available if it can be accessed whenever needed, despite machine and storage device crashes and communication faults. Availability is often confused with robustness, probably because both can be implemented by redundancy techniques; a robust file is guaranteed to survive failures, but it may not be available until the failed component has recovered. Availability is a fragile and unstable property. First, it is temporal: availability varies as the system's state changes. Second, it is relative to a client: for one client a file may be available, whereas for another client on a different machine the same file may be unavailable. Replicating files can enhance availability [Thompson, 1931]; however, merely replicating a file is not sufficient. Several principles intended to increase the availability of files are described below; a prefix-table sketch follows the list.
• The number of machines involved in a file operation should be minimal, since the probability of failure grows with the number of involved parties. Once a file has been located, there is no reason to involve machines other than the client and the server machines.
• Identifying the server that stores the file and establishing the client-server connection is more problematic. The file location mechanism is an important factor in determining the availability of files. Traditionally, locating a file is done by pathname traversal, which in a DFS may cross machine boundaries several times and hence involve more than two machines [Thompson, 1931]. In principle, most systems, e.g., Locus and Andrew, approach the problem by requiring that each component (i.e., directory) in the pathname be looked up directly by the client. Therefore, when machine boundaries are crossed, the server in the client-server pair changes, but the client remains the same.
• If a file is located by pathname traversal, the availability of the file depends on the availability of all the directories in its pathname. A situation can arise in which a file is available to clients already reading and writing it, but cannot be located by new clients because a directory in its pathname is unavailable. Replicating top-level directories can partially rectify the problem, and is indeed used in Locus to increase the availability of files.
• Caching directory information can both speed up pathname traversal and avoid the problem of unavailable directories in the pathname (provided the caching occurs before the directory becomes unavailable). Andrew uses this technique. A better mechanism is used by Sprite: machines maintain prefix tables that map prefixes of pathnames to the servers that store the corresponding component units. Once a file in some component unit has been opened, all subsequent Opens of files within that same unit address the right server directly, without intermediate lookups at other servers. This mechanism is faster and guarantees better availability.
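The sketch below gives a minimal, Sprite-style prefix table: pathname prefixes are mapped to the servers storing the corresponding component units, and the longest matching prefix selects the server directly. The table contents and the lookup rule shown here are illustrative assumptions, not Sprite's actual data structures.

# A Sprite-style prefix table (illustrative contents).
PREFIX_TABLE = {
    "/": "server-root",
    "/users": "server-1",
    "/users/projects": "server-2",
}

def server_for(path):
    best = max((p for p in PREFIX_TABLE if path.startswith(p)), key=len)
    return PREFIX_TABLE[best]                     # longest matching prefix wins

print(server_for("/users/projects/report.txt"))   # -> server-2, no per-directory lookups
print(server_for("/etc/passwd"))                  # -> server-root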
6.6.2 File Replication
Replication of files is a useful scheme for improving availability, reducing communication traffic in a distributed system, and improving response time. Replication schemes can be classified into three categories: primary-stand-by, modular redundancy, and weighted voting [Yap, Jalote and Tripathi, 1988; Bloch, Daniels and Spector, 1987].
• Primary-stand-by: One copy is selected from the replicas and designated as the primary copy, while the others are stand-bys. All subsequent requests are sent to the primary copy only. The stand-by copies are not responsible for the service and are only synchronized with the primary copy periodically. In case of failure, one of the stand-by copies is selected as the new primary copy and the service goes on.
• Modular Redundancy: This approach makes no distinction between a primary copy and stand-by ones. Requests are sent to all replicas simultaneously and are performed by all copies. Therefore, a file request can be processed regardless of failures in the network and servers, provided there exists at least one accessible correct copy. The approach is costly because the replicas must be kept synchronized; as the number of replicas increases, availability decreases, since any update operation must lock all the replicas.
• Weighted Voting: In this scheme, all replicas of a file, called representatives, are assigned a certain number of votes. Access operations are performed on a collection of representatives called an access quorum; any access quorum that holds a majority of the total votes of all the representatives is allowed to perform the operation (a quorum-check sketch follows below). Such a scheme has maximum flexibility, since the size of the access quorum can change under various conditions. On the other hand, it may be too complicated to be feasible in most practical implementations.
A variant model [Chung, 1991], which combines the modular redundancy and primary-stand-by approaches, provides more flexibility with respect to system configuration. This model divides all copies of a file into several partitions. Each partition functions as a modular redundancy unit; one partition is selected as primary and the other partitions are backups. In this manner it balances the trade-off between the modular redundancy and primary-stand-by approaches.
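To make the voting condition concrete, the sketch below checks whether the set of reachable representatives holds a majority of the total votes. The vote assignment is invented, and real weighted-voting schemes typically distinguish read quorums from write quorums.

# Weighted-voting quorum check (illustrative vote assignment).
VOTES = {"replica-A": 3, "replica-B": 2, "replica-C": 2}   # total = 7 votes

def quorum_ok(reachable):
    return sum(VOTES[r] for r in reachable) > sum(VOTES.values()) / 2

print(quorum_ok({"replica-A", "replica-B"}))   # 5 of 7 votes -> True
print(quorum_ok({"replica-C"}))                # 2 of 7 votes -> False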
An important issue in file replication is determining the replication level of each file and the placement of the replicated copies needed to achieve satisfactory system performance. There are three strategies for solving the file allocation problem (FAP).
• Static File Allocation: Here the replicas are permanently allocated to specified sites. Based on an assessment of file access activity levels, costs, and system parameter values, the problem involves allocating file copies over a set of geographically dispersed computers so as to optimize an optimality criterion while satisfying a variety of system constraints. Static file allocation is intended for systems that have a stable level of file access intensity. The optimality objectives used in the past include system operating cost, transaction response time, and system throughput. Essentially, static file allocation problems are formulated as combinatorial optimization models in which the objective captures the allocation trade-offs in terms of the selected optimality criterion. Investigation of static file allocation problems was pioneered by W.W. Chu [Chu, 1969]. Since the FAP is NP-complete, much attention has been given to the development of heuristics that can generate good allocations with lower computational complexity. Branch-and-bound and graph searching methods are the typical solution techniques used to avoid enumerating the entire solution space.
• Dynamic File Allocation: If the file system is characterized by high variability in usage patterns, static file allocation will degrade performance and increase cost over the operational period. Dynamic file allocation is based on anticipated changes in the file access intensities; of course, the file reallocation costs incurred by this scheme have to be taken into account in the initial design. The dynamic file allocation problem is one of determining file reallocation policies over time, where reallocations involve the simultaneous creation, relocation, and deletion of file copies. Dynamic file allocation models can be classified as non-adaptive and adaptive. Initial research focused on non-adaptive models, while more recent studies have concentrated on adaptive policies. Recent research on adaptive models for the dynamic FAP achieves lower computational complexity by restricting reallocations to single-file reallocations only. To improve the applicability of research results on dynamic FAPs, it is necessary to study the problem structure under realistic schemes for file relocation, in conjunction with effective control mechanisms, and to develop specialized heuristics for practical implementations.
• File Migration: This is also referred to as file mobility or location independence. The main difference between dynamic FAP and file migration lies in the operations used to change a file assignment. Dynamic file reallocation models rely heavily on prior usage patterns of the system database, whereas file migration is not very sensitive to prior estimates of system usage patterns; it automatically reacts to temporary changes in access intensity by making the necessary adjustments in file locations without human management or operator intervention. Dynamic FAP considers file reallocations that might involve reallocating multiple replicas; such major changes could result in system-wide interruption of information services. In the file migration problem, each migration operation deals with only a single file copy. The evaluation of file migration policies has been investigated by several researchers [Gavish, 1990]. Since file migration deals with only a single file copy, an individual migration operation might be less effective than a complete file reallocation in improving system performance; however, selecting an optimal or near-optimal single operation is less complex than determining a complete reallocation. Therefore, file migration can be invoked more frequently, responding to system changes more rapidly than file reallocation.
6.6.3 Recoverability
A file server should ensure that the files it holds remain accessible after a system failure. The effect of a failure in a distributed system is much more pervasive than in its centralized counterpart, because clients and servers may fail independently; there is therefore a greater need to design server computers that can restore data after a system failure and protect it from permanent loss. In both conventional and distributed systems, disk hardware and driver software can be designed so that if the system crashes during a block write operation or during a data transfer, partially written or incorrect data are detected.
The use of stable storage in XDFS is worth mentioning here. Stable storage is redundant storage for structural information, implemented as a separate abstraction provided by the block service. It is essentially a means of protecting data from permanent loss after a system failure during a disk write operation, or after damage to any single disk block. Operations on stable blocks are implemented using two disk blocks, which hold the content of each stable block in duplicate. This implementation was developed by Lampson [Lampson, 1981], who defined a set of operations on stable blocks that mirror the block service operations; the block pointers indicate that stable storage blocks are to be distinguished from ordinary blocks. Generally, the duplicates of a stable block are expected to be stored on two different disk drives, so that both blocks cannot be damaged by a single failure and each block acts as a back-up for the other. The invariant maintained for each pair of blocks is:
• Not more than one block of the pair is bad.
• If both are good, they hold the most recent data, except during the execution of a stable put.
The stable get operation reads one of the blocks using get block; the other representative is read only when an error condition is detected. If a server crashes or halts during a stable put, a recovery process is invoked when the server is restarted. The recovery procedure maintains the invariant by inspecting each pair of blocks and doing the following:
• Both good and the same: do nothing.
• Both good but different: copy one block of the pair to the other.
• One good and one bad: copy the good block over the bad block.
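The sketch below captures the idea under simplifying assumptions: an in-memory pair of representatives stands in for two disk drives, and an MD5 checksum stands in for the block-level error detection mentioned above. It is a sketch of the technique, not the XDFS implementation.

# Stable storage in miniature: duplicated blocks, stable put/get, and recovery.
import hashlib

def _checksummed(data):
    return data + hashlib.md5(data).digest()      # data followed by a 16-byte checksum

def _good(block):
    return block is not None and hashlib.md5(block[:-16]).digest() == block[-16:]

class StableBlock:
    def __init__(self, data=b""):
        self.rep = [_checksummed(data), _checksummed(data)]   # ideally two disk drives

    def stable_put(self, data):
        self.rep[0] = _checksummed(data)          # a crash between these two writes
        self.rep[1] = _checksummed(data)          # is what recovery must repair

    def stable_get(self):
        block = self.rep[0] if _good(self.rep[0]) else self.rep[1]
        return block[:-16]

    def recover(self):
        a, b = self.rep
        if _good(a) and _good(b):
            if a != b:
                self.rep[1] = a                   # both good but different: copy one over
        elif _good(a):
            self.rep[1] = a                       # one good, one bad: copy the good block
        elif _good(b):
            self.rep[0] = b

blk = StableBlock(b"old state")
blk.rep[0] = _checksummed(b"new state")           # simulate a crash mid stable_put
blk.recover()
print(blk.stable_get())                           # b'new state' on both representatives now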
6.7 File Caching
Caching is a common technique used to reduce the time it takes a computer to retrieve information. The term cache is derived from the French word cacher, meaning "to hide." Ideally, recently accessed information is stored in a cache so that a subsequent access to the same information can be handled locally, without additional access time or burden on network traffic. When a request for information is made, the system's caching software takes the request and looks in the cache to see whether the information is available; if so, it is retrieved directly from the cache. If it is not present in the cache, the file is retrieved from its source, returned to the user, and a copy is placed in cache storage. Caching has been applied to the retrieval of data from numerous devices, such as hard and floppy disks, computer RAM, and network servers.
Caching techniques are used to improve the performance of file access. The performance gain that can be achieved depends heavily on the locality of references and on the frequency of read and write operations. In a client-server system in which both machines have a main memory and a disk, there are four potential places to store (cache) files: the server's disk, the server's main memory, the client's disk, and the client's main memory (Figure 2). The server's disk is the most straightforward place to store all files. Files on the server's disk are accessible to all clients, and since there is only one copy of each file, no consistency problems arise. However, the main drawback is performance: before a client can read a file, the file must first be transferred from the server's disk to the server's main memory, and then transferred over the network to the client's main memory.
Figure 2. The four file storage locations: (a) the server's disk, (b) the server's main memory, (c) the client's main memory, and (d) the client's disk.
Caches have been used in many operating systems to improve file system performance. Repeated accesses to a block in the cache can be handled without involving the disk. This has two advantages. First, caching reduces delays: a block in the cache can usually be returned to a waiting process five to ten times more quickly than one that must be fetched from the disk. Second, caching reduces contention for the disk arm, which is advantageous when several processes attempt to access files on the same disk simultaneously. However, since main memory is invariably smaller than the disk, the cache eventually fills up and some of the cached blocks must be replaced. If an up-to-date copy exists on the disk, the replaced cache block is simply discarded; otherwise, the disk is first updated before the cached copy is discarded.
A caching scheme in a distributed file system must address the following design decisions:
• The granularity of cached data.
• The location of the client's cache (main memory or local disk).
• How to propagate modifications of cached copies.
• How to determine whether a client's cached data is consistent.
The choices for these decisions are intertwined and related to the selected sharing semantics.
6.7.1 Cache Unit Size
The unit of caching can be either pages (blocks) of a file or the entire file itself. For access patterns with strong locality of reference, caching a large part of the file results in a high hit ratio, but at the same time the potential for consistency problems increases. Furthermore, if the entire file is cached, it can be stored contiguously on the disk (or at least in several large chunks), allowing high-speed transfers between disk and memory, which improves performance. Caching entire files also offers other advantages, such as fault-tolerance: remote failures are visible only at the time of open and close operations, which supports disconnected operation for clients that already have the file cached. Whole-file caching also simplifies cache management, since clients only have to keep track of files rather than individual pages. However, caching entire files has two drawbacks. First, files larger than the local memory space (disk or main memory) cannot be cached. Second, the latency of an open request is proportional to the size of the file and can be intolerable for large files. If parts (blocks) of a file are stored in the cache, the cache and disk space are used more efficiently. This scheme uses a read-ahead technique that reads blocks from the server disk and buffers them on both the server and client sides before they are actually needed, in order to speed up reading. Increasing the caching unit size increases the likelihood that the data for the next access will be found locally (i.e., the hit ratio is increased); on the other hand, the time required for the data transfer and the potential for consistency problems also increase. The choice of caching unit size also depends on the network transfer unit and the communication protocol being used. The earlier versions of the Andrew file system (AFS-1 and AFS-2), Coda, and Amoeba cache entire files. AFS-3 uses partial-file caching, but this has not demonstrated substantial advantages in usability or performance over the earlier versions. When caching is done at large granularity, considerable performance improvement can be obtained by using specialized bulk transfer protocols, which reduce the latency associated with transferring an entire file.
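As an illustration of block-granularity client caching, the sketch below keeps an LRU-ordered set of cached blocks and writes dirty victims back on eviction. The server interface (read_block, write_block) and the capacity are assumptions, not drawn from any of the systems named above.

# A block-level client cache with LRU replacement and dirty-block write-back.
from collections import OrderedDict

class BlockCache:
    def __init__(self, server, capacity=4):
        self.server, self.capacity = server, capacity
        self.blocks = OrderedDict()          # (file, block_no) -> [data, dirty?]

    def read(self, file, block_no):
        key = (file, block_no)
        if key not in self.blocks:           # miss: fetch the block from the server
            self._install(key, [self.server.read_block(file, block_no), False])
        self.blocks.move_to_end(key)         # mark as most recently used
        return self.blocks[key][0]

    def write(self, file, block_no, data):
        self._install((file, block_no), [data, True])   # delayed-write: cache only

    def _install(self, key, entry):
        self.blocks[key] = entry
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.capacity:
            victim, (data, dirty) = self.blocks.popitem(last=False)   # evict LRU block
            if dirty:
                self.server.write_block(*victim, data)  # only dirty blocks go back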
6.7.2 Cache Location
The cache can be located at the server side, the client side, or both; furthermore, it can reside either in main memory or on disk. Server caching eliminates a disk access on each hit, but it still requires using the network to reach the server. Caching at the client side can avoid using the network altogether. Disk caches have a clear advantage in reliability and scalability: modifications to the cached data are not lost when the system crashes, and there is no need to fetch the data again during recovery. Disk caches also contribute to scalability by reducing network traffic and server load after client crashes. In Andrew and Coda the cache is on the local disk, with a further level of caching provided by the UNIX kernel in main memory. On the other hand, caching in main memory has four advantages. First, main-memory caches permit workstations to be diskless, which makes them cheaper and quieter. Second, data can be accessed more quickly from a cache in main memory than from a cache on disk. Third, the physical memories of client workstations are now large enough to provide high hit ratios. Fourth, the server caches will be in main memory regardless of where the client caches are located. Thus main-memory caches emphasize reduced access time, while disk caches emphasize increased reliability and the autonomy of individual machines. If the designers decide to put the cache in the client's main memory, three options are possible (Figure 3):
1. Caching within each process. The simplest way is to cache files directly inside the address space of each user process. Typically, the cache is managed by the system call library: as files are opened, closed, read, and written, the library simply keeps the most heavily used ones around so that when a file is reused it may already be available. When the process exits, all modified files are written back to the server. Although this scheme has extremely low overhead, it is effective only if individual processes open and close files repeatedly.
2. Caching in the kernel. The kernel can dynamically decide how much memory to reserve for programs and how much for the cache. The disadvantage is that a kernel call is needed on every cache access, even on cache hits.
3. A cache manager as a user process. The advantage of a user-level cache manager is that it keeps the microkernel operating system free of file system code. In addition, the cache manager is easier to program because it is completely isolated, and it is more flexible.
• Write Policy: The write policy determines how modified cache blocks (dirty blocks) are written back to their files on the server. The write policy has a critical effect on the system's performance and reliability. There are three write policies: write-through, delayed-write, and write-on-close.
1. Write-through: Data is written through to the disk as soon as it is placed in any cache. A write-through policy is equivalent to using remote service for writes and exploiting the cache for reads only. This policy has the advantage of reliability, since little data is lost if the client crashes.
However, this policy requires each write access to wait until the information is written to disk, which results in poor write performance. Write-through and variations of the delayed-write policy are used to implement UNIX-like sharing semantics.
2. Delayed-write: Blocks are initially written only to the cache and are written through to the disk or server some time later. This policy has two advantages over write-through. First, since writes go to the cache, a write access completes much more quickly. Second, data may be deleted before it is written back, in which case it need not be written at all. Thus a policy that delays writes for several minutes can substantially reduce the traffic to the server or disk. Unfortunately, delayed-write schemes introduce reliability problems, since unwritten data is lost whenever a server or client crashes. The Sprite file system uses this policy with a 30-second delay interval.
3. Write-on-close: Data is written back to the server when the file is closed. The write-on-close policy is suitable for implementing session semantics, but it fails to give a considerable performance improvement for files that are open only briefly, and it increases the latency of close operations. This approach is used in the Andrew file system and the Network File System (NFS).
There is a tight relation between the write policy and the file sharing semantics. Write-on-close is suitable for session semantics. When concurrent file updates occur frequently in conjunction with UNIX semantics, the use of delayed-write results in long delays; a write-through policy is more suitable for UNIX semantics under such circumstances.
6.7.3 Client Cache Coherence in a DFS
Cache coherence means that a read always returns the latest copy of the file. In UNIX systems, the user can always access the latest update. In a distributed computing environment, the coherence problem arises especially when caching is done at the client's machine. One way to relax the problem is to rely on sessions: open the file, update it, and then close it, so that coherence is based on session semantics rather than on per-read and per-write semantics; propagating every user update immediately would generate high traffic. The following methods are used to maintain coherence (according to a chosen model, e.g., UNIX semantics or session semantics) among copies of the same file at various clients; a sketch of the corresponding write policies follows this list.
• Write-through: writes are sent to the server as soon as they are performed at the client. This generates high traffic and requires cache managers to check (e.g., by modification time) with the server before they can provide cached content to any client.
• Delayed-write: coalesces multiple writes, giving better performance but more ambiguous semantics.
• Write-on-close: implements session semantics.
• Central control: the file server keeps a directory of open/cached files at the clients, which yields UNIX semantics, but raises problems of robustness and scalability; there is also a problem with invalidation messages, because clients did not solicit them.
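The sketch below shows how the three write policies might be parameterized in a client-side cached file object. The flush interval, the policy names, and the server call (store_file) are illustrative assumptions rather than the mechanism of any particular system.

# Write-through, delayed-write, and write-on-close as a policy parameter.
import time

class CachedFile:
    def __init__(self, server, name, policy="write-through", delay=30):
        self.server, self.name, self.policy, self.delay = server, name, policy, delay
        self.data, self.dirty, self.last_flush = bytearray(), False, time.monotonic()

    def write(self, offset, chunk):
        self.data[offset:offset + len(chunk)] = chunk
        self.dirty = True
        if self.policy == "write-through":
            self._flush()                            # every write reaches the server
        elif self.policy == "delayed-write" and time.monotonic() - self.last_flush > self.delay:
            self._flush()                            # e.g., Sprite's 30-second interval

    def close(self):
        if self.policy in ("write-on-close", "delayed-write") and self.dirty:
            self._flush()                            # session-style propagation on close

    def _flush(self):
        self.server.store_file(self.name, bytes(self.data))
        self.dirty, self.last_flush = False, time.monotonic()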
6.7.4 Cache Validation and Consistency
Cache validation is required to find out whether the data in the cache is a stale copy of the master copy. If the client determines that its cached data is out of date, future accesses can no longer be served by that cached data, and an up-to-date copy must be brought over from the file server. There are two basic approaches to verifying the validity of cached data.
1. Client-initiated approach. The client contacts the server to check whether both have the same version of the file. The check is usually done by comparing header information, such as a time stamp of the last update or a version number (e.g., the time stamp of the last update maintained in the i-node information in UNIX). The frequency of the validity check is the crux of this approach: it can vary from a check on every access to a check initiated at fixed intervals. Depending on this frequency, the validity checks can cause severe network traffic and consume precious server CPU time, and when a check is performed on every access, each file access experiences more delay than one served immediately from the cache. This is why the Andrew designers abandoned this approach.
2. Server-initiated approach. Whenever a client caches an object, the server hands out a promise (called a callback or a token) that it will inform the client before allowing any other client to modify that object (a sketch of this callback mechanism is given at the end of this subsection). This approach enhances performance by reducing network traffic, but it also increases the responsibility of the server in direct proportion to the number of clients being served, which is not a good feature for scalability. The server records, for each client, the (parts of) files the client caches; maintaining information about clients has significant fault-tolerance implications. A potential for inconsistency occurs when a file is cached in conflicting modes by two different clients (i.e., at least one of the clients specified a write mode). If session semantics are implemented, whenever a server receives a request to close a file that has been modified, it should react by notifying the clients to discard their cached data and consider it invalid. Clients having the file open at that time discard their copy when the current session is over; other clients discard their copy at once. Under session semantics the server need not be informed about Opens of already cached files, but it is informed about the Close of a writing session. On the other hand, if a more restrictive sharing semantics is implemented, such as UNIX semantics, the server must be more involved: it must be notified whenever a file is opened, and the intended mode (read or write) must be indicated. Given such notification, the server can act when it detects that a file has been opened simultaneously in conflicting modes by disabling caching for that particular file (as done in Sprite). Disabling caching results in switching to a remote-access mode of operation. The problem with the server-initiated approach is that it violates the traditional client-server model, in which clients initiate activities by requesting services; such a violation can result in irregular and complex code for both clients and servers.
The implementation techniques for cache consistency checking depend on the semantics used for sharing files. Caching entire files is a perfect match for session semantics: read and write accesses within a session can be handled by the cached copy, since the file can be associated with different images according to the semantics. The cache consistency problem is then reduced to propagating the modifications performed in a session to the master copy at the end of the session. This model is quite attractive because it has a simple implementation. Observe, however, that coupling these semantics with caching of parts of files may complicate matters, since a session is supposed to read the image of the entire file as it was at the time the file was opened.
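Returning to the server-initiated approach, the sketch below shows callback-style invalidation in miniature. The client and server classes and their methods are hypothetical and ignore concurrency, failures, and the session/UNIX semantics distinctions discussed above.

# Server-initiated validation: callback promises are broken before an update is accepted.

class CallbackServer:
    def __init__(self):
        self.files = {"doc": b"v1"}
        self.callbacks = {}                    # file -> set of clients holding a promise

    def fetch(self, client, name):
        self.callbacks.setdefault(name, set()).add(client)   # hand out a callback promise
        return self.files[name]

    def store(self, writer, name, data):
        for client in self.callbacks.pop(name, set()) - {writer}:
            client.break_callback(name)        # invalidate before accepting the update
        self.files[name] = data

class Client:
    def __init__(self, server, label):
        self.server, self.label, self.cache = server, label, {}

    def read(self, name):
        if name not in self.cache:             # valid while the callback promise holds
            self.cache[name] = self.server.fetch(self, name)
        return self.cache[name]

    def break_callback(self, name):
        self.cache.pop(name, None)

srv = CallbackServer()
a, b = Client(srv, "A"), Client(srv, "B")
print(a.read("doc"), b.read("doc"))
srv.store(b, "doc", b"v2")                     # A's callback is broken, its cache emptied
print(a.read("doc"))                           # A refetches and sees b'v2'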
A distributed implementation of UNIX semantics using caching has serious consequences. The implementation must guarantee that at all times only one client is allowed to write to any of the cached copies of the same file, and a distributed conflict resolution scheme must be used to arbitrate among clients wishing to access the same file in conflicting modes. In addition, once a cached copy is modified, the changes need to be propagated immediately to the rest of the cached copies. Frequent writes can generate tremendous network traffic and cause long delays before requests are satisfied. This is why some implementations (e.g., Sprite) disable caching altogether and resort to remote service once a file is concurrently open in conflicting modes. Observe that such an approach implies some form of server-initiated validation scheme, in which the server makes a note of all Open calls. As was stated, UNIX semantics lend themselves to an implementation in which all requests are directed to and served by a single server. The immutable shared file semantics eliminate the cache consistency problem entirely, since the files cannot be written. Transaction-like semantics can be implemented in a straightforward manner using locking, with all requests for the same file served by the same server on the same machine, as is done in remote service. For session semantics, cache consistency can easily be implemented by propagating changes to the master copy when the file is closed. For implementing UNIX semantics, a write to a cached copy has to be propagated not only to the server but also to the other clients holding a stale copy in their caches; this may lead to poor performance, which is why many DFSs (such as Sprite) switch to remote service when a client opens a file in a conflicting mode.
Write-back caching is used in Sprite and Echo. Andrew and Coda use a write-through policy, for implementation simplicity and to reduce the chances of server data being stale due to client crashes. Both of these systems use deferred write-back while operating in disconnected mode during server or network failures. Maintaining cache coherence is unnecessary if the data in the cache is treated as a hint and validated upon use. File data in a cache cannot be used as a hint, since using a cached copy will not reveal whether it is current or stale. Hints are most often used for file location information in a DFS: Andrew, for instance, caches individual mappings of volumes to servers, and Sprite caches mappings of pathname prefixes to servers. Caching can thus handle a substantial amount of remote access in an efficient manner, which leads to performance transparency. It is also believed that client caching is one of the main contributing factors to fault-tolerance and scalability. Caching can be used effectively by studying the usage properties of files. For instance, write-through could be chosen if we know that sequential write-sharing of user files is uncommon; executables, which are frequently read but rarely written, are very good candidates for caching. In a distributed system it may be very costly to enforce transaction-like semantics, as required by databases, which exhibit poor locality, fine granularity of update and query, and frequent concurrent and sequential write sharing.
In such cases, it is best to provide explicit means outside the scope of the DFS. This is the approach followed in the Andrew and Coda DFSs.
6.7.5 Comparison of Caching and Remote Service
The choice between caching and remote service is a choice between potential performance improvement and simplicity. The advantages and disadvantages of the two methods are as follows:
• With a caching scheme, a substantial amount of remote access can be handled efficiently by the local cache; in a DFS, the goal of such a scheme is to reduce network traffic. With remote access, there is excessive overhead in network traffic and an increase in server load.
• The total network overhead of transmitting data in big chunks, as done in caching, is lower than that of transmitting a series of short responses to specific requests.
• The cache consistency problem is the major drawback of caching. When writes are frequent, the consistency mechanisms incur substantial overhead in terms of performance, network traffic, and server load.
• To use caching and benefit from it, clients must have either local disks or large main memories. Clients without disks can use the remote-service method without any problems.
• In caching, data is transferred in bulk between the server and the client rather than in response to the specific needs of a file operation. Therefore, the interface of the server is quite different from that of the client. In remote service, on the other hand, the server interface is just an extension of the local file system interface across the network.
• It is hard to emulate the sharing semantics of a centralized system (UNIX sharing semantics) in a system that uses caching, whereas with remote service it is easier to implement and maintain UNIX sharing semantics.
6.8 Concurrency Control
6.8.1 Transactions in a Distributed File System
The term atomic transaction describes a single client carrying out a sequence of operations on a shared file without interference from another client. The net result of every transaction must be the same as if each transaction had been performed at a completely separate instant of time. An atomic transaction (in a file service) enables a client program to define a sequence of operations on a file that is executed without interference from any other client program, ensuring a consistent result. A file server that supports transactions must synchronize the operations to guarantee this. Also, if a file undergoing modification by the file service faces an unexpected server or client process halt, due to a hardware error or a software fault, before the transaction is completed, the server ensures the subsequent restoration of the file to the state it was in before the transaction started. Though it consists of a sequence of operations, an atomic transaction can be viewed as a single-step operation from the client's point of view, taking the file from one stable state to another: either the transaction completes successfully, or the file is restored to its original state. An atomic transaction must satisfy two criteria to prevent conflicts between two client processes requesting operations on the same data item concurrently. First, it should be recoverable. Second, the concurrent execution of several atomic transactions must be serially equivalent, i.e., the effect of several concurrent transactions must be the same as if they had been executed one at a time. To ensure the atomicity of transactions, concurrency control is performed via locking, time stamping, optimistic concurrency control, and so on; the details are explained later, in the concurrency control chapter.
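As a toy illustration of the all-or-nothing property (ignoring the concurrency control mechanisms discussed later), the sketch below applies updates to a private shadow copy and installs them in one step at commit; the in-memory store and its API are invented for this example.

# BEGIN/END TRANSACTION in miniature: updates go to a shadow copy until commit.

class TransactionalFileStore:
    def __init__(self):
        self.files = {"acct": b"balance=100"}

    def begin(self):
        return dict(self.files)               # private shadow copy for this transaction

    def write(self, txn, name, data):
        txn[name] = data                      # visible only inside the transaction

    def end(self, txn):
        self.files = txn                      # install all the changes indivisibly

store = TransactionalFileStore()
t = store.begin()
store.write(t, "acct", b"balance=80")
store.write(t, "log", b"debit 20")
store.end(t)                                  # both updates appear, or neither
print(store.files)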
To ensure the atomicity of transactions, concurrency control is performed via locking, timestamping, optimistic concurrency control, and so on; the details are explained later in the concurrency control chapter.
6.9 Security and Protection
To encourage sharing of files between users, the protection mechanism should allow a wide range of policies to be specified. As we discussed in the directory service, there are two important techniques for controlling access to files: 1) capability-based access, and 2) access control lists. A client process that holds a valid Unique File Identifier (UFID) can use the file service to access the corresponding file, with the help of the directory service, which stores mappings from users' names for files to UFIDs. When a user or a service passes an authentication check, such as a name or password check, it is given a UFID, which generally contains a large, sparsely allocated number to make counterfeiting difficult. UFIDs are issued after the user or service is registered in the system. Authentication is done by the authentication service, which maintains a table of user names, service names, passwords, and the corresponding user identifiers (IDs). Each file has an owner (initially the creator), recorded in the attributes of the created file, and this ownership information is subsequently used by the identity-based file access control scheme. An access control list contains the user IDs of all users who are entitled to access the file, directly or indirectly. Generally, the owner of the file can perform all file operations through the file service, while other users have lesser rights to the same file (e.g., read-only). The users of a file can be classified, based on their requirements and their need to access the file, as follows.
• The file's owner.
• The directory service, which is responsible for controlling access and for mapping the file's text names.
• A client that is given special permission to access the file on behalf of the owner in order to manage the file's contents, and that is therefore recognized by the system manager.
• All other clients.
In large distributed systems, simple extensions of the mechanisms used by time-sharing operating systems are not sufficient. For example, some systems implement authentication by sending a password to the server, which then validates it. Besides being risky, this gives the client no assurance of the identity of the server. Security can instead be built on the integrity of a relatively small number of servers, rather than a large number of clients, as is done in the Andrew file system, where the authentication function is integrated with the RPC mechanism. When a user logs on, the user's password is used as a key to establish a connection with an authentication server. This server hands the user a pair of authentication tokens, which the user then uses to establish secure RPC connections with any other server. Tokens expire periodically, typically in 24 hours. When making an RPC call to a server, the client supplies a variable-length identifier; the server looks up the corresponding encryption key to verify the identity of the client. At the same time, the client is assured that the server is capable of looking up its key and hence can be trusted. Randomized information in the handshake guards against replays. It is important that the authentication servers and file servers run on physically secured hardware and trustworthy software. Furthermore, there may be multiple redundant instances of the authentication server, for greater availability. A simplified sketch of such a token-based handshake is given below.
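The following is a minimal, illustrative sketch of a token-style handshake, not the actual AFS or Kerberos protocol: an authentication server signs a token carrying the user's identity and an expiry time, and a file server later verifies the signature, the expiry, and a nonce before serving the request. All names (issue_token, verify_request, SERVER_SECRET, and so on) are hypothetical.

    # Illustrative token-based authentication sketch (hypothetical names; not
    # the real AFS/Kerberos protocol). A secret shared by the authentication
    # server and the file servers signs tokens; a nonce guards against replays.
    import hashlib, hmac, json, os, time

    SERVER_SECRET = os.urandom(32)      # known to auth server and file servers
    TOKEN_LIFETIME = 24 * 3600          # tokens expire, typically in 24 hours

    def issue_token(user):
        """Authentication server: after checking the password (omitted), sign a token."""
        body = json.dumps({"user": user, "expires": time.time() + TOKEN_LIFETIME})
        sig = hmac.new(SERVER_SECRET, body.encode(), hashlib.sha256).hexdigest()
        return {"body": body, "sig": sig}

    def verify_request(token, nonce, seen_nonces):
        """File server: validate the token signature, its expiry, and the replay nonce."""
        expected = hmac.new(SERVER_SECRET, token["body"].encode(),
                            hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, token["sig"]):
            return None                 # forged or corrupted token
        claims = json.loads(token["body"])
        if time.time() > claims["expires"] or nonce in seen_nonces:
            return None                 # expired token or replayed request
        seen_nonces.add(nonce)
        return claims["user"]           # identity under which the request is served

    # Usage: the client logs in once, then presents the token with each RPC.
    token = issue_token("alice")
    print(verify_request(token, nonce=os.urandom(8), seen_nonces=set()))

A real system would additionally encrypt the RPC traffic with a session key; the sketch above only illustrates the expiry and replay checks.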
Access Rights
In a DFS there is more data to protect, and from a larger number of users. The access privileges provided by the native operating systems are either inadequate or absent, so some DFSs, such as Andrew and Coda, maintain their own schemes for deciding access rights. Andrew implements a hierarchical access-list mechanism, in which a protection domain consists of users and groups. Membership privileges in a group are inherited, and a user's privileges are the sum of the privileges of all the groups to which he or she belongs. Also, privileges are specified for a unit of the file system, such as a directory, rather than for individual files. Both these factors simplify the state information that must be maintained. Negative access rights can also be specified, allowing a user's access to be revoked quickly without removing him or her from critical groups; in case of conflicts, negative rights overrule positive rights.
6.10 Case Studies
6.10.1 SUN Network File System (NFS)
The Network File System (NFS) has been designed, specified, and implemented by Sun Microsystems Inc. since 1985.
Overview
NFS views a set of interconnected workstations as a set of independent machines with independent file systems. It allows some degree of sharing among these file systems, in a transparent manner, based on a client-server relationship. A machine may be both a client and a server. Sharing is allowed between any pair of machines, not only with dedicated server machines. Consistent with the independence of machines is the fact that sharing a remote file system affects only the client and no other machine; hence there is no notion of a globally shared file system as in Locus, Sprite, and Andrew.
Advantages of Sun's NFS
• It supports diskless Sun workstations entirely by way of the NFS protocol.
• It provides the facility for sharing files in a heterogeneous environment of machines, operating systems, and networks. Sharing is accomplished by mounting a remote file system and then reading or writing files in place.
• It is open-ended, and users are encouraged to interface it with other systems. It was not designed by extending SunOS into the network; instead, operating system independence was taken as an NFS design goal, along with machine independence, simple crash recovery, transparent access, maintenance of UNIX file system semantics, and reasonable performance.
These advantages have made NFS a standard in the UNIX industry today.
NFS Description
NFS provides transparent file access among computers of different architectures over one or more networks, keeping the different file structures and operating systems transparent to users. A brief description of its salient points is given below. The NFS protocol is a set of primitives that defines the operations that can be performed on a distributed file system. The protocol is defined in terms of a set of Remote Procedure Calls (RPCs), their arguments and results, and their effects.
NFS protocol
1. RPC and XDR: The RPC mechanism is implemented as a library of procedures plus a specification for portable data transmission, known as the External Data Representation (XDR). Together with RPC, XDR provides a standard I/O library for interprocess communication. The RPCs used to define the NFS protocol are null(), lookup(), create(), remove(), getattr(), setattr(), read(), write(), rename(), link(), symlink(), readlink(), mkdir(), rmdir(), readdir(), and statfs(). The most common NFS procedure parameter is a structure called a file handle, which is provided by the server and used by the client to refer to the file in subsequent calls.
2. Stateless protocol: The NFS protocol is stateless, because each transaction stands on its own; the server does not keep track of any past client requests.
3. Transport independence: A new transport protocol can be plugged into the RPC implementation without affecting the higher-level protocol code. In the current implementation, NFS uses UDP/IP as the transport protocol.
The UNIX operating system does not guarantee that its internal file identifiers are unique within a local area network: in a distributed system, the identifier of a file or file system may be the same as that of another file on a remote system. To solve this problem, Sun added a new file system interface to the UNIX operating system kernel that can uniquely locate and identify both local and remote files. This interface consists of the Virtual File System (VFS) interface and the Virtual Node (vnode) interface; instead of the i-node, the operating system deals with the vnode.
Figure 6.5 Schematic view of the NFS architecture: on both the client and the server, the system-call interface sits above the VFS interface, which routes requests either to the local UNIX 4.2 file system on disk or to the NFS client/server code, which communicates over the network via RPC/XDR.
When a client makes a request to access a file, the request goes through the VFS, which uses the vnode to determine whether the file is local or remote. If it is local, the vnode refers to the i-node and the file is accessed like any other UNIX file. If it is remote, the file handle is identified, and the RPC protocol is used to contact the remote server and obtain the required file. The VFS defines the procedures and data structures that operate on the file system as a whole, while the vnode interface defines the procedures that operate on an individual file within that file system type.
Pathname Translation
Pathname translation is done by breaking the path into component names and performing a separate NFS lookup call for every pair of component name and directory vnode; thus, lookups are performed remotely by the server. Once a mount point is crossed, every component lookup causes a separate RPC to the server. This expensive pathname traversal is needed because each client has a unique layout of its logical name space, dictated by the mounts it has performed. A directory name lookup cache at the client, which holds the vnodes for remote directory names, speeds up references to files with the same initial pathname. A cache entry is discarded when attributes returned from the server do not match the attributes of the cached vnode.
Caching
There is a one-to-one correspondence between the regular UNIX system calls for file operations and the NFS protocol RPCs, with the exception of opening and closing files. Hence a remote file operation can be translated directly into the corresponding RPC. Conceptually, NFS adheres to the remote service paradigm, but in practice buffering and caching techniques are used for the sake of performance, so that no strict correspondence exists between a remote operation and an RPC. File blocks and file attributes are fetched by the RPCs and cached locally. Caches are of two types: the file block cache and the file attribute (i-node information) cache. On a file open, the kernel checks with the remote server whether to fetch or revalidate the cached attributes, by comparing the time-stamps of the last modification. The cached file blocks are used only if the corresponding cached attributes are up to date; a sketch of this validation is given below.
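The sketch below is an illustrative, heavily simplified rendering of this attribute-based validation, with hypothetical names (FakeServer, open_cached); a real NFS client also applies attribute-cache timeouts and caches individual blocks rather than whole files.

    # Simplified sketch of NFS-style client caching: cached file data is used
    # only if the cached attributes (here just the modification time) are
    # still current. All class and method names are hypothetical.
    class FakeServer:
        def __init__(self):
            self.files = {}                      # handle -> (mtime, data)

        def getattr(self, handle):
            return self.files[handle][0]         # server-side modification time

        def read(self, handle):
            return self.files[handle][1]         # whole-file read, for brevity

    class CachingClient:
        def __init__(self, server):
            self.server = server
            self.cache = {}                      # handle -> (cached mtime, data)

        def open_cached(self, handle):
            mtime = self.server.getattr(handle)  # revalidate attributes on open
            if handle in self.cache and self.cache[handle][0] == mtime:
                return self.cache[handle][1]     # attributes match: use the cache
            data = self.server.read(handle)      # stale or missing: refetch
            self.cache[handle] = (mtime, data)
            return data

    server = FakeServer()
    server.files["h1"] = (100, b"version 1")
    client = CachingClient(server)
    print(client.open_cached("h1"))              # fetched from the server
    server.files["h1"] = (101, b"version 2")     # file modified remotely
    print(client.open_cached("h1"))              # mtime changed, so refetched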
Both read-ahead and delayed-write techniques are used between the client and the server. Clients do not free delayed-write blocks until the server confirms that the data has been written to disk. This performance tuning makes it difficult to characterize the sharing semantics of NFS. New files created on a machine may not be visible elsewhere for a short period of time, and it is indeterminate whether writes to a file at one site are visible to other sites that have the file open for reading; new opens of the file observe only those changes that have already been flushed to the server. Thus NFS fails to provide a strict emulation of UNIX semantics.
Summary
• Logical name structure: A global name hierarchy does not exist; every machine establishes its own view of the name structure, with its own root serving as a private and absolute point of reference for that view. Users thus enjoy some degree of independence, flexibility, and privacy, sometimes at the expense of administrative complexity.
• Remote service: When a file is accessed transparently, I/O operations are performed according to the remote service method; the data in the file is not fetched all at once, and the remote site potentially participates in each read and write operation.
• Fault tolerance: The stateless approach to the design of the servers results in resiliency to client, server, and network failures.
• Sharing semantics: NFS does not provide UNIX semantics for concurrently open files.
Figure 6.8 Local and remote file systems accessible on an NFS client. The file system mounted at /usr/students in the client is actually the sub-tree located at /export/people in Server 1; the file system mounted at /usr/staff in the client is actually the sub-tree located at /nfs/users in Server 2.
6.10.2 Sprite
Sprite is distinguished by its performance, and it uses main-memory caching as the main tool to achieve it. Sprite is an experimental distributed system developed at the University of California at Berkeley as part of the SPUR project, whose goal was the design and construction of a high-performance multiprocessor workstation.
Overview
The designers of Sprite envisioned the next generation of workstations as powerful machines with vast main memories of 100 to 500 MB. By caching files from dedicated servers, these large physical memories compensate for the lack of local disks in diskless workstations.
Features
Sprite uses ordinary files to store the data and stacks of running processes, instead of the special disk partitions used by many versions of UNIX. This simplifies process migration and enables flexibility in, and sharing of, the space allocated for swapping. In Sprite, clients can read random pages from a server's (physical) cache faster than from a local disk, which shows that a server with a large cache may provide better performance than a local disk. The interface provided by Sprite is very similar to the one provided by UNIX: a single tree encompasses all the files and devices in the network, making them equally and transparently accessible from every workstation. Location transparency is complete, i.e., a file's network location cannot be discerned from its name.
Description
Caching
An important aspect of the Sprite file system is that it capitalizes on large main memories and advocates diskless workstations, storing file caches in main memory (in-core).
The same caching scheme is used both to avoid local disk accesses and to speed up remote accesses. In Sprite, file information is cached in the main memories of both servers (workstations with disks) and clients (workstations wishing to access files on non-local disks). The caches are organized on a block basis, each block being 4 KB. Sprite does not use read-ahead to speed up sequential reads; instead, it uses a delayed-write approach to handle file modifications. Exact emulation of UNIX semantics is one of Sprite's goals, and a hybrid cache validation method is used to this end. Each file is associated with a version number. When a client opens a file, it obtains the file's current version number from the server and compares it to the version number associated with the cached blocks for that file. If the version numbers differ, the client discards all cached blocks for the file and reloads its cache from the server as the blocks are needed.
Looking up files with prefix tables
Sprite presents its users with a single file system hierarchy that is composed of several subtrees called domains; each server stores one or more domains. Each machine maintains a server map called a prefix table, which maps domains to servers. This mapping is built and updated dynamically by a broadcast protocol. Every entry in a prefix table corresponds to one of the domains. It contains the pathname of the topmost directory in the domain, the network address of the server storing the domain, and a numeric designator identifying the domain's root directory for the storing server. This designator is an index into the server's table of open files, and it saves repeating an expensive name translation. Every lookup operation for an absolute pathname starts with the client searching its prefix table for the longest prefix matching the given file name (a small sketch of this longest-prefix matching is given just before the summary below). The client strips the matching prefix from the file name and sends the remainder of the name to the selected server, along with the designator from the prefix table entry. The server uses this designator to locate the root directory of the domain and then proceeds with the usual UNIX pathname translation for the remainder of the file name. When the server succeeds in completing the translation, it replies with a designator for the open file.
Location Transparency
Like almost all modern network file systems, Sprite achieves location transparency. This means that users can manipulate files in the same ways they did under time-sharing on a single machine; the distributed nature of the file system and the techniques used to access remote files are invisible to users under normal conditions. Most network file systems fail to meet this transparency goal in one or more ways. The earliest systems allowed remote file access only through a few special programs. Second-generation systems allow any application to access files on any machine in the network, but special names must be used for remote files. Third-generation network file systems, such as Sprite and Andrew, provide transparency. Sprite provides complete transparency, so applications running on different workstations see the same behavior they would see if all of the applications were executing on a single time-shared machine. Sprite also provides transparent access to remote I/O devices. Like UNIX, Sprite represents devices as special files; unlike most versions of UNIX, however, it allows any process to access any device, regardless of the device's location.
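As a minimal sketch (with hypothetical names, ignoring the broadcast protocol and the server-side RPCs), the longest-prefix lookup over a prefix table can be expressed as follows.

    # Sketch of a Sprite-style prefix-table lookup (illustrative only).
    # Each entry maps a domain's topmost directory to (server, root designator).
    prefix_table = {
        "/":          ("server-a", 1),
        "/usr":       ("server-b", 7),
        "/usr/local": ("server-c", 3),
    }

    def lookup(path):
        """Find the longest matching prefix and the remainder to send to its server."""
        matches = [p for p in prefix_table
                   if path == p or path.startswith(p.rstrip("/") + "/")]
        best = max(matches, key=len)
        server, designator = prefix_table[best]
        remainder = path[len(best):].lstrip("/")
        return server, designator, remainder

    # The client would now send `remainder` plus the designator to `server`,
    # which resolves it relative to the domain's root directory.
    print(lookup("/usr/local/bin/ls"))   # ('server-c', 3, 'bin/ls')
    print(lookup("/etc/passwd"))         # ('server-a', 1, 'etc/passwd')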
Summary
The Sprite file system can be summarized by the following points.
• Semantics: Sprite sacrifices some performance in order to emulate UNIX semantics, thereby giving up the possibility and benefits of caching in big chunks.
• Extensive use of caching: Sprite is inspired by the vision of diskless workstations with huge main memories and accordingly relies heavily on caching.
• Prefix tables: For LAN-based systems, prefix tables are an efficient, dynamic, versatile, and robust mechanism for file lookup; their advantages are the built-in facility for processing whole prefixes of pathnames and the supporting broadcast protocol that allows dynamic changes to the tables.
6.10.3 Andrew File System
Andrew, which is distinguished by its scalability, is a distributed computing environment that has been under development since 1983 at CMU. The Andrew file system constitutes the underlying information-sharing mechanism among users of the environment. The most formidable requirement of Andrew is its scale.
Overview
Andrew distinguishes between client machines and dedicated server machines. Clients are presented with a partitioned space of file names: a local name space and a shared name space. A collection of dedicated servers, collectively called Vice, presents the shared name space to the clients as an identical and location-transparent file hierarchy. The local name space is the root file system of a workstation, from which the shared name space descends (see the figure below). Workstations are required to have local disks, where they store their local name space, whereas the servers are collectively responsible for the storage and management of the shared name space. The local name space is small, is distinct on each workstation, and contains the system programs essential for autonomous operation and better performance, temporary files, and files the workstation owner explicitly wants to store locally for privacy reasons.
Figure 6.8. Andrew's name space: the local name space (e.g., /tmp, /bin, /vmunix) and the shared name space (e.g., /cmu), connected by symbolic links from local directories into the shared space.
The key mechanism selected for remote file operations is whole-file caching. Opening a file causes it to be cached, in its entirety, on the local disk. Reads and writes are directed to the cached copy without involving the servers. Whole-file caching has many merits, but it cannot efficiently accommodate remote access to very large files; thus, a separate design will have to address the use of large databases in the Andrew environment.
Features
• User mobility: Users are able to access any file in the shared name space from any workstation. The only noticeable effect when a user accesses files from a workstation other than the usual one would be some initial performance degradation due to the caching of files.
• Heterogeneity: Defining a clear interface to Vice is a key to the integration of diverse workstation hardware and operating systems. To facilitate heterogeneity, some files in the local /bin directory are symbolic links pointing to machine-specific executable files residing in Vice.
• Protection: Andrew provides access lists for protecting directories and the regular UNIX bits for file protection. The access-list mechanism is based on a recursive group structure.
Figure 6.9 Distribution of processes in the Andrew File System: each workstation runs user programs, the Venus process, and the UNIX kernel, while each server runs the Vice process on top of the UNIX kernel; workstations and servers communicate over the network.
Description
Scalability of the Andrew File System
There are no magic guidelines to ensure the scalability of a system, but the Andrew file system employs several methods to make it scalable.
Location Transparency
Andrew offers true location transparency: the name of a file contains no location information. Rather, this information is obtained dynamically by clients during normal operation. Consequently, administrative operations such as the addition or removal of servers and the redistribution of files among them are transparent to users. In contrast, some file systems require users to explicitly identify the site at which a file is located. Location transparency can be viewed as a binding issue: the binding of location to name is static and permanent when pathnames with embedded machine names are used, whereas the binding is dynamic and flexible in Andrew. Usage experience has confirmed the benefits of a fully dynamic location mechanism in a large distributed environment.
Client Caching
The caching of data at clients is undoubtedly the architectural feature that contributed most to scalability in a distributed file system. Caching has been an integral part of the Andrew design from the beginning. In implementing caching, one has to make three decisions: where to locate the cache, how to maintain cache coherence, and when to propagate modifications. Andrew caches on the local disk, with a further level of file caching by the UNIX kernel in main memory. Disk caches contribute to scalability by reducing network traffic and server load, since their contents survive client reboots. Cache coherence can be maintained in two ways. One approach is for the client to validate a cached object upon use. A more scalable approach is used in Andrew: when a client caches an object, the server hands out a promise (called a callback, or token) that it will notify the client before allowing any other client to modify that object. Although more complex to implement, this approach minimizes server load and network traffic, thus enhancing scalability. Callbacks further improve scalability by making it viable for clients to translate pathnames entirely locally. A minimal sketch of such a callback scheme is given below, after the discussion of bulk data transfer. Existing systems use one of two approaches to propagating modifications from the client to the server. Write-back caching, used in Sprite, is the more scalable approach; Andrew uses a write-through caching scheme, which is a notable exception to scalability being the dominant design consideration in Andrew.
Bulk Data Transfer
An important issue related to caching is the granularity of data transfers between client and server. Andrew uses whole-file caching. This enhances scalability by reducing server load, because clients need only contact servers on file open and close requests; the far more numerous read and write operations are invisible to the servers and cause no network traffic. Whole-file caching also simplifies cache management, because clients only have to keep track of files, not individual pages, in their cache. When caching is done at large granularity, considerable performance improvement can be obtained by the use of a specialized bulk data-transfer protocol. Network communication overhead caused by protocol processing typically accounts for a major portion of the latency in a distributed file system, and transferring data in bulk reduces this overhead.
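The following sketch (illustrative, with hypothetical class names; not the actual Vice/Venus RPC interface) shows the essence of callback-based coherence: the server records a callback promise for every client that caches a file and breaks the other clients' callbacks when one of them stores a new version, and a client trusts its cached copy only while its callback is intact.

    # Sketch of callback-based cache coherence (hypothetical classes; not the
    # real Vice/Venus interface).
    class Server:
        def __init__(self):
            self.files = {}        # name -> contents
            self.callbacks = {}    # name -> set of clients holding a callback

        def fetch(self, name, client):
            self.callbacks.setdefault(name, set()).add(client)  # promise to notify
            return self.files[name]

        def store(self, name, data, client):
            self.files[name] = data
            for other in self.callbacks.get(name, set()) - {client}:
                other.break_callback(name)                       # revoke their promise
            self.callbacks[name] = {client}

    class Client:
        def __init__(self, server):
            self.server, self.cache, self.valid = server, {}, set()

        def break_callback(self, name):
            self.valid.discard(name)       # cached copy may now be stale

        def open(self, name):
            if name not in self.valid:     # no callback held: fetch from the server
                self.cache[name] = self.server.fetch(name, self)
                self.valid.add(name)
            return self.cache[name]        # reads and writes use the local copy

    server = Server(); server.files["f"] = b"v1"
    a, b = Client(server), Client(server)
    a.open("f"); b.open("f")
    server.store("f", b"v2", a)            # a stores a new version of the file
    print(b.open("f"))                     # b's callback was broken, so it refetches b"v2"

In Andrew the store would normally take place when the client closes a modified file; the sketch leaves that detail out.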
Token-Based Mutual Authentication
The approach used in Andrew to implement authentication is to provide a level of indirection using authentication tokens. When a user logs in to a client, the password typed in is used as the key to establish a secure RPC connection to an authentication server. A pair of authentication tokens is then obtained for the user over this secure connection. These tokens are saved by the client and are used by it to establish secure RPC connections to file servers on behalf of the user. Like a file server, an authentication server runs on physically secure hardware. To improve scalability and to balance load, there are multiple instances of the authentication server; only one instance accepts updates, while the others are slaves and respond only to queries.
Hierarchical Groups and Access Lists
Controlling access to data is substantially more complex in large-scale systems than in smaller ones: there is more data to protect and there are more users about whom access control decisions must be made. To enhance scalability, Andrew organizes its protection domain hierarchically and supports a full-fledged access-list mechanism. The protection domain is composed of users and groups. Membership in a group is inherited, and a user's privileges are the cumulative privileges of all the groups he or she belongs to, either directly or indirectly. Andrew uses an access-list mechanism for file protection; the total rights specified for a user are the union of the rights specified for him or her and for the groups he or she belongs to. Access lists are associated with directories rather than with individual files. The reduction in state obtained by this design decision provides a conceptual simplicity that is valuable at large scale. Although the real enforcement of protection is done on the basis of access lists, Venus superimposes an emulation of UNIX protection semantics by honoring the owner component of the UNIX mode bits on a file. The combination of access lists on directories and mode bits on files has proved to be an excellent compromise between protection at fine granularity, scalability, and UNIX compatibility.
Data Aggregation
In a large system, considerations of interoperability and system administration assume major significance. To facilitate these functions, Andrew organizes file system data into volumes. A volume is a collection of files located on one server and forming a partial subtree of the Vice name space. Volumes are invisible to application programs and are manipulated only by system administrators. The aggregation of data provided by volumes reduces the apparent size of the system as perceived by operators and system administrators. Operational experience with Andrew confirms the value of the volume abstraction in a large distributed file system.
Decentralized Administration
A large distributed system is unwieldy to manage as a monolithic entity. For smooth and efficient operation, it is essential to delegate administrative responsibility along lines that parallel institutional boundaries. Such a system decomposition has to balance site autonomy with the desirable but conflicting goal of system-wide uniformity in human and programming interfaces. The cell mechanism of AFS-3 is an example of a mechanism that provides this balance. A cell corresponds to a completely autonomous Andrew system, with its own protection domain, authentication and file servers, and system administrators. A federation of cells can cooperate in presenting users with a uniform, seamless file name space.
Heterogeneity
As a distributed system evolves, it tends to grow more diverse. One factor contributing to diversity is the improvement in performance and decrease in cost of hardware over time, which makes it likely that the most economical hardware configurations will change over the period of growth of the system.
Another source of heterogeneity is the use of different computer platforms for different applications. Andrew did not set out to be a heterogeneous computing environment. Initial plans for it envisioned a single type of client running one operating system, with the network constructed of a single type of physical medium. Yet heterogeneity appeared early in its history and proliferated with time. Some of this heterogeneity is attributable to the decentralized administration typical of universities, but much of it is intrinsic to the growth and evolution of any distributed system. Coping with heterogeneity is inherently difficult because of the presence of multiple computational environments, each with its own notions of file naming and functionality. The PC Server [Rifkin, 1987] is used to perform this function in the Andrew environment.
Summary
The highlights of the Andrew file system are:
• Name space and service model: Andrew explicitly distinguishes between local and shared name spaces, as well as between clients and servers. Clients have small, distinct local name spaces and can access the shared name space managed by the servers.
• Scalability: Andrew is distinguished by its scalability; the strategy adopted to address scale is whole-file caching in order to reduce server load. Servers are not involved in read and write operations, and the callback mechanism was introduced to reduce the number of validity checks.
• Sharing semantics: Andrew's semantics, which are simple and well defined, ensure that a file's updates are visible across the network only after the file has been closed.
6.10.4 Locus
Overview
Locus is an ambitious project aimed at building a full-scale operating system. The features of Locus are automatic management of replicated data, atomic file updates, remote tasking, the ability to tolerate failures to a certain extent, and a full implementation of nested transactions. Locus has been operational at UCLA for several years on a set of mainframes and workstations connected by an Ethernet. The main component of Locus is its DFS, which presents a single tree-structured naming hierarchy to users and applications; this structure covers all the objects of all the machines in the system. Locus is a location-transparent system, i.e., the location of an object in the network cannot be determined from its name.
Features
Fault tolerance issues receive special emphasis in Locus. Network failures may split the network into two or more disconnected subnetworks (partitions). As long as at least one copy of a file is available in a subnetwork, read requests are served, and it is still guaranteed that the version read is the most recent one available in that disconnected subnetwork. Upon reconnection of these subnetworks, automatic mechanisms take care of updating stale copies of files. Seeking high performance in the design of Locus led to incorporating network functions such as formatting, queuing, transmitting, and retransmitting messages into the operating system. Specialized remote procedure call protocols were devised for kernel-to-kernel communication, and avoiding the multilayering suggested in the ISO standard made it possible to achieve high performance for remote operations. A file in Locus may correspond to a set of copies (replicas) distributed in the system. It is the responsibility of Locus to maintain consistency and coherence among the versions of a file. Users have the option to choose the number and locations of the replicas. A sketch of how the most recent available replica is selected within a partition is given below.
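As a minimal, hypothetical sketch of the read guarantee just described (the data structures and names are illustrative, not Locus kernel code): among the replicas reachable inside a partition, the copy with the highest version number is chosen, so the version read is the most recent one available in that partition.

    # Sketch: read the most recent replica available within a network partition.
    # Each replica carries a version number (as Locus i-nodes do); the names
    # and data below are illustrative only.
    replicas = {
        "site-a": {"version": 7, "data": b"seventh revision"},
        "site-b": {"version": 9, "data": b"ninth revision"},
        "site-c": {"version": 9, "data": b"ninth revision"},
    }

    def read_in_partition(reachable_sites):
        """Serve a read from the newest replica reachable in this partition."""
        available = [replicas[s] for s in reachable_sites if s in replicas]
        if not available:
            raise IOError("no replica of the file is reachable in this partition")
        return max(available, key=lambda r: r["version"])["data"]

    # A partition containing only site-a still serves reads, but may return an
    # older version than exists elsewhere; updates would be refused unless the
    # primary copy were in the partition.
    print(read_in_partition({"site-a"}))             # b"seventh revision"
    print(read_in_partition({"site-a", "site-b"}))   # b"ninth revision"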
Description
Locus uses its logical name structure to hide both location and replication details from users and applications. A removable file system in Locus is called a filegroup; the filegroup is the component unit of the system. Logically, filegroups are joined together to form a unified structure; physically, a logical filegroup is mapped to multiple physical containers (packs) residing at various sites and storing replicas of the files of that filegroup. These containers correspond to disk partitions. One of the packs is designated as the primary copy. A file must be stored at the site of the primary copy and can be stored at any subset of the other sites where there exists a pack corresponding to its filegroup; hence, the primary copy stores the filegroup completely, while the other packs may be partial. The various copies of a file are assigned the same i-node slot on all the packs of the filegroup, and a pack keeps an empty i-node slot for the files it does not store. Data page numbers may differ between packs, so references to data pages over the network use logical page numbers rather than physical ones; each pack has a mapping from these logical numbers to physical numbers. To facilitate automatic replication management, the i-node of each file copy contains a version number, which determines which copy dominates the others.
Synchronization of Accesses to Files
Locus distinguishes three logical roles in file accesses, each of which may be performed by a different site:
1. The Using Site (US) issues the requests to open and access a remote file.
2. The Storage Site (SS) is the site selected to serve those requests.
3. The Current Synchronization Site (CSS) enforces a global synchronization policy for a filegroup and selects an SS for each open request referring to a file in the filegroup.
There is at most one CSS for each filegroup in any set of communicating sites. The CSS maintains the version number and a list of physical containers for every file in the filegroup.
Reconciliation of filegroups at partitioned sites
The basic approach to achieving fault tolerance in Locus is to maintain, within a single subnetwork, consistency among the copies of a file. The policy is to allow updates only in a partition that has the primary copy, while guaranteeing that the most recent version of a file within a partition is read. To achieve this, the system maintains a commit count for each filegroup, enumerating each commit of every file in the filegroup. The commit operation consists of moving the in-core i-node to the disk i-node. Each pack has a lower-water-mark (lwm), a commit count value up to which the system guarantees that all prior commits are reflected in the pack. The primary copy pack (stored at the CSS) keeps, in secondary storage, a list enumerating the files in the filegroup and the commit counts of the recent commits. When a pack joins a partition, it attempts to contact the CSS and checks whether its lwm is within the range of the recent commit list. If so, the pack site schedules a kernel process that brings the pack to a consistent state by copying only those files that reflect commits later than the site's lwm (a sketch of this check is given below). If the CSS is not available, writing is disallowed in the partition, but reading is possible after a new CSS is chosen. The new CSS communicates with the partition members to keep itself informed of the most recent available version of each file in the filegroup; other pack sites can then reconcile with it.
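A hypothetical sketch of the reconciliation check just described: the CSS keeps a window of recent (commit count, file) pairs, and a joining pack copies only the files whose commits are later than its own lwm; if its lwm has fallen outside that window, the cheap check is no longer sufficient. The names and structures below are illustrative, not Locus kernel code.

    # Sketch of Locus-style pack reconciliation using commit counts.
    # The CSS keeps a list of recent commits for the filegroup as
    # (commit_count, file) pairs; a pack records the lwm up to which it is
    # known to be current.
    recent_commits = [(101, "f1"), (102, "f3"), (103, "f1"), (104, "f2")]

    def files_to_refresh(pack_lwm):
        """Return the files a joining pack must copy, or None if its lwm is too old."""
        oldest_listed = recent_commits[0][0]
        if pack_lwm < oldest_listed - 1:
            return None           # lwm outside the list: a cheap catch-up is impossible
        return sorted({f for count, f in recent_commits if count > pack_lwm})

    print(files_to_refresh(102))  # ['f1', 'f2']  -> copy the files from commits 103, 104
    print(files_to_refresh(104))  # []            -> the pack is already up to date
    print(files_to_refresh(90))   # None          -> must inspect the entire i-node space

The None case corresponds to the situation described below, in which the site must inspect its entire i-node space to determine which files in its pack are out of date.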
As a result, all communicating sites see the same view of the filegroup, and this view is as complete as possible given the particular partition. Since updates are allowed only in the partition holding the primary copy, while reads are allowed in the other partitions, it is possible to read out-of-date replicas of a file. Thus Locus sacrifices consistency for the ability to continue to both update and read files in a partitioned environment. When a pack is too far out of date, the system invokes an application-level process to bring the filegroup up to date. At that point the system lacks sufficient knowledge of the most recent commits to identify the missing updates, so the site inspects its entire i-node space to determine which files in its pack are out of date.
Summary
An overall profile of Locus can be summarized by the following points.
• Distributed operating system: Due to the multiple dimensions of transparency in Locus, it comes close to the definition of a distributed operating system, in contrast to a collection of network services.
• Implementation strategy: The implementation resides within the operating system kernel, the common pattern being kernel-to-kernel communication via specialized high-performance protocols.
• Replication: Locus uses a primary-copy replication scheme, whose main merit is the increased availability of directories, which exhibit a high read-to-write ratio.
• Access synchronization: UNIX semantics are emulated to the last detail in spite of caching at multiple USs; alternatively, locking facilities are provided.
• Fault tolerance: Among the fault tolerance mechanisms in Locus are an atomic update facility, the merging of replicated packs after recovery, and a degree of independent operation of partitions.
Conclusion:
In this chapter we discussed the characteristics of the distributed file system, which is a set of services provided to the client (user). We introduced terms such as NFS, the Network File System, which is one of the chapter's case studies, as is the Andrew file system. The chapter started by studying file system characteristics and requirements; in that section we defined the file system's role, file access granularity, file access type, transparency, network transparency, mobile transparency, performance, fault tolerance, and scalability. In the file model and organization section, we compared the directory service, file service, and block service and gave examples showing the differences between them. In the naming and transparency section, we discussed transparency issues related to naming files, naming techniques, and implementation issues; for the naming of files, we presented naming transparency, network transparency, mobile transparency, and location independence. Later in the chapter we defined the sharing semantics; the most common types of sharing semantics are UNIX semantics, session semantics, immutable shared semantics, and transaction semantics. Fault tolerance is an important attribute of a distributed system, so a section of this chapter discussed methods of improving the fault tolerance of a DFS, which can be summarized as improving availability and using redundant resources. Due to its importance in distributed file systems, caching occupied a large part of the discussion in this chapter. Caching is a common technique used to reduce the time it takes for a computer to retrieve information.
Ideally, recently accessed information is stored in a cache so that subsequent accesses to the same information can be handled locally, without additional access time or burdens on network traffic. Designing a cache system is not easy; many factors must be taken into consideration, such as the cache unit size, cache location, client location, and cache validation and consistency. The chapter then went briefly over concurrency control and security issues, which will be discussed in depth in the coming chapters, and ended by discussing case studies such as Sun NFS and the Andrew file system.
Questions:
1) What is a distributed file system?
2) Briefly summarize the file system requirements.
3) What are the most common types of file sharing semantics?
4) Why is fault tolerance an important attribute of file systems?
5) Why do we need caching? What factors should be taken into consideration when designing cache systems?
6) Briefly summarize the case studies provided in this chapter.
7) Present a file system that is not discussed in this chapter and compare it to NFS and AFS.
References:
[Chu, 1969] W. W. Chu, "Optimal file allocation in a minicomputer information system," IEEE Transactions on Computers, Vol. C-18, No. 10, Oct. 1969.
[Gavish, 1990] Bezalel Gavish and Olivia R. Liu Sheng, "Dynamic file migration in distributed computer systems," Communications of the ACM, Vol. 33, No. 2, Feb. 1990.
[Chung, 1991] Hsiao-Chung Cheng and Jang-Ping Sheu, "Design and implementation of a distributed file system," Software—Practice and Experience, Vol. 21, No. 7, pp. 657-675, July 1991.
[Walker, 1983] B. Walker et al., "The LOCUS distributed operating system," Proc. of the 1983 ACM SIGOPS Conference, pp. 49-70, Oct. 1983.
[Coulouris and Dollimore, 1988] George F. Coulouris and Jean Dollimore, "Distributed Systems: Concepts and Design," Addison-Wesley, 1988, pp. 18-20.
[Nakano] X. Jia, H. Nakano, Kentaro Shimizu, and Mamoru Maekawa, "Highly Concurrent Directory Management in Galaxy Distributed Systems," Proceedings of the International Symposium on Databases in Parallel and Distributed Systems, pp. 416-426.
[Thompson, 1978] K. Thompson, "UNIX Implementation," Bell System Technical Journal, Vol. 57, No. 6, Part 2, pp. 1931-1946, 1978.
[Needham] R. M. Needham and A. J. Herbert, "The Cambridge Distributed Computing System," Addison-Wesley.
[Lampson, 1981] B. W. Lampson, "Atomic Transactions," in Lampson et al. (eds.), Distributed Systems: Architecture and Implementation, 1981.
[Gifford, 1988] D. K. Gifford, "Violet: An experimental decentralized system," ACM Operating Systems Review, Vol. 13, No. 5.
"Network Programming," Sun Microsystems Inc., May 1988.
[Rifkin, 1987] Rifkin, R. L. Hamilton, M. P. Sabrio, S. Shah, and K. Yueh, "RFS Architectural Overview," USENIX, 1987.
[Gould, 1987] Gould, "The Network File System implemented on 4.3 BSD," USENIX, 1987.
[Howard, 1988] J. Howard et al., "Scale and Performance in a Distributed File System," ACM Transactions on Computer Systems, pp. 51-81, Feb. 1988.
"The design of a capability-based operating system," The Computer Journal, Vol. 29, No. 4, pp. 289-300.
[Levy, 1990A] E. Levy and A. Silberschatz, "Distributed File Systems: Concepts and Examples," ACM Computing Surveys, pp. 321-374, Dec. 1990.
[Nelson] M. Nelson et al., "Caching in the Sprite Network File System," ACM Transactions on Computer Systems.
[Mullender, 1990] Sape J. Mullender and Guido van Rossum, "Amoeba: A Distributed Operating System for the 1990s," CWI, Centre for Mathematics and Computer Science.
[Needham, 1988] Gifford, D. K., Needham, R. M., and Schroeder, M. D., "The Cedar File System," Communications of the ACM, Vol. 31, pp. 288-298, March 1988.
[Levy, 1990B] Levy, E., and Silberschatz, A., "Distributed File Systems: Concepts and Examples," ACM Computing Surveys, Vol. 22, pp. 321-374, Dec. 1990.
[Tanenbaum, 1992] Andrew S. Tanenbaum, "Modern Operating Systems," Prentice-Hall Inc., Chapter 13, pp. 549-587, 1992.
[Nelson, 1988] Nelson, M. N., Welch, B. B., and Ousterhout, J. K., "Caching in the Sprite Network File System," ACM Transactions on Computer Systems, Vol. 6, pp. 134-154, Feb. 1988.
[Richard, 1997] Richard S. Vermut, "File Caching on the Internet: Technical Infringement or Safeguard for Efficient Network Operation?," 4 J. Intell. Prop. L. 273 (1997).