High Performance Distributed Computing
Chapter 1
Introduction: Basic Concepts
Objective of this chapter:
Chapter one will provide an introduction to distributed systems and how to
characterize them. In addition, the chapter will describe the evolution of distributed
systems as well as the research challenges facing the design of general purpose high
performance distributed systems.
Key Terms
Complexity, Grid structure, High performance distributed system.
1.1 Introduction
The last two decades spawned a revolution in the world of computing, a move away from
central mainframe-based computing to network-based computing. Today, workstation
servers are quickly achieving the levels of CPU performance, memory capacity, and I/O
bandwidth once available only in mainframes, at a cost an order of magnitude below that of
mainframes. Workstations are being used to solve computationally intensive problems in
science and engineering that once belonged exclusively to the domain of supercomputers.
A distributed computing system is a system architecture that makes a collection of
heterogeneous computers or workstations act and behave as a single computing
system. In such a computing environment, users can uniformly access and
name local or remote resources, and run processes from anywhere in the system, without
being aware of which computers their processes are running on. Many claims have been
made for distributed computing systems. In fact, it is hard to rule out any desirable
feature of a computing system that has not been claimed to be offered by a distributed
system [Comer et al, 1991]. However, the recent advances in computing, networking and
software have made it feasible to achieve the following advantages:
• Increased Performance: The existence of multiple computers in a distributed system
allows applications to be processed in parallel and thus improve the application and
system performance. For example, the performance of a file system can be improved by
replicating its functions over several computers; the file replication allows several
applications to access that file system in parallel. Also, file replication results in
distributing the network traffic to access that file over different sites and thus reduces
network contention and queuing delays.
• Sharing of Resources: Distributed systems enable efficient access for all the system
resources. Users can share special purpose and sometimes expensive hardware and
software resources such as database server, compute server, virtual reality server,
multimedia information server and printer server, just to name a few.
• Increased Extendibility: Distributed systems can be designed to be modular and
adaptive so that for certain computations the system will configure itself to include a
large number of computers and resources while in other instances, it will just consist of a
few resources. Furthermore, the file system capacity and computing power can be
increased incrementally, rather than discarding the existing system resources to acquire a
higher performance and capacity system.
• Increased Reliability, Availability and Fault Tolerance: The existence of multiple
computing and storage resources in the distributed system makes it attractive and cost-effective to introduce redundancy in order to improve the system dependability and fault tolerance. The system can tolerate the failure of one computer by allocating its tasks to
another available computer. Furthermore, by replicating system functions, the system can
tolerate one or more component failures.
• Cost-Effectiveness: The performance of computers has been improving by
approximately 50% per year, while their cost has been decreasing by half every year during the
last decade [Patterson and Hennessy, 1994]. Furthermore, the emerging high speed
optical network technology will make the development of distributed systems attractive
in terms of price/performance ratio when compared to those of parallel computers. The
cost-effectiveness of distributed systems has contributed significantly to the failure of the
supercomputer industry to dominate the high performance computing market.
These advantages or benefits cannot be achieved easily because designing a general
purpose distributed computing system is several orders of magnitude more complex than
the design of centralized computing systems. The design of such systems is a
complicated process because of the many options that the designers must evaluate and
choose from such as the type of communication network and communication protocol,
the type of host-network interface, the distributed system architecture (e.g., pool, client-server, integrated, hybrid), the type of system level services to be supported (distributed
file service, transaction service, load balancing and scheduling, fault-tolerance, security,
etc.) and the type of distributed programming paradigms (e.g., data model, functional
model, message passing or distributed shared memory). Below is a list of the main issues
that must be addressed.
• Lack of good understanding of distributed computing theory. The field is relatively
new and to overcome that we need to experiment with and evaluate all possible
architectures to design general purpose reliable distributed systems. Our current approach
to designing such systems is ad hoc, and we need to develop a system-engineering theory before we can master the design of such systems. Mullender
compared the design of a distributed system to the design of a reliable national railway
system that took a century and a half to be fully understood and mature [Bagley, 1993].
Similarly, distributed systems (which have been around for approximately two decades)
need to evolve into several generations of different design architectures before their
designs, structures and programming techniques can be fully understood and mature.
• The asynchronous and independent behavior of the computers complicates the
control software that aims at making them operate as one centralized computing system.
If the computers are structured in a master-slave relationship, the control software is
easier to develop and system behavior is more predictable. However, this structure is in
conflict with the distributed system property that requires computers to operate
independently and asynchronously.
• The use of a communication network to interconnect the computers introduces
another level of complexity; distributed system designers need not only to master the
design of the computing systems and their software systems, but also to master the design
of reliable communication networks, how to achieve efficient synchronization and
consistency among the system processes and applications, and how to handle faults in a
system composed of geographically dispersed heterogeneous computers. The number of
computers involved in the system can vary from a few to hundreds or even hundreds of
thousands of computers.
In spite of these difficulties, there has been limited success in designing special purpose
distributed systems such as banking systems, on-line transaction systems, and point-of-sale systems. However, the design of a general purpose reliable distributed system that
has the advantages of both centralized systems (accessibility, management, and
coherence) and networked systems (sharing, growth, cost, and autonomy) is still a
challenging task [Stankovic, 1984]. Kleinrock [Tannenbaum, 1988] makes an interesting
analogy between the human-made computing systems and the brain. He points out that
the brain is organized and structured very differently from our present computing
machines. Nature has been extremely successful in implementing distributed systems
that are far cleverer and more impressive than any computing machines humans have yet
devised. We have succeeded in manufacturing highly complex devices capable of high-speed computation and massive, accurate memory, but we have not gained sufficient
understanding of distributed systems and distributed applications; our systems are still
highly constrained and rigid in their construction and behavior. The gap between natural
and man-made systems is huge and more research is required to bridge this gap and to
design better distributed systems.
Figure 1.1 An Example of a Distributed Computing System.
The main objective of this book is to provide a comprehensive study of the design
principles and architectures of distributed computing systems. We first present a
distributed system design framework to provide a systematic design methodology for
distributed systems and their applications. Furthermore, the design framework
decomposes the design issues into several layers to enable us to better understand the
architectural design issues and the available technologies to implement each component
of a distributed system. In addition to addressing the design issues and technologies for
distributed computing systems, we will also focus on those that will be viable to build the
next generations of wide area distributed systems (e.g. Grid and Autonomic computing
systems) as shown in Figure 1.1.
1.2 Characterization of Distributed Systems
Distributed systems have been referred to by many different names such as distributed
processing, distributed data processing, distributed multiple computer systems,
distributed database systems, network-based computing, cooperative computing, client-server systems, and geographically distributed multiple computer systems [Hwang and
Briggs, 1984]. Bagley [Kung, 1992] has reported 50 different definitions of distributed
systems. Other researchers find it acceptable to have many different definitions for
distributed systems and even warn against adopting one single definition of a distributed
system [Liebowitz and Carson, 1985; Bagley, 1993]. Furthermore, many different
methods have been proposed to define and characterize distributed computing systems
and distinguish them from other types of computing systems. In what follows, we present
the important characteristics and services that have been proposed to characterize and
classify distributed systems.
• Logical Property. The distributed system is defined as a collection of logical units
that are physically connected through an agreeable protocol for executing distributed
programs [Liebowitz and Carson, 1985]. The logical notion allows the system
components to interact and communicate without knowing their physical locations in the
system.
• Distribution Property. This approach emphasizes the distribution feature of a
distributed system. The word “distributed” implies that something has been spread out or
scattered over a geographically dispersed area. At least four physical components of a
computing system can be distributed: 1) hardware or processing logic, 2) data, 3) the
processing itself, and 4) the control. However, a classification using only the distribution
property is not sufficient to define distributed systems and many existing computing
systems can satisfy this property. Consider a collection of terminals attached to a
mainframe or an I/O processor within a mainframe. A definition that is based solely on
the physical distribution of some components of the system does not capture the essence
of a distributed system. A proper definition must also take into consideration component
types and how they interact.
• Distributed System Components. Enslow [Comer, 1991] presents a “research and
development” definition of distributed systems that identifies five components of such a
system. First, the system has a multiplicity of general-purpose resource components,
including both hardware and software resources, that can be assigned to specific tasks on
a dynamic basis. Second, there is a physical distribution of the hardware and software
resources of the system that interact through a communications network. Third, there is a high-level operating system that unifies and integrates the control of the distributed system.
Fourth, there is system transparency that permits services to be requested by name only. Lastly,
cooperative autonomy that characterizes the operation of both hardware and software
resources.
• Transparency Property. Other researchers emphasize the transparency property of
the system and the degree to which it looks like a single integrated system to users and
applications. Transparency is defined as the technique used to hide the separation from
both the users and the application programs so that the system is perceived as one single
system rather than a collection of computers. The transparency property is provided by
the software structure overlaying the distributed hardware. Tanenbaum and van Renesse
used this property to define a distributed system as one that looks to its users like an
ordinary centralized system, but runs on multiple independent computers [Bagley, 1993;
Halsall, 1992]. The authors of the ANSA Reference Manual [Borghoff, 1992] defined eight
different types of transparencies:
1. Access Transparency: This property allows local and remote files and other objects to
be accessed using the same set of operations.
2. Location Transparency: This property allows objects to be accessed without knowing
their physical locations.
3. Concurrency Transparency: This property enables multiple users or distributed
applications to run concurrently without any conflict; the users do not need to write any
extra code to enable their applications to run concurrently in the system.
4. Replication Transparency: This property allows several copies of files and
computations to exist in order to increase reliability and performance. These replicas are
invisible to the users or application programs. The number of redundant copies can be
selected dynamically by the system, or the user could specify the required number of
replicas.
5. Failure Transparency: This property allows the system to continue its operation
correctly in spite of component failures; i.e., it enables users and distributed applications
to run to completion in spite of failures in hardware and/or software components
without modifying their programs.
6. Migration Transparency: This property allows system components (processes,
threads, applications, files, etc.) to move within the system without affecting the
operation of users or application programs. This migration is triggered by the system
software in order to improve system and/or application desired goals (e.g., performance,
fault tolerance, security).
7. Performance Transparency: This property provides the system with the ability to
dynamically balance its load and schedule the user applications (processes) in a
transparent manner to the users in order to optimize the system performance and/or
application performance.
8. Scaling Transparency: This property allows the system to be expanded or shrunk
without changing the system structure or modifying the distributed applications
supported by the system.
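To make the transparency properties above more concrete, the following minimal Python sketch (not from the text; the name service, paths, and URL are hypothetical) illustrates access and location transparency: the caller uses one uniform read operation and never learns whether the object is stored locally or on a remote server.

```python
# Illustrative sketch of access and location transparency (hypothetical names).
import urllib.request

# Hypothetical name service: logical names map to local paths or remote URLs.
NAME_SERVICE = {
    "reports/q1": "/var/data/reports/q1.txt",          # local replica
    "reports/q2": "http://fileserver.example/q2.txt",  # remote copy
}

def read_object(logical_name: str) -> bytes:
    """Access transparency: one operation, regardless of the object's location."""
    location = NAME_SERVICE[logical_name]
    if location.startswith("http://"):
        with urllib.request.urlopen(location) as resp:  # remote access
            return resp.read()
    with open(location, "rb") as f:                      # local access
        return f.read()

# The application never deals with physical placement (location transparency):
# data = read_object("reports/q1")
```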
In this book, we assume that a distributed computing system's resources may include a wide
range of computing resources such as workstations, PC's, minicomputers, mainframes,
supercomputers, and other special purpose hardware units. The underlying network
interconnecting the system resources can span LAN's, MAN's and even WAN's, can have
different topologies (e.g., bus, ring, full connectivity, random interconnect, etc.), and can
support a wide range of communication protocols. In high performance distributed
computing environments, computers communicate and cooperate with latency and
throughput comparable to that experienced in tightly coupled parallel computers. Based
on these properties, we define a distributed system as a networked (loosely coupled)
system of independent computing resources with adequate software structure to enable
the integrated use of these resources toward a common goal.
1.3 Evolution of Distributed Computing Systems
Distributed computing systems have been evolving for more than two decades and this
evolution can be described in terms of four generations of distributed systems: Remote
Execution Systems (RES), Distributed Computing Systems (DCS), High
Performance Distributed Systems (Grid computing systems), and Autonomic Computing.
Each generation can be distinguished by the type of computers, the communications
networks, the software environments and applications that are typically used in that
generation.
1.3.1 Remote Execution Systems (RES): First Generation
The first generation spans the 1970's era, a time when the first computer networks were
being developed. During this era, computers were large and expensive. Even
minicomputers would cost tens of thousands of dollars. As a result, most organizations
had only a handful of computers that were operated independently from one another and
were located in one centralized computing center. The computer network concepts were
first introduced around the mid 1970s and the initial computer network research was
funded by the federal government. For example, the Defense Advanced Research
Projects Agency (DARPA) funded many pioneering research projects in packet
switching, including the ARPANET. The ARPANET used conventional point-to-point
leased line interconnection. Experimental packet switching over radio networks and
satellite communication channels were also conducted during this period. The
transmission rate of the networks was slow (typically in the 2400 to 9600 bit per second
(bps) range). Most of the software available to the user for information exchange was in
providing the capability of terminal emulation and file transfer. Consequently, most of
the applications were limited to remote login capability, remote job execution and remote
data entry.
1.3.2 Distributed Computing Systems (DCS): Second Generation
This era spans approximately the 1980s, where significant advances occurred in the
computing, networking and software resources used to design distributed systems. In this
period, the computing technology introduced powerful microcomputer systems capable
of providing computing power comparable to that of minicomputers and even
mainframes at a much lower price. This made microcomputers attractive to design
distributed systems that have better performance, reliability, fault tolerance, and
scalability than centralized computing systems.
Likewise, network technology improved significantly during the 1980s. This was
demonstrated by the proliferation and the availability of high-speed local area networks
that were operating at 10 and 100 million bits per second (Mbps) (e.g., Ethernet and
FDDI). These systems allowed dozens, even hundreds, of computers that varied from
mainframes, minicomputers, and workstations to PCs to be connected such that
information could be transferred between machines in the millisecond range. Wide
area network speeds remained slower than LAN speeds, typically in the 56 Kbps to 1.54
Mbps range.
During this period, most of the computer systems were running the Berkeley UNIX
operating system (referred to as BSD UNIX) that was developed at the University of
California at Berkeley as the Berkeley Software Distribution. The Berkeley Software Distribution became
popular because it was integrated with the UNIX operating system and also offered more
than the basic TCP/IP protocols. For example, in addition to standard TCP/IP
applications (such as ftp, telnet, and mail), BSD UNIX offered a set of utilities to run
programs or copy files to or from remote computers as if they were local (e.g., rcp, rsh,
rexec). Using the Berkeley socket interface, it was possible to build distributed application
programs that ran over several machines. However, the users had to be aware
of the activities on each machine; that is, where each machine was located, and how to
communicate with it. No transparency or support was provided by the operating system;
the responsibility for creating, downloading and running the distributed application was
solely performed by the programmer.
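The following minimal Python sketch (sockets are shown in Python rather than the period's C, purely for brevity; the host name and port are hypothetical) illustrates the kind of explicit, location-aware programming the socket interface required: the programmer must decide which machine runs the server and hard-code how to reach it.

```python
# Illustrative sketch of explicit, non-transparent socket programming.
import socket

SERVER_HOST, SERVER_PORT = "compute-node-1.example", 5000  # assumed values

def run_server() -> None:
    """Echo server: the programmer starts this by hand on the chosen machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("", SERVER_PORT))
        srv.listen(1)
        conn, _addr = srv.accept()
        with conn:
            data = conn.recv(1024)
            conn.sendall(data)            # echo the request back

def run_client(message: bytes) -> bytes:
    """Client: must know the server's address; no transparency is provided."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((SERVER_HOST, SERVER_PORT))
        cli.sendall(message)
        return cli.recv(1024)
```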
However, between 1985 and 1990, significant progress in distributed computing software
tools was achieved. In fact, many of the current message passing tools were introduced
during this period, such as Parallel Virtual Machine (PVM) developed at Oak Ridge
National Laboratory, ISIS developed at Cornell University, and Portable Programming
for Parallel Processing (P4), just to name a few. These tools have contributed
significantly to the widespread use of distributed systems and their applications. With these
tools, a large number of specialized distributed systems and applications were deployed
in the office, medical, engineering, banking and military environments. These distributed
applications were commonly developed based on the client-server model.
1.3.3 High-Performance Distributed Systems: Third Generation
This generation spans the 1990's, the decade in which parallel and distributed computing were unified
into one computing environment that we refer to in this book as a high performance
distributed system. The emergence of high-speed networks and the evolution of processor
technology and software tools have hastened the proliferation and the development of high
performance distributed systems and their applications.
The existing computing technology has introduced processor chips that are capable of
performing billions of floating point operations per second (Gigaflops) and are swiftly
moving towards the trillion floating point operation per second (Teraflops) goal. A
current parallel computer like the IBM ASCI White Pacific Computer at Lawrence
Livermore National Laboratory in California can compute 7 trillion math operations a
second. Comparable performance can now be achieved in high performance distributed
systems. For example, a Livermore cluster containing 2,304 2.4-GHz Intel Pentium Xeon
processors has a theoretical peak speed of 11 trillion floating-point operations per
second. In an HPDS environment, the computing resources will include several types of
computers (supercomputers, parallel computers, workstations and even PC's) that
collectively execute the computing tasks of one large-scale application.
Similarly, the use of fiber optics in computer networks has stretched the transmission
rates from 64 Kilobits per second (Kbps) in the 1970s to over 100 Gigabits per second
(Gbps), as shown in Figure 1.2. Consequently, this has resulted in a significant reduction
in the transmission time of data between computers. For example, it took 80 seconds to
transmit 1 Mbyte of data over a 100 Kbps network, but it now takes only 8 milliseconds to
towards the standardization of terabit networks will make high-speed networks attractive
in the development of high performance distributed systems.
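The transmission-time figures above follow directly from dividing the message size in bits by the link rate; the short sketch below (an illustration only, assuming a 1 Mbyte message as in the example) reproduces them.

```python
# A quick check of the arithmetic above: serialization (transmission) delay is
# the message size in bits divided by the link rate in bits per second.
def transmission_time(size_bytes: float, rate_bps: float) -> float:
    """Seconds needed to clock size_bytes onto a link operating at rate_bps."""
    return size_bytes * 8 / rate_bps

MBYTE = 1_000_000  # bytes
print(transmission_time(MBYTE, 100e3))   # 1 Mbyte over 100 Kbps -> 80.0 seconds
print(transmission_time(MBYTE, 1e9))     # 1 Mbyte over 1 Gbps  -> 0.008 s (8 ms)
```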
Figure 1.2 Evolution of network technology. The figure plots network bandwidth (Mbit/sec) against time: roughly 10 Mbps Ethernet and Token Rings around 1980-1985, 100 Mbps FDDI and DQDB around 1990, gigabit networks around 2000, and terabit (DWDM) networks at 1 Tbit/sec in 2002-2003 and beyond.
The software tools used to develop HPDS applications make the underlying computing
resources, whether they are parallel or distributed, transparent to the application
developers. The same parallel application can run without any code modification in
HPDS. Software tools generally fall into three groups on the basis of the service they
provide to the programmer. The first class attempts to hide the parallelism from the user
completely. These systems consist of parallelizing and vectorizing compilers that exploit
the parallelism presented by loops and have been developed mainly for vector computers.
The second approach uses shared memory constructs as a means for applications to
interact and collaborate in parallel. The third class requires the user to explicitly write
parallel programs using message passing. During this period, many tools were
developed to assist users in developing parallel and distributed applications at each
stage of the software development life cycle.
The potential application examples include parallel and distributed computing, national
multimedia information servers (e.g., national or international multimedia yellow pages
servers), video-on-demand, and computer imaging, just to name a few [Reed and
Fujimoto, 1987]. Real-time distributed applications require extremely high bandwidth
and place strict requirements on the magnitude and variance of network delay. Another
important class of applications that require high
performance distributed systems is the National Grand Challenge problems. These
problems are characterized by massive data sets and complex operations that exceed the
capacity of current supercomputers. Figure 1.3 shows the computing and storage
requirements for the candidate applications for high performance distributed systems.
Figure 1.3 Computing and Storage Requirements of HPDS Applications. The figure plots memory capacity (from roughly 100 MB up to 16 TB) against system speed (from 10 Gflops in 1991 to tens of Tflops in the 2005-2009 timeframe) for candidate applications such as 72-hour weather forecasting, vehicle signature analysis, 3D plasma modeling, crash/fire safety, component deterioration modeling, structural biology, pharmaceutical design, chemical dynamics, vision, quantum chromodynamics, semiconductor and superconductor modeling, viscous fluid dynamics, ocean circulation, vehicle dynamics, fluid turbulence, the Human Genome, global change, 3D forging/welding, governmental filing, nuclear applications, and neuroscience applications.
1.3.4 Autonomic Computing Systems: Fourth Generation
The autonomic computing concept was introduced in the early 2000s by IBM
[www.research.ibm.com/autonomic]. The basic approach is to build computing systems
that are capable of managing themselves; systems that can anticipate their workloads and adapt
their resources to optimize their performance. This approach has been inspired by the
human autonomic nervous system, which has the ability to self-configure, self-tune and even
self-repair without any conscious human involvement. The resources of autonomic
systems include a wide range of computing systems and wireless and Internet devices. The
applications touch all aspects of our life, such as
education, business, government and defense. The field is still in its infancy and it is
expected to play an important role in defining the next era of computing.
Table 1.1 summarizes the main features that characterize computing and network
resources, the software support, and applications associated with each distributed system
generation.
Table 1.1 Evolution of Distributed Computing Systems

Remote Execution Systems (RES)
   Computing resources: Mainframes and minicomputers; centralized; few and expensive
   Networking resources: Packet-switched networks; slow WANs (2400-9600 bps); few networks
   Software/Applications: Terminal emulation; remote login; remote data entry

Distributed Computing Systems (DCS)
   Computing resources: Workstations and PCs, mainframes, minicomputers; distributed; not expensive
   Networking resources: Fast LANs and MANs (100 Mbps, FDDI, DQDB, ATM); fast WANs (1.5 Mbps); large number of networks
   Software/Applications: Network File System (NFS); message-passing tools (PVM, P4, ISIS); on-line transaction systems (airline reservation, on-line banking)

High-Performance Distributed Systems (HPDC)
   Computing resources: Workstations and PCs, parallel/supercomputers; fully distributed; explosive number
   Networking resources: High-speed LANs, MANs, and WANs; ATM, Gigabit Ethernet; explosive number
   Software/Applications: Fluid turbulence; climate modeling; video-on-demand; parallel/distributed computing

Autonomic Computing
   Computing resources: Computers of all types (PCs, workstations, parallel/supercomputers), cellular phones, and Internet devices
   Networking resources: High-speed LANs, MANs, and WANs; ATM, Gigabit Ethernet; wireless and mobile networks
   Software/Applications: Business applications, Internet services, scientific, medical and engineering applications
1.4 Promises and Challenges of High Performance Distributed Systems
The proliferation of high performance workstations and the emergence of high-speed
networks (terabit networks) have attracted a lot of interest in high performance
distributed computing. The driving forces towards this end will be (1) the advances in
processing technology, (2) the availability of high speed networks and (3) the increased
research directed towards the development of software support and programming
environments for distributed computing. Further, with the increasing requirements for
computing power and the diversity in the computing requirements, it is apparent that no
single computing platform will meet all these requirements. Consequently, future
computing environments need to adaptively and effectively utilize the existing
heterogeneous computing resources. Only high performance distributed systems provide
the potential of achieving such an integration of resources and technologies in a feasible
manner while retaining the desired usability and flexibility. Realization of this potential
requires advances on a number of fronts: processing technology, network technology,
and software tools and environments.
1.4.1 Processing Technology
Distributed computing relies to a large extent on the processing power of the individual
nodes of the network. Microprocessor performance has been growing at a rate of 35 to 70
percent per year during the last decade, and this trend shows no indication of slowing down in the
current decade. The enormous power of the future generations of microprocessors,
however, cannot be utilized without corresponding improvements in the memory and I/O
systems. Research in main-memory technologies, high-performance disk arrays, and
high-speed I/O channels is therefore critical to efficiently utilize the advances in
processing technology and to develop cost-effective high performance
distributed computing.
1.4.2 Networking Technology
The performance of distributed algorithms depends to a large extent on the bandwidth
and latency of communication among the network nodes. Achieving high bandwidth and
low latency involves not only fast hardware, but also efficient communication protocols
that minimize the software overhead. Developments in high-speed networks will, in the
future, provide gigabit bandwidths over local area networks as well as wide area
networks at a moderate cost, thus increasing the geographical scope of high
performance distributed systems.
The problem of providing the required communication bandwidth for distributed
computational algorithms is now relatively easy to solve, given the mature state of fiber-optic and opto-electronic device technologies. Achieving the necessary low latencies,
however, remains a challenge. Reducing latency requires progress on a number of fronts:
First, current communication protocols do not scale well to a high-speed environment.
To keep latencies low, it is desirable to execute the entire protocol stack, up to the
transport layer, in hardware. Second, the communication interface of the operating
system must be streamlined to allow direct transfer of data from the network interface to
the memory space of the application program. Finally, the speed of light (approximately
5 microseconds per kilometer) poses the ultimate limit to latency.
In general, achieving low latency requires a two-pronged approach:
1. Latency Reduction: Minimize protocol-processing overhead by using streamlined
protocols executed in hardware and by improving the network interface of the operating
system.
2. Latency Hiding: Modify the computational algorithm to hide latency by pipelining
communication and computation.
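As a rough illustration of the second point, the following Python sketch (the fetch_block and compute functions are hypothetical stand-ins for a remote receive and a local computation) pipelines communication and computation: while block i is being processed, the fetch of block i+1 is already in flight, so communication latency is hidden behind useful work.

```python
# Illustrative sketch of latency hiding by pipelining communication and computation.
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_block(i: int) -> bytes:          # stands in for a remote receive
    time.sleep(0.05)                       # simulated network latency
    return bytes(1024)

def compute(block: bytes) -> int:          # stands in for local computation
    return sum(block)

def pipelined(n_blocks: int) -> int:
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_block, 0)              # prefetch first block
        for i in range(n_blocks):
            block = pending.result()                       # wait for block i
            if i + 1 < n_blocks:
                pending = pool.submit(fetch_block, i + 1)  # overlap next fetch...
            total += compute(block)                        # ...with this computation
    return total
```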
These problems are now perhaps most fundamental to the success of high-performance
distributed computing, a fact that is increasingly being recognized by the research
community.
1.4.3 Software Tools and Environments
The development of high performance distributed applications is a non-trivial process
and requires a thorough understanding of the application and the architecture. Although
an HPDS provides the user with enormous computing power and a great deal of
flexibility, this flexibility implies increased degrees of freedom that have to be
optimized in order to fully exploit the benefits of the distributed system. For example,
during software development, the developer is required to select the optimal hardware
configuration for the particular application, the best decomposition of the problem on the
selected hardware configuration, the best communication and synchronization strategy to
be used, etc. The set of reasonable alternatives that have to be evaluated in such an
environment is very large, and selecting the best alternative among them is a non-trivial task. Consequently, there is a need for a set of simple and portable software
development tools which can assist the developer in appropriately distributing the
application computations to make efficient use of the underlying computing resources.
Such a set of tools should span the software life-cycle and must support the developer
during each stage of application development starting from the specification and design
formulation stages through the programming, mapping, distribution, scheduling phases,
tuning and debugging stages up to the evaluation and maintenance stages.
1.5 Summary
Distributed systems applications and deployment have been growing at a fast pace to
cover many fields, such as education, industry, finance, medicine and military.
Distributed systems have the potential to offer many benefits when compared to
centralized computing systems that include increased performance, reliability and fault
tolerance, extensibility, cost-effectiveness, and scalability. However, designing
distributed systems is more complex than designing a centralized system because of the
asynchronous behavior and the complex interaction of their components, heterogeneity,
and the use of a communication network for their information exchange and interactions.
Distributed systems provide designers with many options to choose from and poor
designs might lead to poorer performance than centralized systems.
Many researchers have studied distributed systems and used different names and features
to characterize them. Some researchers used the logical unit concept to organize and
characterize distributed systems, while others used multiplicity, distribution, or
transparency. However, there is a growing consensus to define a distributed system as a
collection of resources and/or services that are interconnected by a communication
network and these resources and services collaborate to provide an integrated solution to
an application or a service.
Distributed systems have been around for approximately three decades and have been
evolving since their inception in the 1970s. One can describe their evolution in terms of
four generations: Remote Execution Systems, Distributed Computing Systems, High
Performance Distributed Systems, and Autonomic Computing. Autonomic Computing
Systems and High performance distributed systems will be the focus of this book. They
utilize efficiently and adaptively a wide range of heterogeneous computing resources,
networks and software tools and environments. These systems will change their
computing environment dynamically to provide the computing, storage, and connectivity
for large scale applications encountered in business, finance, health care, scientific and
engineering fields.
1.6 PROBLEMS
1. Many claims have been attributed to distributed systems since their inception.
Enumerate all these claims and then explain which of these claims can be achieved
using the current technology and which ones will be achieved in the near future, and
which will not be possible at all.
2. What are the features or services that could be used to define and characterize any
computing system? Use these features or properties to compare and contrast the
computing systems built on the basis of
• Single operating system in a parallel computer
• Network operating system
• Distributed system
• High performance distributed system.
3. Describe the main advantages and disadvantages of distributed systems when they are
compared with centralized computing systems.
4. Why is it difficult to design general-purpose reliable distributed systems?
5. What are the main driving forces toward the development of high performance
distributed computing environments? Describe four classes of applications that will be
enabled by high performance distributed systems. Explain why these applications
could not run on second-generation distributed systems.
6. What are the main differences between distributed systems and high performance
distributed systems?
7. Investigate the evolution of distributed systems and study their characteristics and
applications. Based on this study, can you identify the types of applications and any
additional features associated with each generation of distributed systems as
discussed in Section 1.3?
8. What are the main challenges facing the design and development of large-scale high
performance distributed systems that have 100,000 resources and/or services?
Discuss any potential techniques and technologies that can be used to address these
challenges.
References
1. Mullender, S., Distributed Systems, First Edition, Addison-Wesley, 1989.
2. Mullender, S., Distributed Systems, Second Edition, Addison-Wesley, 1993.
3. Patterson, D.A., and Hennessy, J.L., Computer Organization and Design: The Hardware/Software
Interface, Morgan Kaufmann Publishers, 1994.
4. Liebowitz B.H., and Carson, J.H., ``Multiple Processor Systems for Real-Time
Applications'', Prentice-Hall, 1985.
5. Umar, A., Distributed Computing, PTR Prentice-Hall, 1993.
6. Enslow, P.H., “What is a ‘Distributed’ Data Processing System?”, IEEE Computer,
January 1978.
7. Kleinrock, L., ``Distributed Systems'', Communications of the ACM, November
1985.
8. Lorin, H., ``Aspects of Distributed Computer Systems'', John-Wiley and Sons, 1980.
9. Tannenbaum, A.S., Modern Operating Systems, Prentice-Hall, 1992.
10. ANSA 1997, ANSA Reference Manual Release 0.03 (Draft), Alvey Advanced
Network Systems Architectures Project, 24 Hills Road, Cambridge CB2 1JP, UK.
11. Bell, G., ``Ultracomputer: A Teraflop Before its Time'', Communications of the ACM,
pp 27-47, August 1992.
12. Geist, A., ``PVM 3 User's Guide and Reference Manual'', Oak Ridge National
Laboratory, 1993.
13. Birman, K., and Marzullo, K., ``ISIS and the META Project'', Sun Technology, Summer
1989.
14. Birman, K., et al ISIS User Guide and Reference Manual, Isis Distributed Systems,
Inc, 111 South Cayuga St., Ithaca NY, 1992.
15. Spragins, J.D., Hammond, J.L., and Pawlikowski, K., ``Telecommunications
Protocols and Design'', Addison Wesley, 1991.
16. McGlynn, D.R., ``Distributed Processing and Data Communications'', John Wiley
and Sons, 1978.
17. Tashenberg, C.B., ``Design and Implementation of Distributed-Processing Systems'',
American Management Associations, 1984.
18. Hwang, K., and Briggs, F.A., ``Computer Architecture and Parallel Processing'',
McGraw-Hill, 1984.
19. Halsall, F., ``Data Communications, Computer Networks and Open Systems'', Third
Edition, Addison-Wesley, 1992.
20. Danthine, A., and Spaniol, O., ``High Performance Networking, IV'', International
Federation for Information Processing, 1992.
21. Borghoff, U.M., ``Catalog of Distributed File/Operating Systems'', Springer-Verlag,
1992.
22. LaPorta, T.F., and Schwartz, M., ``Architectures, Features, and Implementations of
High-Speed Transport Protocols'', IEEE Network Magazine, May 1991.
23. Kung, H.T., ``Gigabit Local Area Networks: A Systems Perspective'', IEEE Communications Magazine, April 1992.
24. Comer, D.E., Internetworking with TCP/IP, Volume I, Prentice-Hall, 1991.
25. Tannenbaum, A.S., Computer Networks Prentice-Hall, 1988.
26. Coulouris, G.F., and Dollimore, J., Distributed Systems: Concepts and Design, Addison-Wesley, 1988.
27. Bagley, ``Dont't have this one'', October 1993.
28. Stankovic, J.A., ``A Perspective on Distributed Computer Systems'', IEEE
Transactions on Computers, December 1984.
29. Andrews, G., ``Paradigms for Interaction in Distributed Programs'', Computing
Surveys, March 1991.
30. Chin, R. S. Chanson, ``Distributed Object Based Programming Systems'', Computing
Surveys, March 1991.
31. The Random House College Dictionary, Random House, 1975.
32. Shatz, S., Development of Distributed Software, Macmillan, 1993.
33. Jain, N. and Schwartz, M. and Bashkow, T. R., ``Transport Protocol Processing at
GBPS Rates'', Proceedings of the SIGCOMM Symposium on Communication
Architecture and Protocols, August 1990.
34. Reed, D.A., and Fujimoto, R.M., ``Multicomputer Networks Message-Based Parallel
Processing'', MIT Press, 1987.
35. Bach, M.J., ``The Design and Implementation of the UNIX Operating System'',
Prentice-Hall, 1986.
36. Ross, `` An overview of FDDI: The Fiber Distributed Data Interface,'' IEEE Journal
on Selected Areas in Communications, pp. 1043--1051, September 1989.
Chapter 2
Distributed System Design Framework
Objective of this chapter:
Chapter two will present a design methodology model of distributed systems
to simplify the design and the development of such systems. In addition, we
provide an overview of all the design issues and technologies that can be used
to build distributed systems.
Key Terms
Network, protocol, interface, Distributed system design, WAN, MAN, LAN, LPN,
DAN, Circuit switching, Packet Switching, Message Switching, Server Model, Pool
Model, Integrated Model, Hybrid Model.
2.1 Introduction
In Chapter 1, we have reviewed the main characteristics and services provided by distributed
systems and their evolution. It is clear from the previous chapter that there is a lot of confusion
about what constitutes a distributed system, its main characteristics and services, and its design. In
this chapter, we present a framework
that can be used to identify the design principles and the
technologies to implement the components of any distributed computing system. We refer to this
framework as the Distributed System Design Model (DSDM). Generally speaking, the design
process of a distributed system involves three main activities: 1) designing the communication
network that enables the distributed system computers to exchange information, 2) defining the
system structure (architecture) and the services that enable multiple computers to act and behave as
a system rather than a collection of computers, and 3) defining the distributed programming
techniques to develop distributed applications. Based on this notion of the design process, the
Distributed System Design Model can be described in terms of three layers: (see Figure 2.1): 1)
Network, Protocol, and Interface (NPI) layer; 2) System Architecture and Services (SAS) layer;
and 3) Distributed Computing Paradigms (DCP) layer. In this chapter, we describe the
functionality and the design issues that must be taken into consideration during the design and
implementation of each layer. Furthermore, we organize the book chapters into three parts where
each part corresponds to one layer in the DSDM.
Figure 2.1. Distributed System Design Model. The model consists of three layers: the Distributed Computing Paradigms layer (computation models: functional parallel and data parallel; communication models: message passing and shared memory), the System Architecture and Services (SAS) layer (architecture models and system-level services), and the Network, Protocol and Interface layer (networks and communication protocols).
The communication network, protocol and interface (NPI) layer describes the main
components of the communication system that will be used for passing control and
information among the distributed system resources. This layer is decomposed into
three sub-layers: Network Types, Communication Protocols, and Network
Interfaces.
The distributed system architecture and service layer (SAS) defines the architecture
and the system services (distributed file system, concurrency control, redundancy
management, load sharing and balancing, security service, etc.) that must be
supported in order for the distributed system to behave and function
as if it were a single image computing system.
The distributed computing paradigms (DCP) layer represents the programmer (user)
perception of the distributed system. This layer focuses on the programming
paradigms that can be used to develop distributed applications. Distributed
computing paradigms can be broadly characterized based on the computation and
communication models. Parallel and distributed computations can be described in
terms of two paradigms: Functional Parallel and Data Parallel paradigms. In
the functional parallel paradigm, different computations are assigned to different computers.
In the data parallel paradigm, all the computers perform the same function, in the Same
Program Multiple Data (SPMD) style, but each operates on a different
data stream. One can also characterize parallel and distributed computing based on
the techniques used for inter-task communications into two main models: Message
Passing and Distributed Shared Memory models. In message passing paradigm,
tasks communicate with each other by messages, while in distributed shared
memory they communicate by reading/writing to a global shared address space.
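As a minimal illustration of the data parallel, message-passing style just described, the sketch below assumes the mpi4py package and an MPI launcher (e.g., mpiexec -n 4 python spmd_sum.py, where the script name is hypothetical); every process runs the same program on its own slice of the data, and the partial results are combined with an explicit reduction message.

```python
# Illustrative SPMD sketch: same program on every rank, different data slices.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N = 1_000_000                                      # one global problem...
# ...but each rank works only on its own slice of the index space.
partial = sum(i * i for i in range(rank, N, size))

# Explicit message passing combines the partial results on rank 0.
total = comm.reduce(partial, op=MPI.SUM, root=0)

if rank == 0:
    print("sum of squares 0..N-1:", total)
```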
In the following subsections, we describe the design issues and technologies
associated with each layer in the DSDM.
2.2 Network, Protocol and Interface
The first layer (from the bottom-up) in the distributed system design model
addresses the issues related to designing the computer network, communications
protocols, and host interfaces. The communication system represents the underlying
infrastructure used to exchange data and control information among the logical and
physical resources of the distributed system. Consequently, the performance and the
reliability of distributed system depend heavily on the performance and reliability
of the communication system.
Traditionally distributed computing systems have relied entirely on local area
networks to implement the communication system. Wide area networks were not
considered seriously because of their high-latency and low-bandwidth. However,
the current emerging technology has changed that completely. Currently, WANs
operate at terabit-per-second (Tbps) transmission rates, as shown in Figure 1.2.
A communication system can be viewed as a collection of physical and logical
components that jointly perform the communication tasks. The physical
components (network devices) transfer data between the host memory and the
communication medium. The logical components provide services for message
assembly and/or de-assembly, buffering, formatting, routing and error checking.
Consequently, the design of a communication system involves defining the
resources required to implement the functions associated with each component. The
physical components determine the type of computer network to be used (LAN's,
MAN's, WAN's), type of network topology (fully connected, bus, tree, ring,
mixture, and random), and the type of communication medium (twisted pair,
coaxial cables, fiber optics, wireless, and satellite), and how the host accesses the
network resources. The logical components determine the type of communication
services (packet switching, message switching, circuit switching), type of
information (data, voice, facsimile, image and video), management techniques
(centralized and/or distributed), and type of communication protocols.
The NPI layer discusses the design issues and network technologies available to
implement the communication system components using three sub-layers: 1)
Network Type, which discusses the design issues related to implementing the physical
computer network, 2) Communication Protocols, which discusses communication
protocol designs and their impact on distributed system performance, and 3) Host
Network Interface, which discusses the design issues and techniques used to implement
computer network interfaces.
2.2.1 Network Type
A computer network is essentially any system that provides communication
between two or more computers. These computers can be in the same room, or can
be separated by several thousands of miles. Computer networks that span large
geographical distances are fundamentally different from those that span short
distances. To help characterize the differences in capacity and intended use,
communications networks are generally classified according to the distance into
five categories: 1) Wide Area Network (WAN), 2) Metropolitan Area Network
(MAN), 3) Local Area Network (LAN), 4) Local Peripheral Network (LPN), and 5)
Desktop Area Network (DAN).
Wide area networks (WANs): WANs are intended for use over large distances that
could include several national and international private and/or public data networks.
There are two types of WANs: Packet switched and Circuit Switched networks.
WANs used to operate at slower speeds (e.g., 1.54 Mbps) than LAN technologies
and have high propagation delays. However, the recent advances in fiber optical
technology, wavelength division multiplexing technology and the wide deployment
of fiber optics to implement the backbone network infrastructure have made their
transmission rates higher than the transmission rates of LANs. In fact, WAN
capacity is now approaching petabit-per-second (Pbps) transmission rates.
Metropolitan area networks (MANs): MANs span intermediate distances and
operate at medium-to-high speeds. As the name implies, a MAN can span a large
metropolitan area and may or may not use the services of telecommunications
carriers. MANs introduce less propagation delay than WANs and their transmission
rates range from 56Kbps to 100 Mbps.
Local area networks (LANs): LANs are normally used to interconnect computers and
different types of data terminal equipment within a single building, a group of
buildings, or a campus area. LANs provide the highest speed connections (e.g., 100
Mbps, 1 Gbps) between computers because they cover shorter distances than those
covered by WANs and MANs. Most LANs use a broadcast communication medium
where each packet is transmitted to all the computers in the network.
Local peripheral networks (LPNs): LPNs can be viewed as special types of LANs
[Tolmie and Tanlawy, 1994; Stallings et al, 1994] that cover the area of a room or
a laboratory. An LPN is mainly used to connect the peripheral devices (disk drives,
tape drives, etc.) with the computers located in that room or laboratory.
Traditionally, input/output devices have been confined to one computer system. However,
the use of high speed networking standards (e.g., HIPPI and Fibre Channel) to
implement LPNs has enabled remote access to input/output devices.
Desktop area networks (DANs): The DAN is another interesting concept that aims at
replacing the proprietary bus within a computer with a standard network connecting
all the components (memory, network adapter, camera, video adapter, sound
adapter, etc.). The DAN concept is becoming even more
important with the latest developments in palm computing devices; palm
computers do not need huge amounts of memory or sound/video capabilities,
since all of these can be provided by servers connected to the palm devices
through a high speed communication link.
Network Topology
The topology of a computer network can be divided into five types: bus, ring, hub,
fully connected and random as shown in Figure 2.2.
Figure 2.2 Different Types of Network Topologies: (a) bus network, (b) ring network, (c) hub-based (switched) network, (d) fully connected network, (e) random network.
In a bus-based network, the bus is time-shared among the computers connected to
the bus. The control of the bus is either centralized or distributed. The main
limitation of the bus topology is its scalability; when the number of computers
sharing the bus becomes large, the contention increases significantly, leading to
unacceptable communication delays. In a ring network, the computers are
connected using point-to-point communication links that form a closed loop. The
main advantages of the ring topology include a simplified routing scheme, fast
connection setup, a cost proportional to the number of interfaces, and high
throughput [Weitzman, 1980; Halsall, 1992]. However, the main limitation of the
ring topology is its reliability, which can be improved by using double rings. In a
hub-based or switched-based network, there is one central routing switch that
connects an incoming message on one of its input links to its destination through
one of the switch output links. This topology can be made hierarchical where a
slave switch can act as a master switch for another cluster and so on. With the rapid
deployment of switched-based networks (e.g., Gigabit Ethernet), this topology is
expected to play an important role in designing high performance distributed
systems. In a fully connected network, every computer can reach any other computer
in one hop. However, the cost is prohibitive, especially when the number of
computers to be connected is large. A random network is a topology that combines
the other types, resulting in an ad-hoc topology.
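A quick calculation makes the scalability remarks concrete: a ring needs only one point-to-point link per computer, while a fully connected network needs a dedicated link for every pair of computers. The sketch below is a simple illustration, not from the text.

```python
# Wiring cost of two topologies discussed above.
def ring_links(n: int) -> int:
    return n                      # point-to-point links forming a closed loop

def fully_connected_links(n: int) -> int:
    return n * (n - 1) // 2       # one link for each pair of computers

for n in (8, 64, 1024):
    print(n, ring_links(n), fully_connected_links(n))
# 1024 computers: 1,024 ring links versus 523,776 dedicated links fully connected.
```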
Network Service
Computer networks can also be classified according to the switching mechanism
used to transfer data within the network. These switching mechanisms can be
divided into three basic types: Circuit switching, Message switching and Packet
switching.
Circuit Switching: Conceptually, circuit switching is similar to the service offered
by telephony networks. The communication service is performed in three phases:
connection setup, data transmission, and connection release. Circuit-switched
connections are reliable and deliver data in the order it was sent. The main
advantage of circuit switching is its guaranteed capacity; once the circuit is
established, no other network activity is allowed to interfere with the transmission
activity and thus cannot decrease the capacity of the circuit. However, the
disadvantages of circuit switching are the cost associated with circuit setup and
release and the low utilization of network resources.
Message Switching: In a message switching system, the entire message is
transmitted along a predetermined path between source and destination computers.
The message moves in a store-and-forward manner from one computer to another
until it reaches its destination. The message size is not fixed and could vary from
a few kilobytes to several megabytes. Consequently, the intermediate communication
nodes must have enough storage capacity to store the entire message while it is being
routed to its destination. Message switching could result in long delays when the
network traffic is heavy and consists of many long messages. Furthermore, the
resource utilization is inefficient and it provides limited flexibility to adjust to
fluctuations in network conditions [Weitzman, 1980].
Packet Switching: In this approach, messages are divided into small, fixed-size
pieces, called packets, that are multiplexed onto the communication links. A
packet, which usually contains only a few hundred bytes of data, is divided into two
parts: a header part and a data part. The header carries routing and control
information that is used to identify the source computer, packet type, and the
destination computer; this service is similar to the postal service. Users place mail
packages (packets) into the network nodes (mailboxes) that identify the source and
the destination of the package. The postal workers then use whatever paths they
deem appropriate to deliver the package. The actual path traveled by the package is
not guaranteed. Like the postal service, a packet-switched network uses best-effort
delivery. Consequently, there is no guarantee that the packet will ever be delivered.
Also there are typically several intermediate nodes between the source and the
destination that will store and forward the packets. As a result, the packets sent from
one source may not take the same route to the destination, may not be
delivered in their transmission order, and may even be duplicated. The main
advantage of packet switching is that the communication links are shared by all the
network computers, thus improving the utilization of the network resources. The
disadvantage is that as network activity increases, each machine sharing the
connection receives less of the total connection capacity, which results in a slower
communication rate. Furthermore, there is no guarantee that the packets will be
received in the same order of their transmission or without any error or duplication.
The main difference between circuit switching and packet switching is that in
circuit switching there is no need for intermediate network buffers. Circuit switching
provides a fast technique for transmitting large volumes of data, while packet switching is
useful for transmitting small data blocks among an arbitrary number of geographically dispersed
users. Another variation of these services is virtual circuit switching, which
combines both packet and circuit switching in its service. The communication is
done by first establishing the connection, transferring data, and finally
disconnecting the connection. However, during the transmission phase, the data is
transferred as small packets with headers that define only the virtual circuit these
packets are using. In this service, we have the advantages of both circuit switching
and packet switching. In fact, virtual circuit switching is the service adopted in
ATM networks, where packets are referred to as cells, as will be discussed in
Chapter 3.
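To make the packet format concrete, the following Python sketch (an illustration
only; the names Packet, packetize, and reassemble are invented for this example
and are not part of any standard) shows a packet with a header part and a data
part, a function that splits a message into fixed-size packets, and a function that
restores the original order at the destination using the sequence numbers. The
optional virtual-circuit identifier corresponds to the virtual circuit switching
service described above.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Packet:
    # Header part: routing and control information.
    source: str             # source computer address
    destination: str        # destination computer address
    seq: int                # sequence number, used to restore order at the receiver
    vc_id: Optional[int]    # virtual-circuit identifier (None for a pure datagram service)
    # Data part.
    payload: bytes

def packetize(message: bytes, src: str, dst: str,
              size: int = 512, vc_id: Optional[int] = None) -> List[Packet]:
    # Divide a message into fixed-size pieces that can be multiplexed onto the links.
    return [Packet(src, dst, i, vc_id, message[off:off + size])
            for i, off in enumerate(range(0, len(message), size))]

def reassemble(packets: List[Packet]) -> bytes:
    # Packets may arrive out of order; sort by sequence number before joining.
    return b"".join(p.payload for p in sorted(packets, key=lambda p: p.seq))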
2.2.2 Communication Protocols
A protocol is a set of precisely defined rules and conventions for communication
between two parties. A communication protocol defines the rules and conventions
that will be used by two or more computers on the network to exchange
information. In order to manage the complexity of the communication software, a
hierarchy of software layers is commonly used for its implementation. Each layer of
the hierarchy is responsible for a well-defined set of functions that can be
implemented by a specific set of protocols. The Open Systems Interconnection
(OSI) reference model, which was proposed by the International Organization for
Standardization (ISO), has seven layers, as shown in Figure 2.3. In what follows, we
briefly describe the functions of each layer of the OSI reference model from the
bottom-up [Jain, 1993].
Figure 2.3 The OSI reference model: the application, presentation, and session
layers form the application component; the transport layer forms the transport
component; and the network, data link, and physical layers form the network
component.
Physical Layer: It is concerned with transmitting raw bits over a communication
channel. The physical layer recognizes only individual bits and cannot recognize
characters or multi-character frames. The design issues here largely deal with
mechanical, electrical, and procedural interfaces and the physical transmission medium.
The physical layer consists of the hardware that transmits sequences of binary data
by analog or digital signaling, using electric, light, or
electromagnetic signals.
Data Link Layer: It defines the functional and procedural methods to transfer data
between two neighboring communication nodes. This layer includes mechanisms to
deliver data reliably between two adjacent nodes, to group bits into frames, and to
synchronize the data transfer in order to control the flow of bits from the physical
layer. In local area networks, the data link layer is divided into two sub-layers: the
medium access control (MAC) sub-layer, which defines how to share the single
physical transmission medium among multiple computers, and the logical link
control (LLC) sub-layer, which defines the protocol to be used to achieve error control
and flow control. LLC protocols can be either bit-oriented or character-oriented.
However, most of the networks use bit-oriented protocols [Tanenbaum, 1988].
Network Layer: The network layer addresses the routing scheme to deliver packets
from the source to the destination. This routing scheme can be either static (the
routing path is determined a priori) or dynamic (the routing path is determined
based on network conditions). Furthermore, this layer provides techniques to
prevent and remove congestion once it occurs in the network; congestion occurs
when some nodes receive more packets than they can process and route. In wide
area networks, where the source and destination computers could be interconnected
by different types of networks, the network layer is responsible for internetworking;
that is, converting packets from one network format to another. In a single local
area network with broadcast medium, the network layer is redundant and can be
eliminated since packets can be transmitted from any computer to any other
computer by just one hop [Coulouris and Dollimore, 1988]. In general, the network
layer provides two types of services to the transport layer: connection-oriented and
connectionless services. The connection-oriented service uses circuit switching,
while the connectionless service uses packet switching.
Transport Layer: It is an end-to-end layer that allows two processes running on two
remote computers to exchange information. The transport layer provides efficient,
reliable, and cost-effective communication services to the higher-level processes.
These services allow the higher-level layers to be developed independently of the
underlying network-technology layers. The transport layer has several critical
functions related to achieving reliable data delivery to the higher layer such as
detecting and correcting erroneous packets, delivering packets in order, and
providing a flow control mechanism. Depending on the type of computer network
being used, achieving these functions may or may not be trivial. For instance,
operating over a packet-switching network with widely varying inter-packet delays
presents a challenging task for efficiently delivering ordered data packets to the
user; in such a network, packets can experience excessive delays, which makes it
very difficult to decide on the cause of a delay. The delay could be caused by a
network failure or by network congestion. Resolving this ambiguity is the transport
protocol's task, and it can itself be time consuming.
The session, presentation and application layers form the upper three layers in the
OSI reference model. In contrast to the lower four layers, which are concerned with
providing reliable end-to-end communication, the upper layers are concerned with
providing user-oriented services. They take the error-free channel provided by the
transport layer and add features that are useful to a wide variety of user
applications.
Session Layer: It provides mechanisms for organizing and structuring dialogues
between application layer processes. For example, the user can select the type of
synchronization and control needed for a session such as alternate two-way or
simultaneous operations, establishment of major and/or minor synchronization
points and techniques for starting data exchange.
Presentation Layer: The main task of this layer focuses on the syntax to be used for
representing data; it is not concerned with the semantics of the data. For example, if
the two communicating computers use different data representation schemes, this
layer's task is to transform data from the format used by the source computer
into a standard data format before transmission. At the destination computer, the
received data is transformed from the standard format to the format used by the
destination computer. Data compression and encryption for network security are
issues of this layer as well [Tanenbaum, 1988; Coulouris and Dollimore, 1988].
Application Layer: This layer supports end-user application processes. This layer
contains service elements (protocols) to support application processes such as job
management function, file transfer protocol, mail service, programming language
support, virtual terminal, virtual file system, just to name a few.
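The layering idea can be illustrated with the following Python sketch, a highly
simplified illustration in which the header tags TH, NH, and DH are invented and do
not correspond to any real protocol. Each layer treats the data handed down from
the layer above as opaque, prepends its own header, and the peer layer at the
destination strips that header off.

def encapsulate(app_data: bytes) -> bytes:
    # Each layer treats what it receives from above as opaque data and
    # prepends its own (illustrative) header.
    transport = b"TH|" + app_data      # transport header: ports, sequence numbers
    network = b"NH|" + transport       # network header: source/destination addresses
    frame = b"DH|" + network           # data link header: MAC addresses, checksum
    return frame

def decapsulate(frame: bytes) -> bytes:
    # The receiving side strips the headers bottom-up, each layer removing
    # the header added by its peer layer on the sending side.
    for tag in (b"DH|", b"NH|", b"TH|"):
        assert frame.startswith(tag)
        frame = frame[len(tag):]
    return frame

assert decapsulate(encapsulate(b"hello")) == b"hello"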
2.2.3 Network Interfaces
The main function of the host-network interface is to transmit data from the host to the
network and deliver the data received from the network to the host. Consequently,
the host-network interface interacts with upper-layer software to perform functions
related to message assembly and disassembly, formatting, routing and error
checking. With the advances in processing and memory technology, these
communication functions can now be implemented in the hardware of the network
interface. A tradeoff is usually made regarding how these functions are going to be
distributed between the host and the network interface. The more functions
allocated to the network interface, the lower the load imposed on the host to perform
the communication functions; however, the cost of the network interface will
increase. The network interface can be a passive device used for temporarily storing
the received data. In this case, the network interface is under the control of the
processor that performs all the necessary functions to transfer the received message
to the destination remote process. A more sophisticated network interface can
execute most of the communication functions such as assembling complete message
packets, passing these packets to the proper buffers, performing flow control,
managing the transmission and reception of message packets, and interrupting the host when
the entire message has been received. In the coming chapters, we will discuss in
more detail the design issues in host-network interfaces.
2.3 Distributed System Architectures and Services
The main issues addressed in this layer are related to the system architecture and the
functions to be offered by the distributed system. The architecture of a distributed
system identifies the main hardware and software components of the system and
how they interact with each other to deliver the services provided by the system. In
addition to defining the system architecture and how its components interact, this
layer also defines the system services and functions that are required to run
distributed applications.
The architecture of a distributed system can be described in terms of several
architectural models that define the system structure and how the components
collaborate and interact with each other. The components of a distributed system
must be independent and be able to provide a significant service or function to the
system users and applications. In what follows, we describe the architectural models
and the system services and functions that should be supported by distributed
systems.
2.3.1 Architectural Models
The distributed system architectural models can be broadly grouped into four
models: Server Model, Pool Model, Integrated Model, and Hybrid Model [Coulouris
and Dollimore, 1988; Singhal, 1994].
Server Model
The majority of distributed systems that have been built so far are based on the
server model (which is also referred to as the workstation or the client/server
model). In this model each user is provided with a workstation to run the
application tasks. The need for workstations is primarily driven by the user
requirements of a high-quality graphical interface and guaranteed application
response time. Furthermore, the server model supports sharing the data between
users and applications (e.g., shared file servers and directory servers). The server
model consists of workstations distributed across a building or a campus and
connected by a local area network (see Figure 2.4). Some of the workstations could
be located in offices, and thus be tied to a single user, whereas others may be in
public areas where they are used by different users. In both cases, at any instant of time,
a workstation is either sitting idle or has a user logged into it.
Figure 2.4 Server Model: workstations connected by a network to a file server, a
printing server, and a computing server.
In this architecture, communication software is needed to enable the applications
running on the workstations to access the system servers. The term server refers to
application software that is typically running on a fast computer that offers a set of
services. Examples of such servers include compute engines, database servers,
authentication/authorization servers, gateway servers, or printers. For example, the
service offered by an authentication/authorization server is to validate user
identities and authorize access to system resources.
In this model, a client sends one or more requests to the server and then waits for a
response. Consequently, distributed applications are written as a combination of
clients and servers. The programming in this model is called synchronous
programming. The server can be implemented in two ways: single or concurrent
server. If the server is implemented as a single thread of control, it can support only
one request at a time; that is, a client request that finds the server busy must wait for
all earlier requests to complete before it can be processed. To avoid this
problem, important servers are typically implemented as concurrent servers; the
services are developed using multiple lightweight processes (which we refer to
interchangeably as threads) in order to process several requests concurrently. In a
concurrent server, after a request is received, a new thread (child thread) is created
to perform the requested service, whereas the parent thread keeps listening on the
same port for the next service request. It is important to note that the client
machine participates significantly in the computations performed in this model; that
is, the computations are not all done at the server, with the workstations acting
merely as input/output devices.
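As an illustration, the following Python sketch shows a minimal concurrent server
built with the standard socket and threading modules; the port number and the
trivial "service" (returning the request in upper case) are arbitrary choices made
for the example, not part of any particular system.

import socket
import threading

def handle_client(conn):
    # Child thread: serve one request on this connection, then close it.
    with conn:
        request = conn.recv(1024)          # read the client's request
        conn.sendall(request.upper())      # a trivial service: echo in upper case

def serve(host="0.0.0.0", port=9000):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen()
        while True:
            conn, _addr = srv.accept()     # parent thread keeps listening on the port
            threading.Thread(target=handle_client,
                             args=(conn,), daemon=True).start()   # child thread serves it

if __name__ == "__main__":
    serve()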
Pool Model
An alternative approach to organize distributed system resources is to construct a
processor pool. The processor pool can be a rack full of CPUs or a set of computers
that are located in a centralized location. The pool resources are dynamically
allocated to user processes on demand. In the server model, the processing powers
of idle workstations cannot be exploited or used in a straightforward manner.
However, in the processor pool model, a user process is allocated as many CPUs or
computing resources as it needs, and when that process finishes, all of its
computing resources are returned to the pool so other processes can use them. There
is no concept of ownership here; all the processors belong equally to every process
in the system. Consequently, the processor pool model does not need the additional
load-balancing software that is required in the server model to improve
system utilization and performance, especially when the number of computing
resources is large; when the number is large, the probability of finding computers
that are idle or lightly loaded is typically high.
In the pool model, programs are executed on a set of computers managed as a
processor service. Users are provided with terminals or low-end workstations that
are connected to the processor pool via a computer network as shown in Figure 2.5.
Figure 2.5 Processor Pool Model: terminals and workstations connected by a
network to a pool of processing resources (processor array, multicomputer,
supercomputer, and servers).
The processor pool model provides a better utilization of resources and increased
flexibility, when compared to the server model. In addition, programs developed
for centralized systems are compatible with this model and can be easily adapted.
Finally, processor heterogeneity can be easily incorporated into the processor pool.
The main disadvantages of this model are the increased communication between the
application program and the terminal, and the limited capabilities provided by the
terminals. However, the wide deployment of high speed networks (e.g., Gigabit
Ethernet) makes remote access to the processor pool resources (e.g.,
supercomputers, high speed specialized servers) attractive and cost-effective.
Furthermore, the introduction of hand-held computers (palm devices, cellular phones, etc.) will
make this model even more important; we can view the hand-held computers as
terminals, with most of the computations and services (e.g., those of Application Service
Providers) provided by the pool resources.
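The on-demand allocation and release of pool resources can be sketched with
Python's standard concurrent.futures module; the function user_job and the pool
size below are hypothetical stand-ins for the jobs submitted from terminals and the
processors in the pool, not a model of any specific system.

from concurrent.futures import ProcessPoolExecutor

def user_job(job_id, data):
    # A hypothetical compute-intensive task submitted from a thin terminal.
    return job_id, sum(data)

if __name__ == "__main__":
    # The pool owns the processors; a job borrows them on demand, and the
    # resources return to the pool automatically when the job finishes,
    # so there is no notion of ownership by any one user.
    with ProcessPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(user_job, i, range(i * 1_000)) for i in range(20)]
        for f in futures:
            print(f.result())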
Integrated Model
The integrated model brings many of the advantages of using networked resources
and centralized computing systems to distributed systems by allowing users to
access different system resources in a manner similar to that used in a centralized,
single-image, multi-user computing system. In this model each computer is
provided with appropriate software so it can perform both the server and the client
roles. The system software located in each computer is similar to the operating
system of a centralized multi-user computing system, with the addition of
networking software.
In the integrated model, the set of computing resources forming the distributed
system are managed by a single distributed operating system that makes them
appear to the user as a single computer system, as shown in Figure 2.6. The
individual computers in this model have a high degree of autonomy and run a
complete set of standard software. A global naming scheme that is supported
across the distributed system allows individual computers to share data and files
without regard to their location. The computing and storage resources required to
run user applications or processes are determined at runtime by the distributed
operating system such that the system load is balanced and certain system
performance requirements are achieved. However, the main limitation of this
approach is the requirement that the user processes across the whole system must
interact using only one uniform software system (that is the distributed operating
system). As a result, this approach requires that the distributed operating system be
ported to every type of computer available in this system. Further, existing
applications must be modified to support and interoperate with the services offered
by the distributed operating system. This requirement limits the scalability of this
approach for developing distributed systems with a large number of heterogeneous logical
and physical resources.
Figure 2.6 Integrated Model: workstations, servers, a supercomputer, and terminals
(connected through a concentrator) interconnected by a network.
Hybrid Model
This model can be viewed as a collection of two or more of the architectural models
discussed above. For example, the server and pool models can be used to organize
the access and the use of the distributed system resources. The Amoeba system is an
example of such a system. In this model, users run interactive applications on their
workstations to improve user response time while other applications run on several
processors taken from the processor pool. By combining these two models, the
hybrid model has several advantages: providing the computing resources needed for
a given application, parallel processing of user tasks on the pool's processors, and the
ability to access the system resources from either a terminal or a workstation.
2.3.2 System Level Services
The design of a distributed computing environment can follow two approaches: top-down
or bottom-up. The first approach is desirable when the functions and the
services of a distributed system are well defined. It is typically used when
designing special-purposed distributed applications. The second approach is
desirable when the system is built using existing computing resources running
traditional operating systems. The structure of the existing operating systems (e.g.,
Unix) is usually designed to support a centralized time-sharing environment and
does not support the distributed computing environment. An operating system is the
software that provides the functions that allow resources to be shared between tasks,
and provides a level of abstraction above the computer hardware that facilitates the
use of the system by user and applications programs. However, the required
system-level services are greater in functionality than might normally exist in an
operating system. Therefore, a new set of system-wide services must be added on
top of the individual operating systems in order to efficiently run distributed
applications. Examples of such services include distributed file system, load
balancing and scheduling, concurrency control, redundancy management, security
service, just to name a few. The distributed file system allows the distributed system
users to transparently access and manipulate files regardless of their locations. The
load scheduling and balancing service distributes the load across the overall
system resources such that the load of the system is well balanced. The
concurrency control allows concurrent access to the distributed system resources as
if they were accessed sequentially (serializable access). Redundancy management
addresses the consistency and integrity issues of the system resources when some
system files or resources are redundantly distributed across the system to improve
performance and system availability. The security service involves securing and
protecting the distributed system services and operations by providing the proper
authentication, authorization, and integrity schemes.
In Part II chapters, we will discuss in detail the design and implementation issues of
these services.
2.4 Distributed Computing Paradigms
In the first layer of the distributed system design model, we address the issues
related to designing the communication system, while in the second layer, we
address the system architecture and the system services to be supported by a
distributed system. In the third layer, we address the programming paradigms and
communication models needed to develop parallel and distributed applications. The
distributed computing paradigms can be classified according to two models:
Computation and Communication models. The computation model defines the
programming model to develop parallel and distributed applications while the
communication model defines the techniques used by processes or applications to
exchange control and data information. The computation model describes the
techniques available to users to decompose a given distributed application into tasks
and run them concurrently. In broad terms, there are two computing models: Data
Parallel, and Functional Parallel. The communication model can be broadly
grouped into two types: Message Passing, and Shared Memory. The underlying
communication system can support either one or both of these paradigms. However,
supporting one communication paradigm is sufficient to support the other;
message passing can be implemented using shared memory and vice versa. The
types of computing and communication paradigms used determine the types of
distributed algorithms that can efficiently run a given distributed
application; what is good for a message passing model is not necessarily
good when it is implemented using a shared memory model.
2.4.1 Computation Models
Functional Parallel Model
In this model, the computers involved in a distributed application execute different
threads of control or tasks, and interact with each other to exchange information and
synchronize the concurrent execution of their tasks. Different terms have been used
in the literature to describe this type of parallelism such as control parallelism, and
asynchronous parallelism. Figure 2.7(a) shows the task graph of a distributed
application with five different functions (F1-F5). If this application is programmed
based on the functional parallel model and run on two computers, one can allocate
functions F1 and F3 to computer 1 and functions F2, F4 and F5 to computer 2 (see
Figure 2.7(b)). In this example, the two computers must synchronize their
executions such that computer 2 can execute function F5 only after functions F2
and F4 have been completed and computer 2 has received the partial results from
computer 1. In other words, the parallel execution of these functions must be
serializable; that is the parallel execution of the distributed application produces
identical results to the sequential execution of this application [Casavant, et al,
1996; Quinn, 1994].
Figure 2.7 (a) A block of a task with five functions, (b) Functional Parallel Model,
and (c) SPMD Data Parallel Model.
Another variation to the functional parallel model is the host-node programming
model. In this model, the user writes two programs: the host and node programs.
The host program controls and manages the concurrent execution of the application
tasks by downloading the node program to each computer as well as the required
data. In addition, the host program receives the results from the node program. The
node program contains most of the compute-intensive tasks of the application. The
number of computers that will run the node program is typically determined at
runtime.
In general, parallel and distributed applications developed based on the functional
parallel model might suffer from race conditions and load imbalance; this
occurs because task completion times depend on many variables such as the task
size, the type of computer used, the available memory size, the current load on the
communication system, and so on. Furthermore, the amount of parallelism that can be
supported by this paradigm is limited by the number of functions associated with
the application. The performance of a distributed application can be improved by
decomposing the application functions into smaller functions; whether this is feasible depends on the
type of application and the available computing and communication resources.
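The synchronization required by the functional parallel model can be illustrated
with the following Python sketch, which mimics a task graph like the one in Figure
2.7; the functions F1-F5 and their bodies, and the assumed dependence of F5 on the
results of F3 and F4, are placeholders for this example, and a thread pool with two
workers stands in for the two computers.

from concurrent.futures import ThreadPoolExecutor

def F1(): return 1          # placeholder functions standing in for the
def F2(): return 2          # application tasks of the task graph
def F3(a): return a + 10
def F4(b): return b + 20
def F5(c, d): return c * d  # assumed to need the partial results of F3 and F4

with ThreadPoolExecutor(max_workers=2) as pool:   # two workers ~ two computers
    f1 = pool.submit(F1)
    f2 = pool.submit(F2)                          # F1 and F2 run concurrently
    f3 = pool.submit(F3, f1.result())             # .result() blocks: a synchronization point
    f4 = pool.submit(F4, f2.result())
    print(F5(f3.result(), f4.result()))           # F5 runs only after F3 and F4 complete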
Data Parallel Model
In the data parallel model, which is also referred to as synchronous model, the
entire data set is partitioned among the computers involved in the execution of a
distributed application such that each computer is assigned a subset of the whole
data set [Hillis and Steele, 1986]. In this model, each computer runs the same
program but each operates on a different data set, referred to as Single Program
Multiple Data (SPMD). Figure 2.7(c) shows how the distributed application shown
in Figure 2.7(a) can be implemented using the data parallel model. In this case, every
computer executes the five functions associated with this application, but each
computer operates on different data sets.
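A minimal SPMD sketch in Python is shown below; the node_program function and
the way the data set is partitioned are illustrative choices for this example, not
part of any particular system. Every worker runs the same program on its own
subset of the data, and the partial results are combined at the end.

from multiprocessing import Pool

def node_program(chunk):
    # Single Program: every worker runs this same function.
    # Multiple Data: each worker gets its own subset of the data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n = 4                                          # number of workers ("computers")
    chunks = [data[i::n] for i in range(n)]        # partition the data set
    with Pool(processes=n) as pool:
        partials = pool.map(node_program, chunks)  # run the node program on every subset
    print(sum(partials))                           # combine the partial results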
The data parallel model has been argued for favorably by some researchers because it can
be used to solve a large number of important problems. It has been shown that the
majority of real applications can be solved using the data parallel model [Fox, Williams
and Messina, 1994]. Furthermore, it is easier to develop applications based on the data
parallel paradigm than on the functional parallel paradigm. In
addition, the amount of parallelism that can be exploited in the functional parallel
model is fixed and is independent of the size of the data sets, whereas in the data
parallel model, the data parallelism increases with the size of the data [Hatcher and
Quinn, 1991]. Other researchers favor the functional parallel model, since large-scale
applications can be mapped naturally onto the functional paradigm.
In summary, both the functional and data parallelism in a given large distributed
application need to be exploited efficiently in order to achieve a high
performance distributed computing environment.
2.4.2 Distributed Communications Models
Message Passing Model
The Message Passing model uses a micro-kernel (or a communication library) to
pass messages between local and remote processes as well as between processes
and the operating system. In this model, messages become the main technique for
all interactions between a process and its environment, including other processes. In
this model, application developers need to be explicitly involved in writing the
communication and synchronization routines required for two remote processes or
tasks to interact and collaborate on solving one application. Depending on the
relationship between the communicating processes, one can identify two types of
message passing paradigms: peer-to-peer message passing, and master-slave
message passing. In peer-to-peer message passing, any process can
communicate with any other process in the system. This type is usually referred to simply as
message passing. In the master-slave type, communication occurs only between
the master and the slave processes, as in the remote procedure call paradigm. In
what follows, we briefly describe these two types of message passing.
In the peer-to-peer message passing model, two basic communication
primitives, SEND and RECEIVE, are available to the users. However, there are
many different ways to implement the SEND and RECEIVE primitives. This
depends on the required type of communication between the source and destination
processes: blocking or non-blocking, synchronous or asynchronous. The main
limitation of this model is that the programmers must consider many issues while
writing a distributed program, such as synchronizing request and response messages,
handling data representations (especially when heterogeneous computers are
involved in the transfer), managing machine addresses, and handling system failures
that could be caused by the communication network or by the computers [Singhal,
1994]. In addition to all of these issues, debugging and testing message passing
programs is difficult because their execution is time-dependent and the
system is asynchronous in nature.
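The following Python sketch illustrates blocking SEND and RECEIVE between two
processes over a point-to-point channel; the message contents are arbitrary, and
the standard multiprocessing.Pipe merely stands in for the communication library
or micro-kernel described above.

import multiprocessing as mp

def worker(chan):
    msg = chan.recv()               # blocking RECEIVE of the request
    chan.send(("ack", msg["op"]))   # SEND the reply back to the peer

if __name__ == "__main__":
    parent_end, child_end = mp.Pipe()               # a point-to-point message channel
    p = mp.Process(target=worker, args=(child_end,))
    p.start()
    parent_end.send({"op": "compute", "data": [1, 2, 3]})   # SEND a request message
    print(parent_end.recv())                                # blocking RECEIVE of the reply
    p.join()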
The remote procedure calls mechanism has been used to alleviate some of the
difficulties encountered in programming parallel and distributed applications. The
procedure call mechanism within a program is a well-understood technique to
transfer control and data between the calling and called programs. The RPC is an
extension of this concept to allow a calling program on one computer to transfer
control and data to the called program on another computer. The RPC system hides
all the details related to transferring control and data between processes and gives
them the illusion of calling a local procedure within a program. The remote
procedure call model provides a methodology for communication between the
client and server parts of a distributed application. In this model, the client requests
a service by making what appears to be a procedure call. If the relevant server is
remote, the call is translated into a message using the underlying RPC mechanism
and then sent over the communication network. The appropriate server receives the
request, executes the procedure and returns the result to the client.
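As a concrete illustration of the RPC idea, the sketch below uses Python's standard
xmlrpc modules; the exported procedure lookup, its data, and the port number are
invented for the example, and the server is started in a background thread here only
so the sketch runs in one process (normally the client and server run on different
computers). The client invokes what looks like a local procedure, and the RPC
machinery converts the call into a request message and returns the server's reply
as the result.

import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

def lookup(name):
    # A hypothetical service procedure exported by the server.
    return {"alice": "alice@example.org"}.get(name, "unknown")

# Server side: register the procedure and listen for incoming call messages.
server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(lookup, "lookup")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the proxy packages the call into a request message, sends it over
# the network, waits for the response, and returns it like a local procedure call.
proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
print(proxy.lookup("alice"))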
Shared Memory Model
In the message passing model, the communication between processes is controlled by a
protocol and involves explicit cooperation between processes. In the shared memory
model, communication is not explicitly controlled, but it requires the use of a global
shared memory. The two communication models can be compared using
the following analogies: message communication resembles the operation of a
postal service in sending and receiving mail. A simpler form of message
communication can be achieved using a shared mailbox scheme. On the other hand,
the shared memory scheme can be compared to a bulletin board, sometimes found
in a grocery store or in a supermarket where users post information such as ads for
merchandise or help wanted notices. The shared memory acts as a central repository
for existing information that can be read or updated by anyone involved.
Figure 2.8 Distributed Shared Memory Model: each computer (a CPU with its local
memory) attaches to the network, and the local memories collectively provide the
shared memory.
Most distributed applications have been developed based on the message passing
model. However, the current advances in networking and software tools have made
it possible to implement distributed applications based on shared memory model. In
this approach, a global virtual address space is provided such that processes or tasks
can use this address space to point to the location where shared data can be stored
or retrieved. In this model, application tasks or processes can access shared data by
just providing a pointer or an address regardless of the location of where the data is
stored. Figure 2.8 shows how a Distributed Shared Memory (DSM) system can be
built using the physical memory systems available in each computer.
The advantages of the DSM model include ease of programming, ease of transferring
complex data structures, no need for the data encapsulation required in the message
passing model, and portability (programs written for multiprocessor systems can be
ported easily to this environment) [Stumm and Zhou, 1990]. The main differences
between message passing and shared memory models can be highlighted as follows:
1) The communication between processes using shared memory model is simpler
because the communicated data can be accessed by performing reading operations
as if they were local. In the message passing system, a message must be passed
from one process to another, and many other issues must be considered in order to
transfer the inter-process messages efficiently, such as buffer management and
allocation, routing, flow control, and error control; and 2) The message passing
model is scalable and can support a large number of heterogeneous computers
interconnected by a variety of interconnection schemes, whereas the shared
memory model is not as scalable
because the complexity of the system increases significantly when the number of
computers involved in the distributed shared memory becomes large.
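The contrast with message passing can be illustrated by the following Python
sketch, in which several processes update a single counter placed in shared memory;
the counter, the number of workers, and the iteration count are arbitrary choices
for the example, and a lock provides the concurrency control needed to keep the
shared location consistent.

import multiprocessing as mp

def increment(counter, lock, n):
    for _ in range(n):
        with lock:                 # concurrency control on the shared location
            counter.value += 1     # read-modify-write on shared memory

if __name__ == "__main__":
    counter = mp.Value("i", 0)     # an integer placed in memory shared by all workers
    lock = mp.Lock()
    workers = [mp.Process(target=increment, args=(counter, lock, 1000))
               for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter.value)           # 4000: every worker saw and updated the same memory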
2.5 Summary
The field of distributed computing systems is relatively new, and as a result there is no
general consensus on what constitutes a distributed system and how to characterize
and design such computing systems. In this chapter, we have presented the
design issues of distributed systems in a three layer design model: 1) Network,
Protocol, and Interface (NPI) layer, 2) System Architecture and Services (SAS)
layer, and 3) Distributed Computing Paradigms (DCP) layer. Each layer defines the
design issues and technologies that can be used to implement the distributed system
components of that layer.
The NPI layer addresses the main issues encountered during the design of the
communication system. This layer is decomposed into three sub-layers: Networks,
Communication Protocols and Network Interfaces. Each sub-layer denotes one
important communication component (subsystem) required to implement the
distributed system communication system. The SAS layer represents the designers,
developers, and system managers’ view of the system. It defines the main
components of the system, system structure or architecture, and the system level
services required to develop distributed computing applications. Consequently, this
layer is decomposed into two sub-layers: architectural models and system level
services. The architectural models describe the structure that interconnects the main
components of the system and how they perform their functions. These models can
be broadly classified into four categories: server model, pool model, integrated
model, and hybrid model. The majority of distributed systems that are currently in
use or under development are based on the server model (which is also referred to
as workstation or client/server model). The distributed system level services could
be provided by augmenting the basic functions of an existing operating system.
These services should support global system state or knowledge, inter-process
communication, distributed file service, concurrency control, redundancy
management, load balancing and scheduling, fault tolerance and security.
The Distributed Computing Paradigm (DCP) layer represents the programmer
(user) perception of the distributed system. It focuses on the programming models
that can be used to develop distributed applications. The design issues of this layer can be
classified into two models: Computation and Communication Models. The
computation model describes the mechanisms used to implement the computational
tasks associated with a given application. These mechanisms can broadly be
described by two models: Functional Parallel, and Data Parallel. The
communication models describe how the computational tasks exchange information
during the application execution. The communication models can be grouped into
two types: Message Passing (MP) and Shared Memory (SM).
2.6 Problems
1. Explain the Distributed System Reference Model.
2. What are the main issues involved in designing a high performance distributed
computing system?
3. With the rapid deployment of ATM-based networks, hub-based networks are
expected to be widely used. Explain why these networks are attractive.
4. Compare the functions of data link layer with those offered by the transport
layers in the ISO OSI reference model.
5. Suppose you wanted to perform the task of finding all the primes in a list of
numbers using a distributed system.
• Develop three distributed algorithms, one based on each of the following
programming models, to find the prime numbers in the list: (i) functional
parallel, (ii) data parallel, and (iii) remote procedure call.
• Choose one of the algorithms you described in part-1 and show how this
algorithm can be implemented using each of the following two communication
models: (i) message passing and (ii) shared memory.
6. Compare the distributed system architectural models by showing their
advantages and disadvantages. For each architectural model, define the set of
applications that are most suitable for that model.
7. You are asked to design a distributed system lab that supports the computing
projects and assignments of computer engineering students. Show how the
Distributed System Reference Model can be used to design such a system.
References
1. Liebowitz B.H., and Carson, J.H., ``Multiple Processor Systems for Real-Time
Applications'', Prentice-Hall, 1985.
2. Weitzman, Cay. ``Distributed micro/minicomputer systems: structure,
implementation, and application''. Englewood Cliffs, N.J.: Prentice-Hall, c1980.
3. Halsall, Fred. ``Data communications, computer networks and open systems''.
3rd ed. Addison-Wesley, 1992.
4. LaPorta, T.F., and Schwartz, M., ``Architectures, Features, and Implementations
of High-Speed Transport Protocols'', IEEE Network Magazine, May 1991.
5. Mullender, S., Distributed Systems, Second Edition, Addison-Wesley, 1993.
6. Coulouris, G.F., Dollimore, J., Distributed Systems: Concepts and Design,
Addison-Wesley, 1988.
7. Hillis, W. D. and Steele, G. Data parallel algorithms, Comm. ACM, 29:1170,
1986.
8. Hatcher, P. J., and Quinn, M. J., Data-Parallel Programming on MIMD
Computers. MIT Press, Cambridge, Massachusetts, 1991.
9. Singhal, Mukesh. ``Advanced concepts in operating systems : distributed,
database, and multiprocessor operating systems''. McGraw-Hill, c1994.
10. IBM, ``Distributed Computing Environment Understanding the Concepts''. IBM
Corp. 1993.
11. M. Stumm and S. Zhou, "Algorithms Implementing Distributed Shared
Memory", Computer, Vol. 23, No. 5, May 1990, pp. 54-64.
12. B. Nitzberg and V. Lo, "Distributed Shared Memory: A Survey of Issues and
Algorithms", Computer, Aug. 1991, pp. 52-60.
13. K. Li and P. Hudak, "Memory Coherence in Shared Virtual Memory Systems",
ACM Trans. Computer Systems, Vol.7, No. 4, Nov. 1989, pp. 321-359.
14. K. Li and R. Schaefer, "A Hypercube Shared Virtual Memory System", 1989
Inter. Conf. on Parallel Processing, pp. 125-132.
15. B. Fleisch and G. Popek, "Mirage : A Coherent Distributed Shared Memory
Design", Proc. 14th ACM Symp. Operating System Principles, ACM ,NY 1989,
pp. 211-223.
16. J. Bennett, J. Carter, and W. Zwaenepoel, "Munin: Distributed Shared Memory
Based on Type-Specific Memory Coherence", Proc. 1990 Conf. Principles and
Practice of Parallel Programming, ACM Press, New York, NY, 1990, pp. 168-176.
17. U. Ramachandran and M. Y. A. Khalidi, "An Implementation of Distributed
Shared Memory", First Workshop Experiences with building Distributed and
Multiprocessor Systems, Usenix Assoc., Berkeley, Calif., 1989, pp. 21-38.
18. M. Dubois, C. Scheurich, and F. A. Briggs, "Synchronization, Coherence, and
Event Ordering in Multiprocessors", Computer, Vol. 21, No. 2, Feb. 1988, pp.
9-21.
19. J. K. Bennet, "The Design and Implementation of Distributed Smalltalk", Proc.
of the Second ACM conf. on Object-Oriented Programming Systems,
Languages and Applications, Oct. 1987, pp. 318-330.
20. R. Katz, S. Eggers, D. Wood, C. L. Perkins, and R. Sheldon, "Implementing a
Cache Consistency Protocol", Proc. of the 12th Annu. Inter. Symp. on
Computer Architecture, June 1985, pp. 276-283.
21. P. Dasgupta, R. J. LeBlanc, M. Ahamad, and U. Ramachandran, "The Clouds
Distributed Operating System," IEEE Computer, 1991, pp. 34-44.
22. B. Fleisch and G. Popek, "Mirage: A Coherent Distributed Shared Memory
Design," Proc. 14th ACM Symp. Operating System Principles, ACM, New
York, 1989, pp. 211-223.
23. D. Lenoski et al., "The Directory-Based Cache Coherence Protocol for the DASH
Multiprocessor," Proc. 17th Int'l Symp. Computer Architecture, IEEE CS Press,
Los Alamitos, Calif., Order No. 2047, 1990, pp. 148-159.
24. R. Bisiani and M. Ravishankar, "Plus: A Distributed Shared-Memory System,"
Proc. 17th Int'l Symp. Computer Architecture, IEEE CS Press, Los
Alamitos, Calif., Order No. 2047, 1990, pp. 115-124.
25. J. Bennett, J. Carter, and W. Zwaenepoel, "Munin: Distributed Shared Memory
Based on Type-Specific Memory Coherence," Proc. 1990 Conf. Principles and
Practice of Parallel Programming, ACM Press, New York, N.Y., 1990, pp. 168-176.
26. D. R. Cheriton, "Problem-Oriented Shared Memory: A Decentralized Approach to
Distributed Systems Design," Proceedings of the 6th International Conference on
Distributed Computing Systems, May 1986, pp. 190-197.
27. Jose M. Bernabeu Auban, Phillip W. Hutto, M. Yousef A. Khalidi, Mustaque
Ahamad, William F. Appelbe, Partha Dasgupta, Richard J. LeBlanc, and
Umakishore Ramachandran, "Clouds: A Distributed, Object-Based Operating
System: Architecture and Kernel Implementation," European UNIX Systems User
Group Autumn Conference, EUUG, October 1988, pp. 25-38.
28. Francois Armand, Frederic Herrmann, Michel Gien, and Marc Rozier, "Chorus,
a New Technology for Building UNIX Systems", European UNIX Systems User
Group Autumn Conference, EUUG, October 1988, pp. 1-18.
29. G. Delp, ``The Architecture and Implementation of Memnet: A High-Speed
Shared Memory Computer Communication Network'', doctoral dissertation,
University of Delaware, Newark, Del., 1988.
30. Zhou et al., "A Heterogeneous Distributed Shared Memory," to be published in
IEEE Trans. Parallel and Distributed Systems.
31. Geoffrey C. Fox, Roy D. Williams, and Paul C. Messina. ``Parallel Computing
Works!''. Morgan Kaufmann, 1994.
32. D. E. Tolmie and A. Tantawy (ed.), "High Performance Networks – Technology and
Protocols", Norwell, Massachusetts, Kluwer Academic Publishers, 1994.
33. W. Stallings, “Advances in Local and Metropolitan Area Networks”, Los
Alamitos, California, IEEE Computer Society Press, 1994.
34. B.N.Jain, “Open Systems Interconnection: Its Architecture and Protocols”, New
York, McGraw-Hill, 1993.
35. T. L. Casavant et al. (eds.), "Parallel Computers: Theory and Practice", IEEE
Computer Society Press, 1996.
36. M. J. Quinn, "Parallel Computing: Theory and Practice", New York, McGraw-Hill, 1994.
37. A. S. Tanenbaum, Computer Networks, 2nd Edition, Prentice-Hall, 1988.
Chapter 3
Computer Communication Networks
Objective of this chapter:
The performance of any distributed system is significantly dependent on its
communication network. The performance of the computer network is even more
important in high performance distributed systems. If the network is slow or
inefficient, the distributed system performance can become worse than what can be
achieved using a single computer. The main objective of this chapter is to briefly
review the basic principles of computer network design and then focus on high
speed network technologies that will play an important role in the development
and deployment of high performance distributed systems.
Key Terms
LAN, MAN, WAN, LPN, Ethernet, CSMA/CD, FDDI, DQDB, ATM,
Infiniband, Wireless LAN
3.1 Introduction
Computer networking techniques can be classified based on the transmission
medium (fiber, copper, wireless, satellite, etc.), switching technique (packet
switching or circuit switching), or distance. The most widely used classification is the
one based on distance, which divides computer networks into four categories: Local
Area Networks (LAN), Metropolitan Area Networks (MAN), Wide Area
Networks (WAN), and Local Peripheral Networks (LPN). An LPN covers a
relatively short distance (tens or hundreds of meters) and is used mainly to
interconnect input/output subsystems to one or more computers. A LAN covers a
building or a campus area (a few kilometers) and usually has a simple topology
(bus, ring, or star). Most of the current distributed systems are designed using
LANs that operate at 10 to 1000 million bits per second (Mbps). A MAN covers a
larger distance than a LAN, typically a city or a region (around 100 kilometers).
LANs are usually owned and controlled by one organization, whereas MANs
typically use the services of one or more telecommunication providers. A WAN
can cover a whole country or one or more continents and it utilizes the network(s)
provided by several telecommunications carriers.
In this chapter, we briefly review each computer network type (LAN, MAN,
WAN and LPN) and then discuss in detail the high speed network technology
associated with each computer network type. The high speed networks to be
discussed in detail include Fiber Distributed Data Interface (FDDI), Distributed
Queue Dual Bus (DQDB), and Asynchronous Transfer Mode (ATM) networks.
A more detailed description of other types of computer networks can be found in
texts that focus mainly on computer networks [Tanenbaum, 1988; Stallings, 1995;
Strohl, 1991; Kalmanek, Kanakia and Keshav, 1990].
3.2 LOCAL AREA NETWORKS (LAN)
LANs typically support data transmission rates from 10 Mbps to gigabits per
second (Gbps). Since LANs are designed to span short distances, they are able to
transmit data at high rates with a very low error rate. The topology of a LAN is
usually simple and can be either a bus, ring or a star. The IEEE 802 LAN
standards shown in Figure 3.1 define the bottom two layers of the OSI Reference
Model: layer one is the physical layer, and layer two (the data link layer) is composed of the medium
access control (MAC) sub-layer and the logical link control (LLC) sub-layer.
Figure 3.1 IEEE 802 LAN Standards: 802.1 (overview, architecture, management,
and bridging), 802.2 (logical link control), and 802.10 (security and privacy) span
the data link layer; the MAC and physical layer standards include 802.3 (CSMA/CD),
802.4 (token bus), 802.5 (token ring), 802.6 (MAN), 802.9 (integrated voice and
data), 802.11 (wireless), and future extensions; 802.7 and 802.8 are the broadband
and fiber optic technical advisory groups.
The main IEEE LAN standards include Ethernet, Token Ring, and FDDI. Ethernet
is to date by far the most widely used LAN technology, whereas FDDI is a
mature high speed LAN technology.
3.2.1 ETHERNET
Ethernet (IEEE 802.3) is the name of a popular LAN technology invented at
Xerox PARC in the early 1970s. Xerox used an Ethernet-based network to
develop a distributed computing system in which users on workstations (clients)
communicated with servers of various kinds including file and print servers. The
nodes of an Ethernet are connected to the network cable via a transceiver and a
tap. The tap and transceiver make the physical and logical connection onto the
Ethernet cable. The transceiver contains logic, which controls the transmission
and reception of serial data to and from the cable.
Ethernet uses the Carrier Sense Multiple Access with Collision Detection (CSMA/CD)
protocol to control access to the cable [Bertsekas, 1995; Dalgic, Chien, and
Tobagi, 1994].
CSMA is a medium access protocol in which a node listens to the Ethernet
medium before transmitting. The probability of a collision is reduced because
the transmitting node transmits its message only after it finds that the
transmission medium is idle. When a node finds that the medium is inactive, it
begins transmitting after waiting a mandatory period to allow the network to
settle. However, because of the propagation delay, there is a finite probability that
two or more nodes on the Ethernet will simultaneously find the medium in an idle
state. Consequently, two (or more) transmissions might start at the same time,
resulting in a collision.
In CSMA/CD, collisions are detected by comparing the transmitted data with the
data received from the Ethernet medium to see if the message on the medium
matches the message being transmitted. The detection of a collision requires
the colliding nodes to retransmit their messages. Retransmission can occur at
random, or it can follow the exponential back-off rule: an algorithm that
generates a random time interval determining when each colliding
station can retransmit its message.
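The following Python sketch illustrates the truncated binary exponential back-off
rule in the spirit of IEEE 802.3; the slot time, the retry limit, and the try_transmit
callback are assumptions made for this example rather than a faithful model of an
Ethernet controller.

import random
import time

SLOT_TIME = 51.2e-6      # classic 10 Mbps Ethernet slot time, in seconds (assumption)

def backoff_slots(collisions, max_exp=10):
    # Truncated binary exponential back-off: after the k-th collision,
    # wait a random number of slot times drawn from [0, 2**min(k, max_exp) - 1].
    return random.randint(0, 2 ** min(collisions, max_exp) - 1)

def send_frame(try_transmit, max_attempts=16):
    # try_transmit() is an assumed callback: it returns True when the frame
    # went out without a collision and False when a collision was detected.
    for collisions in range(1, max_attempts + 1):
        if try_transmit():
            return True
        time.sleep(backoff_slots(collisions) * SLOT_TIME)   # wait, then retry
    return False            # excessive collisions: report a transmit error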
High Speed Ethernet Technologies
Recently, there has been an increased interest in the industry to revive Ethernet by
introducing switched fast Ethernet and Gigabit Ethernet.
Fast Ethernet is similar to Ethernet, only ten times faster. Unlike
other emerging High Speed network technologies, Ethernet has been installed for
over 20 years in business, government, and educational networks. Fast Ethernet
uses the same media access protocol (MAC) used in Ethernet (CSMA/CD
protocol). This makes the transition from Ethernet to fast Ethernet, as well as the
inter-networking between Ethernet and fast Ethernet, straightforward. Fast
Ethernet can work with unshielded twisted-pair cable and thus can be built upon
the existing Ethernet wire. That makes fast Ethernet attractive when compared to
other high speed networks such as FDDI and ATM that require fiber optic cables,
which will make the upgrade of existing legacy network to such high speed
network technologies costly. In designing the fast Ethernet MAC and to make it
inter-operate easily with existing Ethernet networks, the duration of each bit
transmitted is reduced by a factor of 10. Consequently, this results in increasing
the packet speed by 10 times when compared to Ethernet, while packet format
and length, error control, and management information remain identical to those
of Ethernet. However, the maximum distance between a computer and the fast
Ethernet hub/switch depends on the type of cable used and ranges between 100 and 400 m.
This High Speed network technology is attractive because it is the same Ethernet
technology, but 10 or 100 times faster. Also, fast Ethernet can be deployed as a
switching or shared technology as is in Ethernet. However, its scalability and the
maximum distance are limiting factors when compared with fiber-based High
Speed network technology (e.g., ATM technology).
Gigabit Ethernet (GigE) is a step further from Fast Ethernet. It supports Ethernet
speeds of 1 Gbps and above and uses Carrier Sense Multiple Access/Collision
Detection (CSMA/CD) as the access method, like Ethernet. It supports both
half-duplex and full-duplex modes of operation. It preserves the existing frame size of
64-1518 bytes specified by IEEE 802.3 for Ethernet. GigE offers speeds
comparable to ATM at a much lower cost and can support packets belonging to
time-sensitive applications in addition to video traffic. The IEEE 802.3z standard,
under development, will be the standard for GigE. However, GigE can support a range
of only 3-5 km.
3.2.2 FIBER DISTRIBUTED DATA INTERFACE (FDDI)
FDDI is a high speed LAN proposed in 1982 by the X3T9.5 committee of the
American National Standards Institute (ANSI). The X3T9.5 standard for FDDI describes a
dual counter-rotating ring LAN that uses a fiber optic medium and a token
passing protocol [Tanenbaum, 1988; Ross, 1989; Jain, 1991]. FDDI transmission
rate is 100 Mbps. The need for such a High Speed transmission rate has grown
from the need for a standard high speed interconnection between computers and
their peripherals. FDDI is suitable for front-end networks, typically within an office
or a building, providing interconnection between workstations, file servers,
database servers, and low-end computers. Also, the high throughput of an FDDI
network makes it an ideal network for building a high performance backbone that
bridges together several low speed LANs (Ethernet LANs, token rings, token buses).
Figure 3.2 FDDI ring structure showing the primary and secondary loops
FDDI uses optical fiber with light emitting diodes (LEDs) transmitting at a
nominal wavelength of 1300 nanometers. The total fiber path can be up to 200
kilometers (km) and can connect up to 500 stations separated by a maximum
distance of 2 km. An FDDI network consists of stations connected by duplex
optical fibers that form dual counter-rotating rings as shown in Figure 3.2. One of
the rings is designated as the primary ring and the other one as the secondary. In
normal operation data is transmitted on the primary ring. The secondary ring is
used as a backup to tolerate a single failure in the cable or in a station. Once a
fault is detected, a logical ring is formed using both the primary and secondary rings,
bypassing the faulty segment or station as shown in Figure 3.3.
Figure 3.3 FDDI Rings
FDDI Architecture and OSI Model
The FDDI protocol is mainly concerned with only the bottom two layers in the
OSI reference model: Physical Layer and Data Link Layer as shown in Figure 3.4.
The Physical layer is divided into a Physical (PHY) Protocol sub-layer and the
Physical Medium Dependent (PMD) sub-layer. The PMD sub-layer focuses on
defining the transmitting and receiving signals, specifying power levels and the
types of cables and connectors to be used in FDDI. The physical layer protocol
focuses on defining symbols, coding and decoding techniques, clocking
requirements, states of the links and data framing formats.
The Data Link Layer is subdivided into a Logical Link Control (LLC) sub-layer
(the LLC sub-layer is not part of the FDDI protocol specifications) and a Media
Access Control (MAC) sub-layer. The MAC sub-layer provides the procedures
needed for formatting frames, error checking, token handling and how each
station can address and access the network. In addition, a Station Management
(SMT) function is included in each layer to provide control and timing for each
FDDI station. This includes node configuration, ring initialization, connection and
error management. In what follows, we discuss the main functions and algorithms
used to implement each sub-layer in the FDDI protocol.
Figure 3.4 FDDI and the OSI Model
Physical Medium Dependent (PMD) Sub-Layer
The Physical Medium Dependent defines the optical hardware interface required
to connect to the FDDI rings. This sub-layer deals with the optical characteristics
such as the optical transmitters and receivers, the type of connectors to the media,
the type of optical fiber cable, and an optional optical bypass switch. The PMD sub-layer is
designed to operate with multimode optical fiber at a wavelength of 1300
nanometers (nm). The distance between two stations is limited to 2 kilometers in order
to guarantee proper synchronization and an allowable level of data-dependent jitter. When a
single mode fiber is used, a variation of the PMD sub-layer (SMF-PMD) should
be used. The single mode fiber extends the distance between two stations up to 60
kilometers. The PMD document aims to provide a link with a bit
error rate (BER) of 2.5 × 10⁻¹⁰ at the minimum received power level, and better
than 10⁻¹² when the power is 2 dB or more above the minimum received power
level.
Physical (PHY) Sub-Layer
The PHY sub-layer provides the protocols and optical hardware components that
support a link from one FDDI station to another. Its main functions are to define the
coding, decoding, clocking, and data framing that are required to send and receive
data on the fiber medium. The PHY layer supports duplex communications. That is,
it provides simultaneous transmission and receiving of data from and to the MAC
sub-layer. The PHY sub-layer receives data from MAC sub-layer at the data link
layer. It then encodes the data into 4B/5B code format before it is transmitted on the
fiber medium as shown in Figure 3.5. Similarly, the receiver receives the encoded
data from the medium, determines symbol boundaries based on the recognition of a
Start Delimiter, and then forwards decoded symbols to MAC sub-layer.
Data transmitted on the fiber is encoded in a 4-of-5 group code (4B/5B scheme), with
each group referred to as a symbol as shown in Figure 3.5. With 5-bit symbols,
there are 32 possible symbols: 16 data symbols, each representing 4 bits of ordered
binary data; 3 used for starting and ending delimiters; 2 used as control indicators; and
3 used for line-state signaling, which is recognized by the physical layer
hardware. The remaining 8 symbols are not used since they violate code run length
and DC balance requirements [Stallings, 1983]. The 4B/5B scheme is relatively
efficient on bandwidth since 100 Mbps data rate is transmitted on the fiber optic at
125 Mbps rate. This is better than the Manchester encoding scheme used in Ethernet;
a 10 Mbps data transmission rate is transmitted on the medium at 20 Mbps. After
the data is encoded into 4B/5B symbols, it is also translated using a Non-Return to
Zero Inverted (NRZI) code before it is sent to the optical fiber. The NRZI coding
scheme reduces the number of transitions of the transmitted data streams and thus
reduces the complexity of FDDI hardware components [Black-Emerging]. In NRZI
coding scheme, instead of sending a zero bit as a logic level low (absence of light for
optical medium), a zero is transmitted as the absence of a transition from low to high
or from high to low. A one is transmitted as a transition from low to high or high to
low. The advantage of this technique is that it eliminates the need for defining a
threshold level. A pre-defined threshold is susceptible to a drift in the average bias
of the signal. The disadvantage of NRZI encoding is the loss of the self-clocking
property that Manchester encoding provides. To compensate for this loss, a long preamble is
used to synchronize the receiver to the sender's clock.
Figure 3.5 FDDI 4B/5B Code Scheme
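To make the encoding path concrete, the following Python sketch illustrates how a byte stream could be mapped to 4B/5B symbols and then to NRZI line transitions. The symbol table is the standard 4B/5B data-symbol mapping (as used by FDDI); the function names and overall structure are illustrative only and not part of any FDDI specification.

# Illustrative sketch of FDDI-style 4B/5B encoding followed by NRZI.
# The data-symbol table is the standard 4B/5B mapping; function names
# are purely illustrative.

FOUR_B_FIVE_B = {
    0x0: "11110", 0x1: "01001", 0x2: "10100", 0x3: "10101",
    0x4: "01010", 0x5: "01011", 0x6: "01110", 0x7: "01111",
    0x8: "10010", 0x9: "10011", 0xA: "10110", 0xB: "10111",
    0xC: "11010", 0xD: "11011", 0xE: "11100", 0xF: "11101",
}

def encode_4b5b(data: bytes) -> str:
    """Map each 4-bit nibble to its 5-bit symbol (high nibble first)."""
    bits = []
    for byte in data:
        bits.append(FOUR_B_FIVE_B[byte >> 4])
        bits.append(FOUR_B_FIVE_B[byte & 0x0F])
    return "".join(bits)

def nrzi(code_bits: str, level: int = 0) -> str:
    """NRZI: a '1' is sent as a transition, a '0' as no transition."""
    out = []
    for bit in code_bits:
        if bit == "1":
            level ^= 1          # toggle the line level
        out.append(str(level))  # '0' keeps the previous level
    return "".join(out)

if __name__ == "__main__":
    payload = bytes([0xA5, 0x3C])
    symbols = encode_4b5b(payload)      # 10 code bits per payload byte
    line = nrzi(symbols)
    print(symbols, line, sep="\n")      # 125% of the payload bit count

Note how the example shows the 25% code overhead directly: 16 payload bits become 20 code bits on the medium, matching the 100 Mbps data rate carried at 125 Mbps.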
The clocking method used in FDDI is point-to-point; all stations transmit using
their local clocks. The receiving station decodes the received data by recognizing
that a bit 1 is received when the current bit is the complement of the
previous bit and a bit 0 when the current bit is the same as the previous bit. By
detecting the transitions in the received data, the receiver station can synchronize
its local clock with the transmitter clock. An elasticity buffer (EB) function is
used to adjust the slight frequency difference between the recovered clock and the
local station clock. The elasticity buffer is inserted between the receiver, which
supports a variable frequency clock to track the clock of the previous transmitting
station, and the transmitter at the receiver side, which runs on a fixed frequency
clock. The elasticity buffer in each station is reinitialized during the preamble
(PA), which precedes each frame or token. The transmitter clock has been chosen
with 0.005% stability. With an elasticity buffer of 10 bits, frames of up to 4500
bytes in length can be supported without exceeding the limit of the elasticity
buffer.
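As a rough check of these figures (assuming the 0.005% clock stability and 4500-byte maximum frame quoted above, together with 4B/5B coding), the worst-case bit slippage over one maximum-length frame can be estimated as:
\[
4500 \times 8 \times \tfrac{5}{4} = 45{,}000 \ \text{code bits per frame}, \qquad
\Delta f_{\max} \approx 2 \times 0.005\% = 10^{-4},
\]
\[
\text{slippage} \approx 45{,}000 \times 10^{-4} = 4.5 \ \text{bit times},
\]
which remains within a 10-bit elasticity buffer (roughly ±5 bits of slack around its center).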
Media Access Control (MAC) sub-layer
The MAC protocol is a timed token ring protocol similar to the IEEE standard
802.5. The MAC sub-layer controls the transmission of data frames on the ring.
The formats of the data and token frames are shown in Figure 3.6. The preamble
field is a string of 16 or more non-data symbols that are used to re-synchronize
the receiver's clock to the received frame. The frame control field contains
information such as whether the frame is synchronous or asynchronous and
whether 16- or 48-bit addresses are used. The ring network must support 16- and 48-
bit addresses as well as a global broadcast feature to all stations. The frame check
field is a 32-bit cyclic redundancy check (CRC) for the fields. The frame status
indicates whether the frame was copied successfully, an error was detected and/or
address was recognized. It is used by the source station to determine successful
completion of the transmission.
Figure 3.6 Formats of FDDI Data and Token Frames. A data frame consists of the
Preamble (PA), Starting Delimiter (SD), Frame Control (FC), Destination Address (DA),
Source Address (SA), INFORMATION, Frame Check Sequence (FCS), Ending Delimiter
(ED) and Frame Status (FS) fields, with the FC, DA, SA and INFORMATION fields
covered by the FCS; a token consists only of the PA, SD, FC and ED fields.
The basic concept of a ring is that each station repeats the frame it receives to its
next station [Ross, 1986; Stallings, 1983]. If the destination station address (DA)
of the frame matches the MAC’s address, then the frame is copied into a local
buffer and the LLC is notified of the frame’s arrival. MAC marks the Frame
Status (FS) field to indicate three possible outcomes: 1) successful recognition of
the frame address, 2) the copying of the frame into a local buffer, and/or 3) the
detection of an error in the frame. The frame propagates around the ring until it
reaches the station that originally placed it on the ring. The transmitting station
examines the FS field to determine the success of the transmission. The
transmitting station is responsible for removing from the ring all its transmitted
frames; this process is referred to as frame stripping. During the stripping phase,
the transmitting station inserts IDLE symbols on the ring.
If a station has a frame to transmit then it can do so only after the token has been
captured. A token is a special frame, shown in Figure 3.6, which indicates that the
medium is available for use. The FDDI protocol supports multiple priority levels to
ensure the proper handling of frames. If the priority of a station does not allow it
to capture a token (its priority is less than the priority of the token), it must repeat
it to the next station. When a station captures the token, it removes it from the
ring, transmits one or more frames depending on the Token Rotation Time (TRT)
and Target Token Rotation Time (TTRT) as will be discussed later, and when it is
completed, it issues a new token. The new token indicates the availability of the
medium for transmission by another station.
Timed Token Protocol
The FDDI protocol is a timed token protocol that allows a station to transmit for a longer
period when the previous stations have not held the token for long; for example, when
they have no data frames to send, they relinquish the token
immediately. During initialization, a target token rotation time (TTRT) is
negotiated, and the agreed value is stored in each station. The actual token
rotation time (TRT) is measured by each station and is reset each time the token arrives. The
amount of traffic, both synchronous and asynchronous, that FDDI allows on the
network is governed by the following relation:
TTRT ≥ TRT + THT
where TRT denotes the token rotation time, that is the time since the token was
last received; THT denotes the token holding time, that is the time that the station
has held onto the token; and TTRT denotes the target token rotation time, that is
the desired average for the token rotation time.
Essentially, this equation states that on average, the token must circulate around
the ring within a pre-determined amount of time. This property explains why
FDDI protocol is known as ``timed token protocol''. The TTRT is negotiated and
agreed upon by all the stations at initialization of the network. The determination
of TTRT to obtain the best performance has been the subject of many papers and
is mainly determined by the desired efficiency of the network, the desired latency
in accessing the network, and the expected load on the network [Agrawal, Chen,
and Zhao, 1993]. TRT is constantly re-calculated by each station and is equal to
the amount of time since the token was last received. THT is the amount of time
that a station has held onto the token. A station that has the token and wants to
transmit a message must follow these two rules:
1) Transmit any synchronous frames that are required to be transmitted.
2) Asynchronous frames may be transmitted only if TTRT ≥ TRT + THT, before the token
is released and put back on the ring. A simple sketch of this decision logic is given below.
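The following Python sketch illustrates the token-handling decision at a single station under the two rules above. It is a simplified model, not the FDDI state machine: names such as sync_queue, async_queue and the per-frame transmission times are hypothetical, and timer details (late counters, restricted tokens, priorities) are omitted.

# Simplified model of the FDDI timed-token decision at one station.
# sync_queue/async_queue and frame.tx_time are hypothetical placeholders.

def on_token_arrival(trt, ttrt, sync_queue, async_queue, transmit):
    """Called when the token arrives; trt is the measured rotation time."""
    # Rule 1: synchronous frames are always sent, up to the station's
    # negotiated synchronous allocation (folded into sync_queue here).
    for frame in sync_queue:
        transmit(frame)

    # Rule 2: asynchronous frames may use whatever is left of the target
    # rotation time.  THT starts at zero and grows as frames are sent.
    tht = 0.0
    while async_queue and trt + tht + async_queue[0].tx_time <= ttrt:
        frame = async_queue.pop(0)
        transmit(frame)
        tht += frame.tx_time

    # The token is then released so downstream stations can transmit.
    return tht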
Synchronous traffic has priority over asynchronous traffic because of the
deadlines that need to be met. In order to reserve bandwidth for asynchronous
traffic, the amount of synchronous traffic allocated to each station is negotiated
and agreed upon at network initialization. In addition, FDDI has an asynchronous
priority scheme with up to 8 levels based upon the following inequality:
Ti ≥ TRT + THT
where Ti denotes the time allocated to transmit asynchronous traffic of priority i
(i can range from 1 to 8). FDDI also contains a multi-frame
asynchronous mode, which supports a continuous dialogue between stations. Two
stations may communicate multi-frame information by the use of a restricted
token. If a station transmits a restricted token instead of a normal token, then only
the station that received the last frame may transmit. If both stations continue
to transmit only a restricted token, then a dedicated multi-frame exchange is
possible. This feature only affects asynchronous communication. Synchronous
communication is unaffected since all stations are still required to transmit any
synchronous frames.
Logical Link Control (LLC) Layer
The Logical Link Control Layer is the means by which FDDI communicates with
higher level protocols. FDDI does not define an LLC sub-layer but has been
designed to be compatible with the standard IEEE 802.2 LLC format.
Station Management (SMT) Functions
The Station Management Function monitors all the activities on the FDDI ring
and provides control over all station functions. The main functions of the SMT
include [Ross, 1986]:
Fault Detection/Recovery
The FDDI protocol contains several techniques to detect, isolate and recover from
network failures. The failures handled can be grouped into two categories:
protocol-related failures and physical failures.
Connection Management
This involves controlling the bypass switch in each FDDI node, initializing valid
PHY links, and positioning MACs on the appropriate FDDI ring.
Frame Handling
This function assists in network configuration. SMT uses a special frame, Next
Station Address (NSA) frame, to configure the nodes on the FDDI rings.
Synchronous Bandwidth Management
The highest priority in FDDI is given to synchronous traffic where fixed units of
data are to be delivered at regular time intervals. Delivery is guaranteed with a
delay not exceeding twice TTRT. The bandwidth required for synchronous traffic
is assigned first and the remaining bandwidth is allocated for the asynchronous
traffic. In what follows, we discuss the protocol used to initialize the TTRT
interval.
For proper operation of FDDI's timed token protocol, every station must agree
upon the value of the targeted token rotation time. This initialization of the
network is accomplished through the use of claim frames. If a station wants to
change the value of the TTRT, it begins to transmit claim frames with a new
value of TTRT. Each station that receives a claim frame must do one of two
things:
1) If the value of TTRT in the claim frame is smaller than its current value,
then use the new TTRT and relay the claim frame.
2) If the value of TTRT in the claim frame is greater than its current value,
then transmit a new claim frame carrying its own smaller TTRT.
Initialization is complete when a station has received its own claim frame. This
means that all stations now have the same value of TTRT. The station that
received its own claim frame is now responsible for initializing the network to an
operational state. FDDI protocol guarantees that the maximum delay that can be
incurred in transmitting synchronous traffic is double the value of TTRT.
Consequently, if a station needs the delay to be less than an upper bound
(DELAYmax), it attempts to set the TTRT to be equal to half of this upper bound,
i.e., TTRT = DELAYmax / 2.
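As an illustration of the claim process described above, the following Python sketch simulates the bidding: each station keeps relaying the smallest TTRT it has seen, and the station whose own bid survives a full rotation wins and initializes the ring. This is a simplified model of the idea, not the FDDI claim state machine; the station list and names are hypothetical.

# Simplified simulation of FDDI claim-token arbitration.
# Each station bids the TTRT it wants; the smallest bid wins.

def run_claim_process(requested_ttrts):
    """requested_ttrts[i] is the TTRT (in ms) requested by station i."""
    n = len(requested_ttrts)
    current = list(requested_ttrts)        # each station's current best bid
    # Claim frames circulate until one station receives its own claim back,
    # which happens once every station holds the same (minimum) value.
    for _ in range(n):                     # at most n passes around the ring
        for i in range(n):
            upstream = current[(i - 1) % n]
            if upstream < current[i]:      # rule 1: adopt the smaller TTRT
                current[i] = upstream
            # rule 2: otherwise keep relaying our own (smaller) value
    agreed = min(requested_ttrts)
    winner = requested_ttrts.index(agreed) # station that initializes the ring
    return agreed, winner

if __name__ == "__main__":
    print(run_claim_process([8.0, 4.0, 16.0, 6.0]))   # -> (4.0, 1)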
A station can be connected to one or both rings and that connectivity determines
the type of protocol functions to be supported by each station. There are three
basic types of stations: Dual Attachment Station (DAS), Concentrator, and Single
Attachment Station (SAS). The DAS type requires two duplex cables, one to each
of the adjacent stations. The concentrator is a special DAS that provides
connection to the ring for several low-end Single Attachment Stations. In this
case, an SAS node is connected to only one ring; as a result, an SAS node cannot by
itself exploit the dual-ring fault tolerance, and a failed SAS is simply bypassed by its
concentrator.
FDDI-II Architecture and OSI Model
One main limitation of the FDDI synchronous protocol is that although on
average frames will reach their destination at a periodic rate defined by TTRT,
there is a possibility that a frame may reach its destination with an elapsed time
greater than TTRT. This will occur under heavy network loading. For example,
assume that one station is required to send some synchronous traffic and when it
receives the token, the TRT is equal to TTRT. In this case, no asynchronous
frames can be sent but it is still required to transmit its synchronous frame. As a
result, the token's TRT will be greater than TTRT and this condition may cause a
glitch in an audio or video signal, which must be transmitted at a periodic rate.
This limitation of the FDDI's synchronous protocol has led to the development of
FDDI-II.
FDDI-II adds to FDDI circuit switching capability so that it can handle the
integration of voice, video and data over an FDDI network. In FDDI-II, the
bandwidth is allocated to circuit-switched data in multiples of 6.144 Mbps
isochronous channels. The term isochronous refers to the essential characteristics
of a time scale or a signal such that the time intervals between consecutive
significant instants either have the same duration or durations that are multiples
of the shortest duration [Teener, 1989]. The number of isochronous channels can
be up to 16, using a maximum of 98.304 Mbps, where each channel can
be flexibly allocated to form a variety of highways whose bandwidths are
multiples of 8 Kbps (e.g., 8 Kbps, 64 Kbps, 1.536 Mbps or 2.048 Mbps).
Consequently, the synchronous and asynchronous traffic may have only 1.024
Mbps of bandwidth when all 16 isochronous channels are allocated.
Isochronous channels may be dynamically assigned and de-assigned on a real-time
basis, with any unassigned bandwidth allocated to the normal FDDI
traffic (synchronous and asynchronous).
Parameters                FDDI (ANSI X3T9.5)        Token Ring (IEEE 802.5)     Ethernet (IEEE 802.3)
Data Rate                 100 Mbps                  4 or 16 Mbps                10 Mbps
Overall Length            100 km                    1.2 km                      2.5 km
Nodes                     500                       96                          1024
Distance between Nodes    2 km                      0.46 km                     0.5 km
Packet Size (max)         4500 octets               8191 octets                 1514 octets
Medium                    Fiber                     Twisted Pair / Fiber        Coaxial Cable
Medium Access             Dual-ring token passing   Single-ring token passing   CSMA/CD
Table 3.1: Comparison of Ethernet, FDDI and Token Ring
FDDI-II represents a modification to the original FDDI specification such that an
additional sub-layer (Hybrid Ring Controller - HRC) has been added to the Data
Link Layer. The HRC allows FDDI-II to operate in an upwardly compatible
hybrid mode that not only provides the standard FDDI packet transmission
capability but also provides an isochronous transport mode. The function of the
HRC is to multiplex the data streams: it divides the FDDI-II data stream into
multiple data streams, one for each of the wide band channels that has been
allocated. More detailed information about FDDI-II circuit switched data format
and how bandwidth is allocated dynamically to isochronous, synchronous and
asynchronous traffic can be found in [Ross, 1986].
Copper FDDI (CDDI)
A cost-effective alternative to fiber-optic FDDI is a standard that
replaces the fiber optic cables with copper. The 100 Mbps copper FDDI (CDDI)
standard would use the same protocol as FDDI except that its transmission
medium would be the commonplace unshielded twisted-pair or shielded twisted-pair
copper wiring. The main advantage of using copper is that copper wiring,
connectors, and transceivers are much cheaper. The main tradeoff in using copper
wiring is that the maximum distance that could be traversed between nodes would
be limited to possibly 50 or 100 meters before electromagnetic interference
becomes a problem. This maximum distance is not a severe limiting factor since
the CDDI network would be used mainly for communication within a small LAN
that is physically located in one room or laboratory. The CDDI network could
then interface to the larger FDDI network through a concentrator station. In this
case, the FDDI network acts as a backbone network spanning large distances
interconnecting smaller CDDI LANs with a great savings in cost.
3.3 Metropolitan Area Networks
The DQDB is emerging as one of the leading technologies for high-speed
metropolitan area networks. DQDB is a media access control (MAC) protocol,
which is being standardized as the IEEE 802.6 standard for MANs [Stallings,
1995]. DQDB consists of two 150 Mbps contra-directional buses with two head
nodes, one on each bus, that continuously send fixed-length time slots (53 octets)
down the buses. The transmission on the two buses is independent and hence the
aggregate bandwidth of the DQDB network is twice the data rate of the bus. The
clock period of the DQDB network is 125 microseconds, which has been
chosen to support isochronous services, that is, voice services that require an 8 kHz
sampling frequency.
The DQDB protocol is divided into three layers: the first layer from the bottom
corresponds to the physical layer of the OSI reference model, the second layer
corresponds to the medium access sublayer, and the third layer corresponds to the
data-link layer as shown in Figure 3.7. DQDB protocols support three types of
services: connection-less, connection-oriented and isochronous services. The
main task of the convergence sublayer within a DQDB network is to map user
services into the underlying medium-access service. The connection-less service
transmits frames of length up to 9188 octets. Using fixed-length segments of 52
octets, DQDB provides the capability to perform frame segmentation and
reassembly. The connection-oriented service supports the transmission of 52-octet
segments between nodes interconnected by a virtual channel connection. The
isochronous service provides a service similar to the connection-oriented service,
but for users that require a constant inter-arrival time.
DQDB MAC Protocol
DQDB standard specifies the Medium Access Control and the physical layers.
Each bus independently transfers MAC cycle frames of duration 125
microseconds; each frame contains a frame header and a number of short, fixed-length
slots. The frames are generated by the head node of each bus and flow
downstream passing each node before being discarded at the end of the bus.
There are two types of slots: Queued Arbitrated Slots (QA) and Pre-Arbitrated
Slots (PA). QA slots are used to transfer asynchronous segments and PA slots are
used to transfer isochronous segments. In what follows, we focus on how the
distributed queue algorithm controls the access to the QA slots.
Figure 3.7: Functional Block Diagram of a DQDB Node
The DQDB MAC protocol acts like a single first-in-first-out (FIFO) queue. At
any given time, the node associated with the request at the top of the queue is
allowed to transmit in the first idle slot on the required bus. However, this single
queue does not physically exist, but instead it is implemented in a distributed
manner using the queues available in each node. This can be explained as
follows.
The head of each bus continuously generates slots whose headers contain a BUSY
(BSY) bit and a REQUEST (REQ) bit. The BSY bit indicates whether or not a
segment occupies the slot, while the REQ bit is used for sending requests for
future segment transmission. The nodes on each bus count the slots that have the
request bit set and the idle slots that pass by, so that they can determine their
position in the global distributed queue and consequently determine when they
can start transmitting their data.
Several studies [Stallings, 1987] have shown that the DQDB MAC access
protocol is not fair because the node waiting time depends on its position with
respect to the slot generators. As a result, several changes have been proposed to
make the DQDB protocol fairer [22]. Later, we discuss one approach, the Bandwidth
Balancing Mechanism (BWB), to address the unfairness issue in DQDB.
The DQDB access mechanism associated with one bus can be implemented using
two queue buffers and two counters. Without loss of generality, we name bus
A the forward bus and bus B the reverse bus. We will focus our description on
segment transmission on the forward bus, the procedure for transmission on the
reverse bus being the same. To implement the DQDB access mechanism on the
forward bus, each node contains two counters, a Request Counter (RC) and a Down
Counter (DC), and two queues, one for each bus. Each node can be in one of two
states: idle, when there is no segment to transmit, or count down.
Idle State: When a node is in the idle state, the node keeps count of the
outstanding requests from its downstream nodes using the RC counter. The RC
counter is increased by one for each request received on the reverse bus and
decreased by one for each empty slot on the forward bus; each empty slot on the
forward bus will be used to transmit one segment by downstream nodes. Hence,
the value of the Request counter (RC) reflects the number of outstanding requests
that have been reserved by the downstream nodes.
Figure 3.8 DQDB MAC Protocol Implementation
Count Down State: When the node becomes active and has a segment to
transmit, the node transfers the RC counter to the CD counter and resets the RC
counter to zero. The node then sends a request on the reverse bus by setting REQ
to 1 in the first slot whose REQ bit equals zero. The CD counter is decreased by
one for every empty slot on the forward bus until it reaches zero. Immediately
after this event, the node transmits into the first empty slot in the forward bus.
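The following Python sketch models the RC/CD bookkeeping of a single node on the forward bus, as described above. It is a simplified illustration (one priority level, no bandwidth balancing); the class and method names are hypothetical.

# Simplified model of one DQDB node's distributed-queue counters
# (one bus, one priority level). Names are illustrative only.

class DQDBNode:
    def __init__(self):
        self.rc = 0          # requests outstanding from downstream nodes
        self.cd = None       # countdown value; None means the idle state
        self.pending = []    # local segments waiting to be transmitted

    def on_request_seen_on_reverse_bus(self):
        # A downstream node reserved a future slot on the forward bus.
        self.rc += 1

    def queue_segment(self, segment):
        # Join the distributed queue: RC becomes CD, RC restarts at zero.
        # (Sending the REQ bit on the reverse bus is not modeled here.)
        self.pending.append(segment)
        if self.cd is None:
            self.cd = self.rc
            self.rc = 0

    def on_empty_slot_on_forward_bus(self):
        """Return the segment to transmit in this slot, or None."""
        if self.cd is None:          # idle: let downstream nodes use the slot
            self.rc = max(0, self.rc - 1)
            return None
        if self.cd > 0:              # still behind others in the global queue
            self.cd -= 1
            return None
        segment = self.pending.pop(0)          # our turn: transmit now
        self.cd = self.rc if self.pending else None
        self.rc = 0
        return segment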
Priority Levels
DQDB supports three levels of priorities that can be implemented by using
separate distributed queues, with two counters for each priority level. This means
that each node will have six counters in total, a Request Counter and a Down Counter for each
priority. Furthermore, the segment format will have three request bits, one for
each priority level. In this case, a node that wants to transmit on Bus A with a
given priority level will set the request bit corresponding to this priority level in
the first slot on Bus B in which that priority's request bit is not already set. In this
case, the Down Counter (DC) is decremented with every free slot passing on Bus
A, but is incremented for every request on Bus B with a higher priority than the
counter priority. The Request Counter (RC) is incremented only when a passing
request slot has the same priority level; the higher priority requests have already
been accounted for in the Down Counter [Stallings, 1987].
DQDB Fairness
Several research results have shown that DQDB is unfair and the DQDB
unfairness depends on the medium capacity and the bus length [Conti, 1991]. The
unfairness in DQDB can result in unpredictable behavior at heavy load. One
approach to improve the fairness of DQDB is to use the bandwidth balancing
mechanism (BWB). In this mechanism, whenever the CD counter reaches zero
and the station transmits in the next empty slot, it sends a signal to the bandwidth
balancing machine (BWB). The BWB machine uses a counter to count the
number of segments transmitted by its station. Once this counter reaches a given
threshold, referred to as BWB-MOD, the counter is cleared and the Request Counter (RC) is
incremented by one. This means that the station will skip one empty slot, leaving it
for downstream stations, which are further away from the slot generator
on the forward bus, thus improving DQDB fairness. The value of BWB-MOD
can vary from 0 to 16, where a value of 0 means the bandwidth balancing mechanism
is disabled [Conti, 1991].
Discussion
A MAN is optimized for a larger geographical area than a LAN, ranging from
several blocks of buildings to entire cities. As with local area networks, MANs
can also depend on communication channels of moderate-to-high data rates.
IEEE 802.6 is an important standard covering this type of network as well as
LANs. It offers several transmission rates, initially starting at 44.7 Mbps and
later expanding to speeds ranging from 1.544 Mbps to 155 Mbps. DQDB is
different from FDDI and token ring networks because it uses a high speed shared
medium that supports three types of traffic: bursty, asynchronous and
synchronous. Furthermore, the use of fixed-length packets, that are compatible
with ATM, provides an efficient and effective support for small and large packets
and for isochronous data.
3.4 Wide Area Networks (WANs)
The trend toward transmission of information generated from facsimile, video,
electronic mail, data, and images has sped up the conversion from analog-based
systems to high-speed digital networks. The Integrated Services Digital
Network (ISDN) has been recommended as a wide area network standard by
CCITT that is expected to handle a wide range of services that cover future
applications of high speed networks. There are two types of ISDN: Narrowband
ISDN (N-ISDN) and Broadband ISDN (B-ISDN). The main goal of N-ISDN is to
integrate the various services that include voice, video and data. The B-ISDN
supports high data rates (hundreds of Mbps). In this section, we discuss the
architecture and the services offered by these two types of networks.
3.4.1 Narrowband ISDN (N-ISDN)
The CCITT standard defines an ISDN network as a network that provides end-to-end
digital connectivity to support voice and non-voice services (data, images,
facsimile, etc.). The network architecture recommendations for ISDN should
support several types of networks: packet switching, circuit switching, non-switched,
and common-channel signaling. ISDN can be viewed as a digital bit
pipe in which multiple sources are multiplexed into this digital pipe. There are
several communication channels that can be multiplexed over this pipe and are as
follows:
• B channel: It operates at 64 Kbps and is used to provide circuit-switched,
packet-switched and semi-permanent circuit interconnections. It is used to carry
digital data, digitized voice and mixtures of lower-rate digital traffic.
• D channel: It operates at 16 Kbps and is used for two purposes: for
signaling purposes in conjunction with circuit-switched calls on
associated B channels, and as a pipe to carry packet-switched or slow-speed
telemetry information.
• H channels: Three hybrid channel speeds are identified: the H0 channel, which
operates at 384 Kbps; the H11 channel, which operates at 1.536 Mbps; and the H12
channel, which operates at 1.92 Mbps. These channels are used for providing higher
bit rates for applications such as fast facsimile, high-speed data, high quality audio
and video.
Two combinations of these channels have been standardized: basic access rate
and primary access rate. The basic access consists of 2B+D channels, providing
192 Kbps (including 48 Kbps of framing overhead). Typical applications which use this
access mode are those addressing most of the individual users including homes
and small offices, like simultaneous use of voice and data applications, teletext,
facsimile, etc. These services could use either one multifunctional terminal or
several terminals. Usually a single physical link is used for this access mode. The
customer can use all or parts of the two B channels and the D channel. Most
present day twisted pair loops will support this mode. The primary access mode is
intended for higher data rate communication requirements, which typically fall
under the category of nB+D channels. In this mode, the user can use all or part of
the B channels and the D channel. This primary access rate service is provided
using time division multiplexed signals over four-wire copper circuits or other
media. Each B channel can be switched independently; some B channels may be
permanently connected depending on the service application. The H channels can
also be considered to fall into this category.
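As a quick check on the channel arithmetic above (assuming the North American 23B+D primary rate; the European primary rate is 30B+D at 2.048 Mbps):
\[
\text{Basic access: } 2B + D = 2 \times 64 + 16 = 144 \ \text{Kbps}, \qquad
144 + 48 \ (\text{framing}) = 192 \ \text{Kbps}
\]
\[
\text{Primary access: } 23B + D = 23 \times 64 + 64 = 1536 \ \text{Kbps}, \qquad
1536 + 8 \ (\text{framing}) = 1.544 \ \text{Mbps}
\]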
Network Architecture and Channels
ISDN Reference Model
ISDN provides users with full network support by adopting the seven layers of
the OSI reference model. However, ISDN services are confined to the bottom
three layers (physical, data link and network layers) of the OSI reference model.
Consequently, ISDN offers three main services [Stallings, 1993]: Bearer Services,
Teleservices, and Supplementary Services. Bearer Services offer information
transfer without alteration in real time. This service corresponds to the OSI's
network service layer. There are various types of bearer services depending on the
type of application sought. Typical applications include speech and audio
information transfer.
Teleservices combine the data transportation (using bearer services) and
information processing. These services can be considered to be more user friendly
services and use terminal equipment. Typical applications are telephony, telefaxes
and other computer to computer applications. These correspond to all the
services offered by the several layers of the OSI reference model. Supplementary
Services are a mixture of one or more bearer or teleservices for providing
enhanced services which include Direct-Dial-In, Conference Calling and Credit-Card
Calling. A detailed description of these services can be found in [Stallings, 1987].
• Physical layer: This layer defines two types of interfaces depending on the
type of access, namely the Basic interface (basic access) and the Primary interface
(primary access).
• Data link layer: This layer has different link-access protocols (LAP)
depending on the channel used for the link, namely LAP-B (balanced, for the B
channel) and LAP-D (for the D channel). Apart from these link access techniques,
frame-relay access is also a part of the protocol definition.
• Network layer: This layer includes separate protocols for packet switching
(X.25), circuit switching, semi-permanent connections and channel signaling.
ISDN User-Network Interfaces
A key aspect of ISDN is that a small set of compatible user-network interfaces
can support a wide range of user applications, equipment and configurations. The
number of user-network interfaces is kept small to maximize user flexibility and
to reduce cost. To achieve this goal ISDN standards define a reference model
showing the functional groups and reference points between the groups.
Functional groups are sets of functions needed in ISDN user access arrangements.
Specific functions in the functional groups may be done in one or multiple pieces
of actual equipment. Reference points are conceptual points for dividing the
functional groups. In specific implementations, reference points may in fact
represent a physical interface between two functional groups. The functional
groups can be classified into two types of devices: Network Termination (NT1,
NT2 and NT12) and Terminal Equipment (TE1 and TE2). Network Termination 1
(NT1) provides functions similar to those offered by the physical layer of the OSI
Reference Model. Network Termination 2 (NT2) provides functions equivalent to
those offered by layers 1 through 3 of the OSI reference model (e.g., protocol
handling, multiplexing, and switching). These functions are typically executed by
equipment such as PBXs, LANs, terminal cluster controllers and multiplexers.
Network Termination 1,2 (NT12) is a single piece of equipment that combines the
functionality of NT1 and NT2. Terminal Equipment (TE) provides functions
undertaken by such terminal equipment as digital telephones, data terminal
equipment and integrated voice/data workstations. There are two types of TEs,
namely TE1 and TE2. TE1 refers to devices that support the standard ISDN interface,
while TE2 refers to those which do not directly support ISDN interfaces. Such non-ISDN
interfacing equipment requires a Terminal Adapter (TA) to connect into an
ISDN facility.
The reference points define the interface between the functional groups and these
include: Rate (R), System (S), Terminal (T), and User (U) reference points. The R
reference point is the functional interface between a non-ISDN terminal and the
terminal adapter. The S reference point is the functional interface seen by each
ISDN terminal. The T reference point is the functional interface seen by the
users of NT1 and NT2 equipment. The U reference point defines the functional
interface between the ISDN switch and the network termination equipment
(NT1). Standardization of this reference point is essential, especially when NT1s
and the Central Office modules are manufactured by different vendors.
It is a generally accepted fact that ISDN can be used not only as a separate entity,
but also as a tributary network, and it can play an important role in hybrid networks.
So applications that have been traditionally provided by different networking
schemes can now be provided in conjunction with ISDN. Some typical
applications for video in enterprise-wide networks include video-telephony in
2B+D circuit-switched networks, video conferences over public H0 and H11
links, and reconfigurable private video conference networks over channel-switched
or permanent H0 and H11 links. Medical imaging over 23B+D networks is
another of the many ISDN applications.
3.4.2 Broadband Integrated Service Data Network (B-ISDN)
With the explosive growth of network applications and services, it has been
recognized that ISDN's limited bandwidth cannot deliver the required bandwidth
for these emerging applications. Consequently, in 1985 the majority of the delegates
within CCITT COM XVIII agreed that there is a need for a broadband ISDN (B-ISDN)
that allows total integration of broadband services. Since then, the original ISDN has
been referred to as Narrow-band ISDN (N-ISDN).
The selected transfer mode for B-ISDN has changed several times since its
inception. So far two types of transfer modes have been used for digital data
transmission: Synchronous Transfer Mode (STM) and Asynchronous Transfer
Mode (ATM). STM is suitable for traffic that has severe real time requirements
(e.g., voice and video traffic). This mode is based on circuit switching service in
which the network bandwidth is divided into periodic slots. Each slot is assigned
to a call according to the peak rate of the call. However, this protocol is rigid and
does not support bursty traffic. The size of data packets transmitted on a computer
network varies dynamically depending on the current activity of the system.
Furthermore, some traffic on a data communication network is time insensitive.
Therefore, the STM is not selected for B-ISDN and the ATM, a packet switching
technique, is selected.
ATM technology divides voice, data, image, and video into short packets, and
transmits these packets by interleaving them across an ATM link. The packet
transmission time is equal to the slot length. In ATM the slots are allocated on
demand, while for STM periodic slots are allocated for every call. In ATM,
therefore, no bandwidth is consumed unless information is actually transmitted.
Another important parameter is whether the packet size should be fixed or
variable. The main factors that need to be taken into consideration when we
compare fixed packet size vs. variable packet sizes are the transmission
bandwidth efficiency, the switching performance (i.e. the switching speed, and
the switch's complexity) and the delay. Variable packet length is preferred to
achieve high transmission efficiency, because with fixed packet length a long
message has to be divided into several data packets, each of which is
transmitted with its own overhead. Consequently, the total transmission efficiency would
be low. However, with variable packet length, a long message can be transmitted
with only one overhead. Since the speed of switching depends on the functions to
be performed, with fixed packet length, the header processing is simplified, and
therefore the processing time is reduced. Consequently, from switching point of
view, fixed packet length is preferable.
From a delay perspective, packets of a fixed, small size require minimal
functionality at intermediate switches and take less time in queue memory
management; as a result, fixed-size packets reduce the delays experienced in the
overall network. For broadband network, with large bandwidth, the transmission
efficiency is not as critical as high-speed throughput and low latency. The gain in
the transmission efficiency brought by the variable packet length strategy is
traded off for the gain in the speed and the complexity of switching and the low
latency brought by the fixed packet length strategy. In 1988, the CCITT decided
to use fixed size cells in ATM.
Another important parameter that the CCITT needed to determine, once it decided
to adopt fixed size cells, is the length of cells. Two options were debated in the
choice of the cell length, 32 bytes and 64 bytes. The choice is mainly influenced
by the overall network delay and the transmission efficiency. The overall end-to-end
delay has to be limited in voice connections, in order to avoid echo
cancellers. For a short cell length like 32 bytes, voice connections can be
supported without using echo cancellers. However, for 64 byte cells, echo
cancellers need to be installed. From this point of view, Europe was more in favor
of 32 bytes so echo cancellers can be eliminated. But longer cell length increases
transmission efficiency, which was an important concern to the US and Japan.
Finally a compromise of 48 bytes was reached in the CCITT SGXVIII meeting of
June 1989 in Geneva.
In summary, ATM network traffic is transmitted in fixed cells with 48 bytes as
payload and another 5 bytes as header for routing through the network. The
network bandwidth is allocated on demand, i.e., asynchronously. The cells of
different types of traffic (voice, video, imaging, data, etc.) are interleaved on a
single digital transmission pipe. This allows statistical multiplexing of different
types of traffic, even when the burst rate of a certain traffic type temporarily exceeds
the bandwidth allocated to it.
An ATM network is highly flexible and can support high-speed data transmission
as well as real-time voice and video applications.
3.5 ATM
3.5.1 Virtual Connections
Fundamentally ATM is a connection-oriented technology, different from other
connection-less LAN technologies. Before data transmission takes place in an
ATM network, a connection needs to be established between the two ends using a
signaling protocol. Cells then can be routed to their destinations with minimal
information required in their headers. The source and destination IP addresses,
which are the necessary fields of a data packet in a connection-less network, are
not required in an ATM network. The logical connections in ATM are called
virtual connections.
Figure 3.9 Relation between Transmission Path, VPs and VCs
Two layers of virtual connections are defined by CCITT: virtual channel (VC)
connections (VCC) and virtual path (VP) connections (VPC). One transmission
path contains several VPs, as shown in Figure 3.9, and some of them could be
permanent or semi-permanent.
Furthermore, each VP contains bundles of VCs. By defining VPC and VCC, a
virtual connection is identified by two fields in the header of an ATM cell: Virtual
Path Identifier (VPI) and Virtual Channel Identifier (VCI).
The VPI/VCI only have local significance per link in the virtual connection.
They are not addresses and are used just for multiplexing and switching packets
from different traffic sources. Hence ATM does not have the overhead associated
with LANs and other packet-switched networks, where packets are forwarded
based on the headers and addresses that vary in location and size, depending on
the protocol used. Instead an ATM switch only needs to perform a mapping
between the VPI/VCI of a cell on the input link and an appropriate VPI/VCI value
on the output link.
• Virtual Channel Connection: A virtual channel connection is a logical end-to-end
connection. It is analogous to a virtual circuit in an X.25 network. It is
the concatenation of virtual channel links, which exist between two switching
points. A virtual channel has traffic usage parameters associated with it, such
as cell loss rate, peak rate, bandwidth, quality of service and so on.
• Virtual Path Connections: A virtual path connection is meant to contain
bundles of virtual channel connections that are switched together as one unit.
The use of virtual paths can simplify network architecture and increase
network performance and reliability, since the network deals with fewer
aggregated entities.
Figure 3.10: Switching in ATM: (a) VP switching, where VCI values are unchanged;
(b) VP/VC switching, where both VPI and VCI values can be changed
The VPI/VCI fields in an ATM cell can be used to support two types of
switching: VP switching and VP/VC switching. In a VP switch, the VPI field is
used to route the cells through the ATM switch while the VCI values are not changed, as
shown in Figure 3.10. In a VP/VC switch, both the VPI and VCI values may be
translated at each switch.
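To make the label-swapping idea concrete, the following Python sketch shows how a switch might translate VPI/VCI pairs with a simple lookup table. The table contents and names are hypothetical, and the sketch ignores signaling, policing and per-port details.

# Minimal sketch of VPI/VCI label swapping at an ATM switch.
# The translation table and its entries are hypothetical examples.

# (input_port, in_vpi, in_vci) -> (output_port, out_vpi, out_vci)
translation_table = {
    (1, 5, 1): (3, 8, 1),    # VP/VC switching: both labels rewritten
    (1, 5, 2): (3, 8, 2),
    (2, 1, 0): (4, 7, 0),
}

def switch_cell(input_port, vpi, vci):
    """Return (output_port, new_vpi, new_vci) for an incoming cell."""
    try:
        return translation_table[(input_port, vpi, vci)]
    except KeyError:
        raise ValueError("no virtual connection set up for this cell")

if __name__ == "__main__":
    print(switch_cell(1, 5, 2))   # -> (3, 8, 2)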
3.5.2 B-ISDN Reference Model
The ATM protocol reference model consists of the higher layers, the ATM layer,
the ATM Adaptation Layer (AAL), and the Physical layer. The ATM
reference/stack model differs from the OSI (Open System Interconnection) model
in its use of planes as shown in Figure 3.11. The portion of the architecture used
for user-to-user or end-to-end data transfer is called the User Plane (U-Plane).
The Control Plane (C-Plane) performs call connection control. The Management
Plane (M-Plane) performs functions related to resources and parameters residing
in its protocol entities.
Figure 3.11 Layers of ATM
ATM is connection-oriented, and it uses out-of-band signaling. This is in contrast
with the in-band signaling mode of the OSI protocols (X.25) where control
packets are inter-mixed with data packets. So during virtual channel connection
setup, only the control plane is active. In the OSI model, the two planes are merged
and are indistinguishable.
ATM Layers
The ATM Layers are shown in Figure 3.11. We briefly discuss each of the layers.
Physical Layer
The Physical Layer provides the transport of ATM cells between two ATM
entities. Based on its functionalities, the Physical Layer is segmented into two
sublayers, namely Physical Medium Dependent (PMD) sublayer and the
Transmission Convergence (TC) sublayer. This sub-layering separates
transmission from physical interfacing, and allows ATM interfaces to be built on
a variety of physical interfaces. The PMD sublayer is device dependent. Its
typical functions include bit timing and physical-medium details such as connectors. The TC
sublayer generates and recovers transmission frames. The sending TC sublayer
performs the mapping of ATM cells to the transmission system. The receiving TC
sublayer receives a bit stream from PMD, extracts the cells and passes them on to
the ATM layer. It generates and checks HEC (header error control) field in the
ATM header, and it also performs cell rate decoupling through deletion and
insertion of idle cells.
Synchronous Optical Network (SONET)
The SONET (Synchronous Optical NETwork) also known internationally as
Synchronous Digital Hierarchy (SDH), is a physical layer transmission standard
of B-ISDN (Broadband Integrated Services Digital Network). SONET is a set of
physical layers, originally proposed by Bellcore for specifying standards for optic
fiber based transmission line equipment. It defines a set of framing standards,
which dictates how bytes are transmitted across the links, together with ways of
multiplexing existing line frames (T1, T3 etc.) into SONET. The lowest SONET
frame rate, called STS-1, defines 8 kHz frames of 9 rows of 90 bytes each. The first 3
bytes of each row are used for Operation, Administration and Management (OAM) purposes
and the remaining 87 bytes of each row are used for data. This gives a data rate of 51.84
Mbps. The next highest frame rate standard is STS-3, with 9 bytes per row for OAM and
261 bytes per row for data, providing a 155.52 Mbps data rate. There are other higher
speed SONET standards available: STS-12 at 622.08 Mbps, STS-24 at 1.244 Gbps,
STS-48 at 2.488 Gbps and so on (STS-N at N x 51.84 Mbps).
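As a check on these rates (assuming the 8000 frames-per-second, 9-row frame structure described above):
\[
\text{STS-1: } 9 \times 90 \ \text{bytes} \times 8 \ \text{bits} \times 8000 \ \text{frames/s} = 51.84 \ \text{Mbps}
\]
\[
\text{STS-3: } 9 \times 270 \times 8 \times 8000 = 155.52 \ \text{Mbps}, \qquad
\text{STS-}N: N \times 51.84 \ \text{Mbps}
\]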
The capabilities of SONET are mapped onto a 4-layer hierarchy, namely Photonic
(responsible for conversion between electrical and optical signals and
specification for physical layer), Section (functionalities between repeater and
multiplexer), Line (functionalities between multiplexers) and Path (function
between end-to-end transport).
Figure 3.12 ATM Protocol Stack
ATM Cell Format
The ATM Layer performs multiplexing and de-multiplexing of cells from
different connections (identified by different VPIs/VCIs) onto a single cell
stream. It extracts cell headers from received cells and adds cell headers to the
cells being transmitted. Translation of VCI/VPI may be required at ATM
switches. Figure 3.13 (a) shows the ATM cell format. Cell header formats for
UNI (User-Network Interface) and NNI (Network-Network Interface) are shown
in Figures 3.13 (b) and (c), respectively. The functions of the various fields in
the ATM cell headers are as follows:
• Generic Flow Control (GFC): This is a 4-bit field used only across the UNI to
control traffic flow across the UNI and alleviate short-term overload
conditions, particularly when multiple terminals are supported across a single
UNI.
• Virtual Path Identifier (VPI): This is an 8-bit field across the UNI and 12 bits
across the NNI. For idle cells or cells with no information the VPI is set to
zero, which is also the default value for the VPI. The use of non-zero values of VPI
across the NNI is well understood (for trunking purposes); however, the
procedures for accomplishing this are under study.
Figure 3.13 ATM Cell Format: (a) the 53-byte ATM cell, consisting of a 5-byte header
and a 48-byte payload; (b) the ATM cell header at the UNI; (c) the ATM cell header at the
NNI. (GFC: Generic Flow Control, VPI: Virtual Path Identifier, VCI: Virtual Channel
Identifier, PTI: Payload Type Identifier, CLP: Cell Loss Priority, HEC: Header Error Check)
• Virtual Channel Identifier (VCI): The 16-bit VCI is used to identify the
virtual circuit in a UNI or an NNI. The default value for the VCI is zero.
Typically VPI/VCI values are assigned symmetrically; that is, the same values
are reserved for both directions across a link.
• Payload Type Identifier (PTI): This is a 3-bit field for identifying the
payload type as well as for identifying the control procedures. When bit 4 in
the octet is set to 0, it means it is a user cell. For user cells, if bit 3 is set to 0,
it means that the cell did not experience any congestion in the relay between
two nodes. Bit 2 of a user cell is used to indicate the type of user cell. When bit
4 is set to 1, it implies the cell is used for management functions such as error
indications across the UNI.
• Cell Loss Priority (CLP): This field is used to provide guidance to the
network in the event of congestion. The CLP bit is set to 1 if a cell can be
discarded during congestion. The CLP bit can be set by the user or by the
network. An example of the network setting it is when the user exceeds the
committed bandwidth and the link is under-utilized.
• Header Error Check (HEC): This is an 8-bit Cyclic Redundancy Code
(CRC) computed over all fields in the ATM cell header. It is capable of
detecting all single-bit errors and certain multiple-bit errors. It can also be
used to correct single-bit errors, but this is not mandatory.
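The following Python sketch decodes the five header bytes of a UNI cell into the fields described above. It is an illustration of the bit layout only; the function name is arbitrary and HEC verification is omitted.

# Decode the 5-byte ATM cell header at the UNI into its fields.
# Bit layout: GFC(4) VPI(8) VCI(16) PTI(3) CLP(1) HEC(8).

def parse_uni_header(header: bytes) -> dict:
    if len(header) != 5:
        raise ValueError("an ATM cell header is exactly 5 bytes")
    b0, b1, b2, b3, b4 = header
    return {
        "GFC": b0 >> 4,                                      # 4 bits
        "VPI": ((b0 & 0x0F) << 4) | (b1 >> 4),               # 8 bits at the UNI
        "VCI": ((b1 & 0x0F) << 12) | (b2 << 4) | (b3 >> 4),  # 16 bits
        "PTI": (b3 >> 1) & 0x07,                             # 3 bits
        "CLP": b3 & 0x01,                                    # 1 bit
        "HEC": b4,                                           # CRC over bytes 1-4
    }

if __name__ == "__main__":
    # Example cell header with VPI = 5 and VCI = 2 (values are arbitrary).
    print(parse_uni_header(bytes([0x00, 0x50, 0x00, 0x20, 0x00])))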
ATM Adaptation Layer
The AAL Layer provides the proper interface between the ATM Layer and the
higher layers. It enhances the services provided by the ATM Layer according to
the requirements of specific applications: real-time, constant bit rate or variable
bit rate. Accordingly, the services provided by AAL Layer can be grouped into
four classes. The AAL Layer has five types of protocols to support the four
classes of traffic patterns. The correspondence between the class of service
and the type of AAL protocol is as follows.
• Type 1: Supports Class A applications, which require constant bit rate
(CBR) services with a time relation between source and destination. Error
recovery is not supported. Examples include real-time voice messages,
video traffic and some current data video systems.
• Type 2: Supports Class B applications, which require variable bit rate
(VBR) services with a time relation between source and destination. Error
recovery is also not supported. Examples are teleconferencing and
encoded image transmission.
• Type 3: Supports Class C applications, which are connection-oriented
(CO) data transmission applications. A time relation between source and
destination is not required. It is intended to provide services to
applications that use a network service like X.25.
• Type 4: Supports Class D applications, which are connection-less (CL)
data transmission applications. A time relation between source and
destination is not required. The current datagram networking applications
like TCP/IP or TP4/CLNP belong to Class D. Since the protocol formats
of AAL type 3 and type 4 are similar, they have been merged into AAL type
3/4.
• Type 5: This type was developed to reduce the overhead related to AAL
type 3/4. It supports connection-oriented services more efficiently. It is
more often referred to as the 'Simple and Efficient AAL', and it is used for
Class C applications.
The AAL layer is further divided into 2 sublayers: the convergence sublayer (CS)
and the segmentation-and-reassembly sublayer (SAR). The CS is service
dependent and provides the functions needed to support specific applications
using AAL. The SAR sublayer is responsible for packing information received
from the CS into cells for transmission and unpacking the information at the other
end. The relationship between the ATM and AAL layers is illustrated in the ATM
protocol stack of Figure 3.12.
An important characteristic of ATM traffic is its burstiness, meaning that some traffic
sources may generate cells at a near-peak rate for a very short period of time and
immediately afterwards become inactive, generating no cells. Such a
bursty traffic source will not require continuous allocation of bandwidth at its
peak rate. Since an ATM network supports a large number of such bursty traffic
sources, statistical multiplexing can be used to gain bandwidth efficiency,
allowing more traffic sources to share the bandwidth. But if a large number of
traffic sources become active simultaneously, severe network congestion can
result.
In an ATM network, congestion control is performed by monitoring the
connection usage; this is called source policing. Every virtual connection (VPC or
VCC) is associated with a traffic contract, which defines traffic
characteristics such as the peak bit rate, mean bit rate, and burst duration. The
network monitors all connections for possible contract violations. Source policing is a
preventive control strategy: preventive control does not wait until congestion
actually occurs. It tries to prevent the network from reaching an unacceptable
level of congestion by controlling traffic flow at entry points to the network.
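A common way to realize such per-connection policing is a leaky-bucket style algorithm, of which the following Python sketch is one simple variant. It is not the exact CCITT/ATM Forum generic cell rate algorithm; the parameter names and the drop-versus-tag policy are illustrative assumptions.

# Simple leaky-bucket policer for one virtual connection.
# rate is the contracted mean cell rate (cells/s); burst_tolerance is the
# bucket depth in cells. Parameter names and policy are illustrative.

class LeakyBucketPolicer:
    def __init__(self, rate, burst_tolerance):
        self.rate = float(rate)
        self.depth = float(burst_tolerance)
        self.level = 0.0          # current bucket fill, in cells
        self.last_time = 0.0

    def conforms(self, arrival_time):
        """Return True if a cell arriving at arrival_time is conforming."""
        # Drain the bucket at the contracted rate since the last arrival.
        elapsed = arrival_time - self.last_time
        self.level = max(0.0, self.level - elapsed * self.rate)
        self.last_time = arrival_time
        if self.level + 1.0 <= self.depth:
            self.level += 1.0     # accept the cell and account for it
            return True
        return False              # violating cell: drop it or set CLP = 1

if __name__ == "__main__":
    policer = LeakyBucketPolicer(rate=1000.0, burst_tolerance=10.0)
    times = [i * 0.0001 for i in range(30)]      # a 10,000 cells/s burst
    print([policer.conforms(t) for t in times])  # later cells violate the contract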
3.6 Peripheral Area Networks (PAN)
The current advances in computer and networking technology are changing the
design of the networks that interconnect computers with their peripherals. The use
of distributed computing systems allows users to transparently share and access
remote computing and peripheral resources available across the network, and hence
increases the complexity of the Local Peripheral Network (LPN) [Cummings, 1990].
Furthermore, the increased processing power of computers has also led to a
significant increase in input/output bandwidth and in the number of required
channels. Applications performing intensive scientific, multimedia or database
work demand an increase in the input/output bandwidth of computers and
peripherals.
The current input/output peripheral standards cannot meet the required
input/output bandwidth. Moreover, the cost of cabling and connections represents a
significant portion of the total system cost. The specifications of these standards
are as follows:
• Small Computer Systems Interface (SCSI): This interface comes in two
variants: 1) a base SCSI, designed to support low-end systems, with a
speed of 8-16 Mbps, and 2) a differential SCSI, designed to support
middle systems, which connects 8 units over a distance of 25 meters with a
speed of 32 Mbps.
• Intelligent Peripheral Interface (IPI): Designed to support middle
systems, IPI connects a channel to eight control units, over a distance of
75 meters, with a speed of 48-80 Mbps.
• IBM Block Mux (OEMI): Designed to support high-end systems. A
channel can connect up to 7 units, over a distance of 125 meters, with a
speed of 24-58 Mbps.
• High Performance Parallel Interface (HIPPI): Designed to meet the
needs of supercomputing applications, the HIPPI channel can deliver 800
Mbps over a 32-bit parallel bus whose length can be up to 25 meters.
• InfiniBand Architecture (IBA): IBA defines a System Area Network (SAN)
for connecting multiple independent processor platforms (i.e., host
processor nodes), I/O platforms, and I/O devices. The IBA SAN is a
communications and management infrastructure supporting both I/O and
inter-processor communications (IPC) for one or more computer systems.
An IBA system can range from a small server with one processor and a
few I/O devices to a massively parallel supercomputer installation with
hundreds of processors and thousands of I/O devices. Furthermore, the
internet protocol (IP) friendly nature of IBA allows bridging to an internet,
intranet, or connection to remote computer systems.
The Fiber Channel (FC) is a new standard prepared by the ANSI X3T9.3
committee that aims at providing an efficient local peripheral network that can operate at
speeds of gigabits per second. FC is designed to provide a general transport
vehicle supporting all the existing peripheral standards mentioned above. This is
achieved through the use of bridges which enable data streams from existing
protocols to be supported within the FC sub-network. In this case, FC
provides a replacement of the physical interface layer, thereby offering various
main features of the Fiber Channel and HIPPI standards because of their
importance to the development of High Performance Distributed Systems.
3.6.1 Fiber Channel Standard
Fiber Channel standard (FCS) has a five-layered structure to reduce the interdependency between the functional areas. This layered approach allows changes
in technology to improve the implementation of one layer without affecting the
design of other layers. For example, this is clearly illustrated at the FC-1 to FC-0
boundary, where the encapsulated data stream can be transmitted over a choice of
multiple physical interfaces and media. The functions performed by each layer are
outlined below:
• FC-4: Defines the bridges between existing channel protocols (IPI-3, SCSI, HIPPI, Block Mux, etc.) and FCS. These bridges provide a continuity of system evolution and provide a means of protecting the customer's investment in hardware and software while at the same time enabling the use of FCS capabilities.
• FC-3: Defines the set of communication services, which is common across all nodes. These services are available to all protocol bridges defined in the FC-4 layer.
• FC-2: Defines the framing protocol on which FCS communication is based. It also defines the control and data functions, which are contained within the frame format.
• FC-1: Defines the encoding and decoding scheme, which is associated with the transmission frame stream. It specifies the special transmission sequences, which are required to enable communication between the physical interfaces.
• FC-0: Defines the physical interface that supports the transmission of data through the FC network. This includes specifications for the fiber, connections, and transceivers. These specifications are based on a variety of media, each designed to meet a range of users, from low to high end implementations.
3.6.2 Infiniband Architecture
IBA defines a switched communications fabric allowing many devices to
concurrently communicate with high bandwidth and low latency in a protected,
remotely managed environment. An endnode can communicate over multiple IBA
ports and can utilize multiple paths through the IBA fabric. The multiplicity of
IBA ports and paths through the network are exploited for both fault tolerance
and increased data transfer bandwidth.
IBA hardware off-loads much of the I/O communications operation from the
CPU. This allows multiple concurrent communications without the
traditional overhead associated with communication protocols. The IBA SAN
provides its I/O and IPC clients with zero processor-copy data transfers, with no kernel
involvement, and uses hardware to provide highly reliable, fault tolerant
communications.
Figure 3.14 Infiniband Architecture System Area Network
An IBA System Area Network consists of processor nodes and I/O units
connected through an IBA fabric made up of cascaded switches and routers.
IBA handles the data communications for I/O and IPC in a multi-computer
environment. It supports the high bandwidth and scalability required for I/O. It
caters to the extremely low latency and low CPU overhead required for IPC. With
IBA, the OS can provide its clients with communication mechanisms that bypass
the OS kernel and directly access IBA network communication hardware,
enabling efficient message passing operation.
IBA is well suited to the latest computing models and will be a building block for
new forms of I/O and cluster communication. IBA allows I/O units to
communicate among themselves and with any or all of the processor nodes in a
system. Thus an I/O unit has the same communications capability as any processor
node.
An IBA network is subdivided into subnets interconnected by routers as
illustrated in Figure 3.15. Endnodes may attach to a single subnet or multiple
subnets.
Figure 3.15 IBA Network Components
An IBA subnet is composed of endnodes, switches, routers, and subnet managers
interconnected by links. Each IBA device may attach to a single switch, to
multiple switches, and/or directly to other devices.
The semantic interface between the message and data service and the channel adapter is
referred to as IBA verbs. Verbs describe the functions necessary to configure,
manage, and operate a host channel adapter. These verbs identify the appropriate
parameters that need to be included for each particular function. Verbs are not an
API, but provide the framework for the operating system vendor (OSV) to specify the API.
IBA is architected as a first order network and as such it defines the host
behavior (verbs) and defines memory operation such that the channel adapter can
be located as close to the memory complex as possible. It provides independent
direct access between consenting consumers regardless of whether those
consumers are I/O drivers and I/O controllers or software processes
communicating on a peer to peer basis. IBA provides both channel semantics
(send and receive) and direct memory access with a level of protection that
prevents access by non participating consumers.
The foundation of IBA operation is the ability of a consumer to queue up a set of
instructions that the hardware executes. This facility is referred to as a work
queue. Work queues are always created in pairs, called a Queue Pair (QP), one for
send operations and one for receive operations as shown in Figure 3.16. In
general, the send work queue holds instructions that cause data to be transferred
between the consumer’s memory and another consumer’s memory, and the
receive work queue holds instructions about where to place data that is received
from another consumer. The other consumer is referred to as a remote consumer
even though it might be located on the same node.
Figure 3.16 Consumer Queuing Model
The architecture provides a number of IBA transactions that a consumer can use
to execute a transaction with a remote consumer. The consumer posts work queue
elements (WQE) to the QP and the channel adapter interprets each WQE to
perform the operation.
For Send Queue operations, the channel adapter interprets the WQE, creates a
request message, segments the message into multiple packets if necessary, adds
the appropriate routing headers, and sends the packet out the appropriate port. The
port logic transmits the packet over the link where switches and routers relay the
packet through the fabric to the destination. When the destination receives a
packet, the port logic validates the integrity of the packet. The channel adapter
associates the received packet with a particular QP and uses the context of that
QP to process the packet and execute the operation. If necessary, the channel
adapter creates a response (acknowledgment) message and sends that message
back to the originator.
Reception of certain request messages causes the channel adapter to consume a
WQE from the receive queue. When it does, a completion queue entry (CQE)
corresponding to the consumed WQE is placed on the appropriate completion queue,
which causes a work completion to be issued to the consumer that owns the QP.
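To make the queuing model concrete, the following Python sketch mimics the WQE/CQE flow described above for a single queue pair. It is only an illustration under simplifying assumptions; the names used here (QueuePair, post_send, post_recv, poll_completion) are hypothetical and do not correspond to the actual IBA verbs interface.

    from collections import deque

    class QueuePair:
        """Toy model of a queue pair: a send queue, a receive queue, and a
        completion queue on which work completions are reported."""

        def __init__(self):
            self.send_queue = deque()        # WQEs describing data to transmit
            self.recv_queue = deque()        # WQEs describing where to place incoming data
            self.completion_queue = deque()  # CQEs visible to the consumer

        def post_send(self, buffer):
            # The consumer posts a send WQE; a real channel adapter would segment
            # the message into packets, add routing headers, and transmit them.
            self.send_queue.append({"op": "SEND", "buf": buffer})

        def post_recv(self, buffer):
            # The consumer posts a receive WQE telling the adapter where to place
            # data arriving from the remote consumer.
            self.recv_queue.append({"op": "RECV", "buf": buffer})

        def deliver(self, data):
            # Simulate arrival of a request message: consume one receive WQE and
            # place a CQE on the completion queue for the consumer that owns the QP.
            if not self.recv_queue:
                raise RuntimeError("no receive WQE posted")
            wqe = self.recv_queue.popleft()
            wqe["buf"].extend(data)
            self.completion_queue.append({"status": "OK", "bytes": len(data)})

        def poll_completion(self):
            # The consumer polls for work completions.
            return self.completion_queue.popleft() if self.completion_queue else None

    qp = QueuePair()
    inbox = bytearray()
    qp.post_recv(inbox)                      # receiver prepares a buffer
    qp.deliver(b"message from remote peer")  # request message consumes the WQE
    print(qp.poll_completion())              # {'status': 'OK', 'bytes': 24}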
The devices in an IBA system are classified as switches, routers, channel adapters,
repeaters, and the links that interconnect them.
The management infrastructure includes subnet managers and general service
agents.
IBA provides Queue Pairs (QPs). The QP is the virtual interface that the hardware
provides to an IBA consumer; it acts as a virtual communication port for the
consumer. The architecture supports up to 2^24 QPs per channel adapter, and the
operation of each QP is independent from the others. Each QP provides a high
degree of isolation and protection from other QP operations and other consumers.
Thus a QP can be considered a private resource assigned to a single consumer.
The consumer creates this virtual communication port by allocating a QP and
specifying its class of service. IBA supports the services shown in Table 3.2.
Table 3.2 Service Types supported by QP
Discussion
Gigabit networks represent a change in kind, not just degree. Substantial progress
must be made in the areas of protocol, high-speed computer interfaces and
networking equipment. The challenge of the 90's will be resolving these problems
as well as providing the means for disparate networking approaches to
communicate with each other smoothly and efficiently. Ultimately, however, it
seems that we are moving towards a single, public networking environment built
on an infrastructure of gigabit-speed fiber optic links, most probably defined at
the physical layer by an FDDI or SONET standard. ATM protocols support
multimedia information such as voice, video, and data in one integrated networking
environment. Infiniband supports connections among many nodes and
processors spanning multiple networks and serves applications requiring different QoS.
3.7 Wide Area Networks (WANs)
WANs are built to provide communication solutions for organizations or people
who need to exchange digital information between two distant places. Since the
distance is large, the local telecommunication company is involved; in fact,
WANs are usually maintained by a country's public telecommunication
companies (PTTs, such as AT&T and Sprint), which offer different communication
services.
The main purpose of a WAN is to provide reliable, fast and safe communication
between two or more places (Nodes) with low delays and at low prices. WANs
enable an organization to have one integral network between all its departments
and offices, even if they are not all in the same building or city, providing
communication between the organization and the rest of the world. In principle,
this task is accomplished by connecting the organization (and all the other
organizations) to the network nodes by different types of communication
strategies and applications. Since WANs are usually developed by the PTT of
each country, their development is influenced by each PTT's own strategies and
policies.
The basic WAN service that PTTs have offered for many years is the Leased
Line. A Leased Line is a point-to-point connection between two places,
implemented over different transmission media (usually through PSTN trunks),
which creates one link between its nodes. An organization whose network is
based on such lines has to connect each pair of offices with a separate line, meaning that each
office is connected to as many lines as the number of offices it communicates with.
The Packet Switched WAN appeared in the 1960's, and defined the basis for all
communication networks today. The principle in Packet Switched Data Network
(PSDN) is that the data between the nodes is transferred in small packets. This
principle enables the PSDN to allow one node to be connected to more than one
other node through one physical connection. That way, a fully connected network
between several nodes can be obtained by connecting each node to one physical
link. Another advantage of Packet Switching was the efficient use of resources by
sharing the network bandwidth among the users (instead of dividing it).
The first Packet Switched communication networks were based on the X.25
packet switching protocol. X.25 networks became the de facto standard for non-permanent
data communication and were adopted by most PTTs.
X.25 networks enabled cheaper communication, since their tariff was based on
the communication time and the amount of data transferred. X.25 networks used
the PTT's transmission networks more efficiently since the bandwidth was
released at the end of the connection, or when no data was transmitted. Another
advantage of X.25 was that it allowed easy implementation of international
connections enabling organizations to be connected to data centers and services
throughout the world. By the 1980's, X.25 networks were the main international
channel for commercial data communication.
Today, to meet high speed demands, WANs rely on technologies such as ATM (B-ISDN), Frame Relay, SONET and SDH. We have already discussed ATM,
SONET and SDH in the previous section on MANs.
3.8 Wireless LANS
3.8.1 Introduction
Wireless LANs provide many convenient facilities that are not provided by
traditional LANs such as mobility, relocation, ad hoc networking and coverage of
locations difficult to wire. Wireless LANs were not of much practical use until the
recent past, due to many technological and economic reasons such as high
prices, low bandwidth, transmission power requirements, infrastructure and
licensing. These concerns have been adequately addressed over the last few years,
the popularity of wireless LANs is increasing rapidly, and a new standard,
namely IEEE 802.11, attempts to standardize these efforts.
3.8.2 IEEE 802.11
IEEE 802.11 defines a number of services that wireless LANs are required to
provide in order to offer functionality equivalent to wired LANs. The services specified in this
standard are:
1. Association: Establishes an initial association between a station and an
access point. A LAN's identity and address must be known and confirmed
before the station can start transmitting.
2. Re-association: Enables an established association to be transferred
from one access point to another, allowing a mobile station to move from
one cell to another.
3. Disassociation: A notification from either a station or an access point
that an existing association is terminated. A mobile station has to notify
the base station before it shuts down; however, the base stations have a
capability to protect themselves against stations that shut down without
any notification.
4. Authentication: Used to establish the identity of the stations to each
other. In a wired LAN, the physical connection itself conveys the
identity of the other station. Here, an authentication scheme has to be used to
establish the proper identity of the stations. Though the standard does not
specify any particular authentication scheme, the methods used can range
from relatively insecure handshaking to a public-key encryption
scheme.
5. Privacy: Used to prevent broadcast messages from being read by users
other than the intended recipient. The standard provides for an optional
use of encryption to provide a high level of privacy.
The 802.11 standard also specifies three kinds of physical media standards for
wireless LANs:
- Infrared at 1 and 2 Mbps, operating at a wavelength between 850 and 950 nm
- Direct-sequence spread spectrum operating in the 2.4-GHz ISM band; up to
seven channels with data rates of 1 or 2 Mbps can be used
- Frequency-hopping spread spectrum operating in the 2.4-GHz ISM band
3.8.3 Classification
Wireless LANs are classified according to the transmission techniques used, and
all current wireless LAN products fall into one of the three following categories:
1. Infrared LANs: In this case, an individual cell of an IR LAN is limited
to a very small distance, such as a single room. This is because infrared
rays cannot penetrate opaque walls.
2. Spread Spectrum LANs: These make use of spread-spectrum
technology for transmission. Most of the networks in this category operate
in bands where no licensing is required from the FCC.
3. Narrowband Microwave: These LANs operate at very high microwave
frequencies, and FCC licensing is required.
3.8.4 Applications of Wireless LAN Technology
1. Nomadic Access
Nomadic access refers to the wireless link between a wireless station and a
mobile terminal (a notebook or laptop) equipped with an antenna. This gives
access to users who are on the move, and wish to access the main hub from
different locations.
2. Ad-Hoc Networking
An ad-hoc network is a network set up temporarily to meet some immediate need,
such as conferences and demonstrations. No infrastructure is required for an ad-hoc
network, and with the help of wireless technologies, a collection of users
within range of each other may dynamically configure themselves into a
temporary network.
3. LAN Extension
A wireless LAN saves the cost of installation of LAN cabling and eases the task
of relocation and other modifications and extensions to the existing network
structure.
4. Cross Building Interconnect
Point-to-point wireless links can be used to connect LANs in nearby buildings,
independent of whether the buildings themselves use wired or wireless
LANs. Though this is not a typical use of wireless LANs, it is usually
included as an application for the sake of completeness.
Summary
Computer networks play an important role in the design of high performance
distributed systems. We have classified computer networks into four types:
Peripheral Area Networks, Local Area Networks, Metropolitan Area Networks
and Wide Area Networks. For each class, we discussed the main computer
technology that can be used to build high performance distributed systems. These
networks include HIPPI, Fiber Channel, FDDI, DQDB, and ATM. Gigabit
networks represent a change in kind, not just degree. Substantial progress must
be made in the areas of protocols, high speed computer interfaces and networking
equipment before they can be widely used. The challenge of the 90's will be
resolving these problems as well as providing the means for disparate networking
approaches to communicate with each other smoothly and efficiently. Ultimately,
however, it seems that we are moving toward a single, public networking
environment built on an infrastructure of gigabit-speed fiber optic links, most
probably defined at the physical layer by a merged Fiber Channel/SONET standard.
ATM protocols will become a popular format for sharing multimedia voice, video,
and data in one integrated networking environment. Further, ATM has the
potential to implement all classes of computer networks and provide the required
current and future communication services.
Problems
1. In some cases FDDI synchronous transmission causes a ``glitch'' in an audio or video signal that must be transmitted at a periodic rate. Describe a scenario where this glitch can occur and suggest a solution to this problem.
2. Suppose source computer A wants to send a file of size 10 Kbytes to destination computer B, and the communication has to take place over a HIPPI channel. Explain how this exchange would take place, with regard to data framing and physical layer control signal sequencing.
3. Discuss the following: the appropriate place of the FDDI protocol in the OSI reference model; the use of claim frames in the ring initialization process; and the guarantee on maximum delay on the ring provided by the TTRT (target token rotation time) used by the FDDI protocol.
4. How can the Target Token Rotation Time affect the performance of an FDDI network?
5. What are the advantages of using copper as the transmission medium for FDDI?
6. What are the differences between FDDI-II and FDDI? How is the former an improvement over the latter?
7. What is the most cost-efficient configuration for a large FDDI network? Why?
8. What are the differences between the Ethernet, FDDI and Token Ring standards for forming Local Area Networks?
9. The DQDB MAC protocol is biased towards the nodes that are close to the slot generators. Explain this scenario and describe one technique to make the DQDB protocol more fair.
10. ATM mainly uses virtual connections for establishing a communication path between nodes. Describe the relationship between virtual connection, virtual path, and virtual channel. Based on this relationship, how are the VPI and VCI identifiers used in performing cell switching in the ATM switches? Do you think the sizes of the VPI and VCI fields are large enough for holding the required switching information?
11. What are the advantages of having a large packet size? What are the advantages of having a small packet size? Why does ATM prefer smaller packet sizes?
12. What are the different classes supported by the ATM Adaptation Layer? On what basis are the classes divided, and how?
13. ISDN offers three main services. Describe these services and their applications to real-life examples.
14. What is SONET? Describe its capabilities and limitations.
15. What are the main characteristics of HIPPI? Discuss the functions of HIPPI-FP (framing protocol) and HIPPI-LE (link encapsulation).
References
1.Tanenbaum, A. S., 'Computer Networks', Prentice Hall, 1988
2.Stallings, W., `Local & Metropolitan Area Networks', Macmillan, 1995
3.Nim K. Cheung, ``The Infrastructure for Gigabit Computer Networks'', IEEE
Communications Magazine, April, 1992, page 60.
4.E. Ross, `` An overview of FDDI: The Fiber Distributed Data Interface,'' IEEE Journal
on Selected Areas in Communications, pp. 1043--1051, September 1989.
5.CCITT COM XVIII-R1-E, February 1985
6. J.-Y. Le Boudec, ``The Asynchronous Transfer Mode: a tutorial,'' Computer Networks and
ISDN Systems 24 (1992), North-Holland, pp. 279-309.
7.Spragins, J.D., Hammond, J.L., and Pawlikowski, K., ``Telecommunications Protocols
and Design'', Addison Wesley, 1991.
8.Brett Glass, ``The Light at The End of The LAN,'' BYTE, pp269-274, July 1989.
9.Floyd Ross, ``FDDI-Fiber, Farther, Faster,'' Proceedings of the 5th Annual Conference
on Local Computer Networks, April 8-10, 1986.
10.Marjory Johnson, ``Reliability Mechanisms of the FDDI High Bandwidth Token Ring
Protocol,'' Proceedings of the 10th Annual Conference on Local Computer Networks,
October 1985.
11.Michael Teener, ``FDDI-II Operation and Architectures,'' Proceedings of the 14th
Annual Conference on Local Computer Networks, October 1989.
12.William Stallings, ``FDDI Speaks,'' Byte, Vol. 18, No. 4, April 1993, page 199.
13.ANSI X3.148, American National Standards Institute, Inc., ``Fiber Distributed Data
Interface (FDDI) - Token Ring Physical Layer Protocol (PHY),'' X3.148-1988.
14.ANSI X3.139, American National Standards Institute, Inc., ``Fiber Distributed Data
Interface (FDDI) - Token Ring Media Access Control (MAC),'' X3.139-1987.
15.Tangemann and K. Sauer, ``Performance Analysis of the Timed Token Protocol of
FDDI and FDDI-II,'' IEEE Journal on Selected Areas in Communications, Vol. 9, No.
2, February 1991.
16.Saunders, ``FDDI Over Copper:Which Cable Works Best?,'' Data Communications,
November 21, 1991.
17.Wilson, ``FDDI Chips Struggle toward the Desktop,'' Computer Design, February
1991.
18.J. Strohl, ``High Performance Distributed Computing in FDDI Networks,'' IEEE LTS,
May 1991.
19.Stallings, ``Handbook of Computer-Communications Standards,'' Vol. 2, Local
Network Standards, Macmillan Publishing Company, 1987.
20.William Stallings, ``Networking Standards: A guide to OSI, ISDN, LAN, and MAN
standards'', Addison-Wesley, 1993.
21.Jain, ``Performance Analysis of FDDI Token Ring Networks: Effect of Parameters
and Guidelines for Setting TTRT,'' IEEE LTS, May 1991.
22.Fdida and H. Santoso, ``Approximate Performance Model and Fairness Condition of
the DQDB Protocol,'' in High-Capacity Local and Metropolitan Area Network, edited
by G. Pujolle, NATO ASI Series, Vol. F72, pp. 267-283, Springer-Verlag Berlin
Heidelberg 1991.
23.Mark J. Karol, and Richard D. Gitlin, "High performance Optical Local and
Metropolitan Area Networks: Enhancements of FDDI and IEEE 802.6 DQDB," IEEE
J. on Selected Areas in Communications, Vol. 8, No. 8, October 1990, pp. 1439-1448.
24.Marco Conti, Enrico Gregori, and Luciano Lenzini, "A Methodology Approach to an
Extensive Analysis of DQDB Performance and Fairness," IEEE J. on Selected Areas
in Comm, Vol. 9, No. 1, January 1991, pp. 76-87.
25.Fiber Channel Standard-XT/91-062.
26.American National Standard for Information Systems, Fiber Channel: Fabric
Requirements (FC-FG), FC-GS-92-001/R1.5, May 1992.
27.Cummings, "New Era Dawns for Peripheral Channels, " Laser Focus World,
September 1990, pp. 165-174.
28.Cummings, ``Fiber Channel - the Next Standard Peripheral Interface and More,''
FDDI, Campus-wide and Metropolitan Area Networks SPIE Vol. 1362, 1990, pp.
170-177.
29.J. Cypser, ``Communications for Cooperating Systems: OSI, SNA, and TCP/IP,''
Addison-Wesley, Reading, 1991.
30.Doorn, ``Fiber Channel Communications: An Open, Flexible Approach to
Technology,'' High Speed Fiber Networks and Channels SPIE Vol. 136, 1991, pp.
207-215.
31.Savage, ``Impact of HIPPI and Fiber Channel Standards on Data Delivery,'' Tenth
IEEE Symposium on Mass Storage, May 1990, pp. 186-187.
32.American National Standard for Information Systems, Fiber Channel: Physical and
Signaling Interface (FC-PH), FC-P-92-001/R2.25, May 1992.
33.Martin De Prycker, ``Asynchronous Transfer Mode: Solution for Broadband ISDN'',
2nd edition, ELLIS HORWOOD, 1993.
34.ITU-T 1992 Recommendation I.362, ``B-ISDN ATM Adaptation Layer (AAL)
Functional Description,'' Study Group XVIII, Geneva, June 1992.
35.ITU-T 1992 Recommendation I.363, ``B-ISDN ATM Adaptation Layer (AAL)
Specification,'' Study Group XVIII, Geneva, June 1992.
36.Bae et.al., ``Survey of Traffic Control Schemes and Protocols in ATM Networks,''
Proceedings IEEE, Feb. 1991, pp. 170-188.
37.G.Gallassi, G. Rigolio, and L. Fratta, ``ATM: Bandwidth Assignment and Bandwidth
enforcement policies, '' Proceedings of IEEE GLOBECOM 1989, pp. 49.6.1-49.6.6.
38.Jacquet and Muhlethaler, `An analytical model for the High Speed Protocol DQDB',
High Capacity Local and Metropolitan Area Networks, Springer-Verlag, 1990
39.Bertsekas and Gallagher, Data Networks, Prentice Hall, 1995
40. Pahlavan, K. and Levesque, A. "Wireless Information Networks", New York, Wiley
1995
41. Davis, P.T. and McGuffin C.R. "Wireless Local Area Networks - Technology, Issues
and Strategies", McGraw Hill, 1995
42. Infiniband Architecture Specification, Volume 1, Release 1.1, Infiniband Trade
Association
43. H. J. Chao and N. Uzun, "An ATM Queue Manager Handling Multiple Delay and
Loss Priorities," IEEE/ACM Transactions on Networking, Vol. 3, NO. 6, December
1995.
44. C.R. Kalmanek, H. Kanakia, S. Keshav, ``Rate Controlled Servers for Very High-Speed Networks,'' Proceedings of the IEEE GLOBECOM, 1990.
45. I. Dalgic, W. Chien, and F.A. Tobagi, "Evaluation of 10BASE-T and 100BASE-T
Ethernets Carrying Video, Audio and Data Traffic," INFOCOM'94, Vol. 3, 1994, pp. 1094-1102.
46. Gopal Agrawal, Biao Chen, and Wei Zhao. Local synchronous capacity allocation
schemes for guaranteeing message deadlines with the timed token protocol. In IEEE
Infocom 93, volume 1, 1993.
Chapter 4
High Speed Communication Protocols
Objective of this chapter:
The main objective of this chapter is to review the basic principles of designing
High Speed communication protocols. We will discuss the techniques that have
been proposed to develop High Speed transport protocols to alleviate it of the
slow-software-fast transmission bottleneck associated with High Speed networks.
Key Terms
QoS, TCP, IP, UDP, Active Network, SIMD, MISD, Parallelism
With the recent developments in computer communication technologies, it is now
possible to develop High Speed computer networks that operate at data rates in the gigabit
per second range. These networks have enabled high performance distributed computing
and multimedia applications. The requirements of these applications are more diverse and
dynamic than those of traditional data applications (e.g., file transfer). The existing
implementations of standard communication protocols do not efficiently exploit the high
bandwidth offered by high speed networks and consequently cannot provide applications
with the high throughput and low latency communication services they require. Moreover, these
standard protocols neither provide the flexibility needed to select the appropriate service
that matches the needs of a particular application, nor do they support guaranteed Quality
of Service (QoS). This has intensified the efforts to develop new high performance
communication protocols that can utilize the enormous bandwidth offered by High Speed
networks.
4.1 Introduction
When networks operated at several Kbps, computers had sufficient time to receive,
process and store incoming packets. However, with high speed networks operating at
gigabit per second (Gbps) or even terabit per second (Tbps) rates, the existing standard
communication protocols are not able to keep up with the network speed. Protocol
processing time becomes a significant portion of the total time to transmit a packet. For
example, a receiving unit has 146 milliseconds (146×10^-3 seconds) to process a packet (of
size 1024 bytes) when the network operates at 56 Kbps, while the same receiving unit has
only about 8 microseconds (8×10^-6 seconds) to process a packet of the same size when the
network operates at 1 Gbps [Port, 1991].
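The figures quoted above follow directly from the packet size and the link rate; the short calculation below reproduces them, taking the gigabit case as exactly 1 Gbps.

    PACKET_BITS = 1024 * 8            # a 1024-byte packet

    def processing_budget(rate_bps):
        # Time between back-to-back packets at the given link rate, i.e. the
        # time available to process one packet before the next one arrives.
        return PACKET_BITS / rate_bps

    print(f"56 Kbps: {processing_budget(56e3) * 1e3:.0f} ms")   # about 146 ms
    print(f"1 Gbps : {processing_budget(1e9) * 1e6:.1f} us")    # about 8.2 us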
In addition to the limited time a computer has to process incoming packets transmitted at
Gbps rates, new applications are emerging (e.g., interactive multimedia, visualization,
and distributed computing) that require protocols to offer a wide range of communication
services with varying quality of service requirements. Distributed computing increases
the burden on the communication protocols with additional requirements such as
providing low latency and high throughput transaction-based communications and group
communication capabilities. Distributed interactive multimedia applications involve
transferring video, voice, and data within the same application, where each media type
has different quality of service requirements. For example, a full motion video requires
high-throughput, low-delay, and modest error rate, whereas bulk data transfer requires
high throughput and error free transmission.
The current implementations of standard transport protocols are unable to exploit
efficiently high-speed networks, and furthermore cannot adequately meet the needs of the
emerging network-based applications. Many researchers have been looking for ways to
solve these problems. The proposed techniques followed one or more of three
approaches: improve the implementation of standard transport protocols, introduce new
adaptive protocols, or implement the protocols using specialized hardware. In this
chapter we first present the basic functions of transport protocols such as connection
management, acknowledgment, flow control and error handling, followed by a discussion
on the techniques proposed to improve latency and throughput of communication
protocols.
4.2 Transport Protocol Functions
The transport layer is the fourth layer of the ISO OSI reference model. The transport
layer performs many tasks such as managing the end-to-end connection between hosts,
detecting and correcting errors, delivering packets in a specified order to higher layers,
and providing flow control [Port, 1991]. Transport protocols are required to utilize the
network communication services, whether they are reliable or not, and provide
communication services that meet the quality of service of the transport service users.
Depending on the disparity between the network services and the required transport
services, the design and implementation of the transport protocol functions vary
substantially. The main functions of a transport protocol are: 1) connection management
that involves initiating and terminating a transport layer association; 2) acknowledgment
of received data; 3) flow control to govern the transmission rate in order to avoid over
runs at the receiver and/or to avoid congesting the network; and 4) error handling in order
to detect and correct errors that occur during data transmission, if required by the transport
service users.
4.2.1 Connection Management
Connection management mechanisms address the techniques that can be used for
establishing, controlling and releasing the connections required to transfer data between
two or more entities. They also deal with synchronizing the state of the communicating
entities. State information is defined as the collection of data needed to
represent the state of the transmission (e.g., the amount of data already received, the rate of
transmission, the packets already acknowledged, etc.). Transport protocols can be classified
based on the amount of state information they maintain into two types: connectionless
and connection-oriented transport protocols.
In connectionless transport protocols, data are exchanged in datagrams without the need
to establish connections. The transport system pushes the datagrams into the network that
then attempts to deliver the datagrams to their destinations. There is no guarantee that
datagrams will be delivered or that they will be delivered in the order of their
transmission. Connectionless systems provide ``best-effort'' delivery. In this service,
there is no state information that needs to be retained at either the sender or receiver and
therefore the transport service is not reliable.
In connection-oriented transport protocols, a connection is established between the
communicating entities. Data is then transferred through the connection. The connection
is released when all the data has been transmitted. During the transfer, the state of the
entities is synchronized to maintain flow and error control, which ensures the reliability
of the transport system.
The main functions associated with connection management can be described in terms of
the techniques used to achieve signaling, connection setup and release, selection of
transport services, multiplexing, control information, packet formats, and buffering
techniques [Doer, 1990].
Signaling
The connection management functions can vary depending on the signaling protocol.
Signaling protocol is used to setup and release connections and for exchanging state
information. The signaling information can be transferred in two ways: on the same
connection used for data transfer (in-band signaling) or on a separate connection (out-of-band signaling). In-band signaling increases the load on the receiving unit because
for every incoming packet it needs to determine whether the received packet is a
data packet or a control packet. Out-of-band signaling separates the data and control
information and sends them on different connections. Consequently, the transport
protocol can support additional services (billing or security) without directly impacting
the performance of the data transfer service.
Connection Setup and Release
The connection management of setup and release can follow two techniques: Handshake
(explicit) schemes or implicit (timer-based) schemes. In explicit connection setup,
handshake control packets are exchanged between the source and destination nodes to
establish the connection before the data is transferred. This means that there is extra
overhead in terms of packet exchanges and delay. The handshaking can follow either a
two or three way protocol. In addition to exchanging messages according to the
handshake protocol, information about the connection must be maintained for a given
period of time to ensure that both sides of a connection close reliably.
In implicit connection setup no separate packets are exchanged to establish the
connection. Instead, the connection is established implicitly by the first data packet sent
and later the connection is released upon the expiration of a timer. Similarly in handshake
schemes, state information must be maintained at both ends of the connection for a
predetermined period of time after the last successful exchange of data. Determining the
time period to hold state information is a difficult task, especially in networks that
experience unbounded network delays.
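As a small illustration of the timer-based approach, the Python sketch below keeps per-connection state that is created implicitly by the first data packet and discarded after a fixed idle period; the 30-second lifetime is an arbitrary assumed value, not one prescribed by any particular protocol.

    import time

    STATE_LIFETIME = 30.0    # assumed idle period after which state is discarded

    class ImplicitConnectionTable:
        """Connection state created by the first data packet and released by timer expiry."""

        def __init__(self):
            self.last_activity = {}    # (src, dst) -> time of last data packet

        def on_data_packet(self, src, dst):
            # The first packet from (src, dst) implicitly establishes the connection;
            # every later packet simply refreshes its timer.
            self.last_activity[(src, dst)] = time.monotonic()

        def expire(self):
            # Release connections whose timer has run out.
            now = time.monotonic()
            for key in [k for k, t in self.last_activity.items()
                        if now - t > STATE_LIFETIME]:
                del self.last_activity[key]

    table = ImplicitConnectionTable()
    table.on_data_packet("hostA", "hostB")   # connection exists from the first packet on
    table.expire()                           # state is still fresh, so nothing is removed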
Selecting Transport Service
The transport protocol defines several mechanisms in order to adjust its services
according to the underlying network services and the needs of the
transport service users. Most protocols (e.g., TCP, OSI/TP4, XTP) negotiate the parameters
(maximum packet size, timeout values, retry counters and buffer sizes) during the setup
of a connection. Other protocols (e.g., TCP, OSI/TP4, XTP, VMTP) update some
parameters (e.g., flow control parameters and sequence numbers) continuously during the
data transfer phase in connection oriented service mode. In addition to this, some
transport protocols support additional operation modes (e.g., no error control mode, block
error control mode) that can be selected dynamically during the data transfer phase.
In general, having the flexibility to choose the protocol mechanisms and parameters,
modify those parameters, or change the mode of operation is an important property for
supporting the emerging high performance network applications that vary significantly in
their quality of service requirements.
Multiplexing
Multiplexing is defined as the ability to carry several data connections from one layer
over a single connection provided by the lower layer [Doer, 1990]. Transport protocols
can be classified based on their ability to map several transport connections over one
single network connection. For example, TCP, OSI/TP4 and XTP support multiplexing at
the transport layer. In connection-oriented networks, the virtual circuit is used to
distinguish between the multiple transport connections that share the same network
connection. In datagram networks, transport layer packets are identified by the
association of the source and destination addresses. These packets share one single
network connection. Other protocols such as Datakit and APPN [Doer, 1990] do not
support multiplexing and there is one-to-one correspondence between transport and
network connections.
Multiplexing provides a cost-effective way to share the network resources, but it incurs
extra overhead in demultiplexing the packets at the receiver side and making sure the
shared resources are used fairly by all the transport layer connections or packets.
Furthermore, in High Speed broadband networks (e.g., ATM), the benefit of multiplexing
at the transport layer is questionable.
Control Information
The transport protocol entities at both ends exchange control information to
synchronize their states, acknowledge the correct or incorrect receipt of packets, perform flow
control, etc. The control information can all be packed into one control message that is
transmitted during the exchange of control state information. Another approach is to use
separate messages to pass information about some control state variables, or to put them in
the packet headers. Some protocols exchange control information periodically,
independent of other events related to the transport connection. This significantly
simplifies the error recovery scheme and facilitates parallel processing.
Packet Formats
The packet format defines the layout of the packet fields that carry control information and
user data. The size and the alignment of packet fields have a significant impact on the
performance of processing these packets. In the 1970s, when networks were slow, the
main bottleneck was the packet transmission delay. That led to the use of complex
packet formats with variable field lengths in order to reduce the amount of information
transmitted in the network. This increased the complexity of packet processing,
which was then acceptable because computers had plenty of time to decode and process
these packets. However, with high speed networks, care must be taken to use fixed size
fields and properly align them in order to reduce packet decoding time and allow parallel
processing of the packet fields. For example, the XTP protocol makes most protocol header
fields 32 bits wide. Furthermore, the placement of the fields can significantly impact the
performance of packet processing. One can put the information that defines the packet in
the header to simplify decoding and multiplexing, whereas the checksum field should be
placed after the data being protected so that it can be computed in parallel with data
transmission or reception.
Buffer Management
Buffering at the transport layer involves writing data to a buffer as it is received from the
network, formatting the data for the user by sorting packets and separating data from
headers, moving the data to a space accessible by the user, and populating buffers for
outgoing packets.
memory read and write instructions since memory operations consume a large percentage
of the total processing time [Clar, 1989].
One optimization is to calculate the checksum as the data is being transferred from the
network to the user space. This requires that the CPU move the data, rather than a DMA
controller. This optimization has been shown to reduce the processing time required to
calculate the checksum and perform the memory-to-memory copy by 25%[Clar, 1989].
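The idea can be sketched as follows in Python: the 16-bit one's complement checksum is accumulated in the same loop that copies the data into the user buffer, so each byte is touched only once. This is only a model of the optimization; production implementations perform it in the data-copy routine itself, and the 25% figure above is the measurement reported in [Clar, 1989].

    def copy_and_checksum(src: bytes, dst: bytearray) -> int:
        # Copy src into dst while accumulating the 16-bit one's complement sum,
        # so the data is read from memory only once.
        csum = 0
        for i in range(0, len(src), 2):
            word = src[i] << 8 | (src[i + 1] if i + 1 < len(src) else 0)
            csum += word
            csum = (csum & 0xFFFF) + (csum >> 16)   # fold the carry back in
            dst.extend(src[i:i + 2])                # the "copy" half of the loop
        return (~csum) & 0xFFFF                     # one's complement of the sum

    user_buffer = bytearray()
    checksum = copy_and_checksum(b"example payload", user_buffer)
    print(hex(checksum), bytes(user_buffer))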
4.2.2 Acknowledgment
Acknowledgment is used to notify the sender of the successful receipt of user data and can be sent
either as explicit control messages or within the header of a transmitted message. For
example, in request-response protocols (e.g., VMTP), the response to each request is
used as an acknowledgment of that request. In general, acknowledgments are generated at
the receiver in response to explicit events (sender-dependent acknowledgment) or
implicit events (sender-independent acknowledgment). In the sender-dependent
acknowledgment scheme, the receiver generates an acknowledgment either after each received
packet or after a certain number of packets, depending on the availability of resources
and how acknowledgments are used in the flow and error control schemes as will be
discussed in the next two subsections. In the sender-independent acknowledgment
scheme, the receiver periodically acknowledges messages received under timer control.
This scheme simplifies the receiver tasks since it eliminates the processing time required
to determine whether or not acknowledgments need to be generated when messages
arrive at the receiver side.
4.2.3 Flow Control
Flow Control mechanisms are needed to ensure that the receiving entity, within its
resources and capabilities, is able to receive and handle the incoming packets. Flow
control plays an important role in high speed networks because packets could be
transmitted at rates that overwhelm a receiving node. The node has no option except to
discard packets and that will consequently lead to a severe degradation in performance.
The transmission rate must be limited by the sustainable network transmission rate, and
the rate at which the transport protocol can receive, process, and forward data to its users.
Exceeding either of these two rates will lead to congestion in the network or to data loss
caused by overrunning the receiver. The main goal of the flow control scheme is to
ensure that these restrictions on the transmission rates are not violated. The restriction
imposed by the receiver can be met by enforcing an end-to-end flow control scheme,
whereas the network transmission rate restriction can be met by controlling the access to
the network at the network layer interface.
The access control used by network nodes to protect their resources against congestion can
be either explicit or implicit. In an explicit scheme, as in the XTP protocol, the transmitter
initially uses default parameters for its transmission. Later, the receiver modifies the flow
control parameters. Furthermore, the flow control parameters can also be modified by the
network nodes in order to protect themselves against congestion. In other protocols, an implicit
network access scheme is used. Round trip delays are used as an indication of network
congestion. When this delay exceeds a certain limit, the transmitter significantly reduces its
transmission rate. The transmission rate is then increased slowly back to its normal value
according to an adaptive algorithm (e.g., the slow-start algorithm in the TCP protocol).
The two main methods for achieving end-to-end flow control are window flow control
and rate flow control. In window flow control, the receiver specifies the maximum
amount of data that it can receive at one time until the next update of the window size.
The window size is based on many factors such as the available buffer size at the
receiver, and the maximum round trip network delay. The transmitter stops sending
packets when the number of outstanding, unacknowledged, packets is equal to the
window size. The window size is updated with the receipt of acknowledgments from the
receiver and consequently allows the transmitter to resume its transmission of packets.
Window update can be either accumulative (window update value is added to the current
window) or absolute (the window update specifies a new window from a reference point
in the packet number space). In absolute window update scheme, it is possible for the
receiver to reduce the size of the window whenever it is required by the flow control
scheme.
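A window-based sender can be sketched in a few lines: it keeps transmitting while the number of outstanding packets is below the advertised window, and acknowledgments re-open the window. The window size and packet numbering used here are illustrative assumptions only.

    class WindowSender:
        """Stop sending once the number of outstanding (unacknowledged) packets
        equals the window advertised by the receiver."""

        def __init__(self, window_size):
            self.window = window_size
            self.next_seq = 0
            self.unacked = set()

        def can_send(self):
            return len(self.unacked) < self.window

        def send(self):
            assert self.can_send(), "window closed: wait for acknowledgments"
            seq = self.next_seq
            self.unacked.add(seq)
            self.next_seq += 1
            return seq                  # packet number handed to the network

        def on_ack(self, seq):
            self.unacked.discard(seq)   # an acknowledgment re-opens the window

    sender = WindowSender(window_size=4)
    while sender.can_send():
        sender.send()                   # packets 0..3 go out, then the sender stalls
    sender.on_ack(0)
    print(sender.can_send())            # True: the ACK opened one slot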
In rate control, timers are used to control the transmission rate. The transmitter needs to
know the transmission rate and the burst size (the maximum amount of data that may be
sent in a burst of packets). The transmitter sends data based on one or more state variables that
specifies the rate at which the data can be transmitted. A variant of the rate control
scheme is to use the interpacket time (the minimum amount of time between two
successive packet transmissions) instead of the transmission rate and the burst size.
Some protocols (e.g., XTP and NETBLT) combine the two schemes: they initially use
rate control for transmission as long as the window remains open and switch to the window-based
scheme once the window is closed. It is believed that rate control is more suitable for
high speed networks since it has low overhead and the rate can be adjusted to match the
network speed; rate flow control has low overhead since it exchanges little control
information (when it is required to adjust the transmission rate), and it is independent of
the acknowledgment scheme used in the protocol [Doer, 1990].
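The interpacket-time variant of rate control amounts to enforcing a minimum gap between successive transmissions, as in the small sketch below; the 1 Mbps rate and 1024-byte packet size are arbitrary example values, not parameters taken from any particular protocol.

    import time

    RATE_BPS = 1_000_000      # assumed negotiated transmission rate: 1 Mbps
    PACKET_BYTES = 1024       # assumed packet size

    # Minimum time between two successive packet transmissions at this rate.
    INTERPACKET_TIME = PACKET_BYTES * 8 / RATE_BPS

    def paced_send(packets, transmit):
        # Transmit packets no faster than the agreed rate by spacing them
        # at least INTERPACKET_TIME apart.
        next_slot = time.monotonic()
        for pkt in packets:
            delay = next_slot - time.monotonic()
            if delay > 0:
                time.sleep(delay)       # wait for the next transmission slot
            transmit(pkt)
            next_slot += INTERPACKET_TIME

    paced_send([b"x" * PACKET_BYTES] * 3,
               transmit=lambda p: print(len(p), "bytes sent"))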
4.2.4 Error Handling
Error handling mechanisms are required to recover from lost, damaged or duplicated
data. If the underlying network does not satisfy the transport user's requirement of reliable
data transmission, then transport protocols must provide error detection, error reporting
once errors are detected, and error correction.
Error Detection: Sequence numbers, length field, and checksums can be used for error
detection. Sequence numbers are used to detect missing, duplicate, and out-of-sequence
packets. These sequence numbers can also be either packet-based (as in OSI/TP4,
NETBLT, VMTP) or byte-based (as in TCP, XTP). A length field can be used to detect
the complete delivery of data. The length field is placed either in the headers or the
trailers. VMTP protocol uses a slightly different scheme to indicate the length of the
packet. It uses a bitmap to indicate the group of fixed size blocks that are present in any
given packet.
Checksums are used at the receiver to detect corrupted packets incurred during the
transmission. Checksums can be applied to the entire packet or to only headers or
trailers. For example, TCP checksums the whole transport protocol data unit (TPDU),
while XTP, VMTP and NETBLT apply the checksum to the header with an optional data
checksum.
Error Reporting: To speed up the recovery from errors, the transmitter should be
notified once the errors are detected. The error reporting techniques include negative or
positive acknowledgment and selective reject. In negative acknowledgment (NACK),
the NACK is used to identify the point from which data is lost or missing. In selective
reject, more information is provided to indicate all the data that the receiver is missing.
Error Correction: The error recovery in most protocols is done by retransmission of the
erroneous data. However, they differ in the technique used to trigger retransmission of
the data such as timeouts or error reporting, and the amount of information they keep at
the receiver side; some protocols might discard out-of-sequence data or temporarily buffer
it until the missing data is received so the receiver can perform resequencing. Two
schemes are normally used in error correction: Positive Acknowledgment with
Retransmission (PAR) and Automatic Repeat Request (ARQ).
In PAR, the receiving entity will acknowledge only correct data. Damaged and lost data
will not be reported. Instead, the transmitter will timeout and retransmit the
unacknowledged data using a go-back-n scheme. In go-back-n, all packets starting from
the unacknowledged (lost or corrupted) packets will be retransmitted. Other protocols
rely on the information provided by error reporting and can thus perform selective
retransmission. This method can be used to recover from any error type and simplifies the
receiving entity's tasks.
In Automatic Repeat Request (ARQ), the receiving entity informs the transmitter of the
status of the received data, and based on this information the transmitter determines which
data needs to be retransmitted. The transmitter can use two types of retransmission: go-back-n or selective retransmission. In selective retransmission, only the lost (or
damaged) data will be retransmitted. The receiving entity must buffer all packets that are
received after the damaged (or lost) data and then re-sequence them after it receives the
retransmitted data. With the go-back-n scheme, the transmitter begins retransmission of data
from the point where the error is first detected and reported. The advantage of the go-back-n scheme is that the resulting protocol is simple, since the receiver sends only
a NACK when it detects errors or missing data and does not have to buffer out-of-sequence
data. However, the main disadvantage of this scheme is the bandwidth wasted in
retransmission of successfully received data, because of the gap between what data has been
received correctly at the receiver side and what has been transmitted at the sender side.
As will be discussed later, the cost of this scheme can be very significant in high speed
networks with large propagation delays; for the propagation delay period, the senders can
transmit a huge amount of data that needs to be stored in the network. Retransmission of
successfully received data can be avoided by using selective retransmission. This scheme
is effective if the receiver has enough buffer memory to store data corresponding to two
or three times the bandwidth-delay product [Bux, 1988]. However, this scheme is not
effective if the network error rate is high.
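The cost argument can be made concrete with the bandwidth-delay product. Assuming, purely for illustration, a 1 Gbps path with a 50 ms round-trip delay, the amount of data in flight during one round trip and the receiver buffer suggested by the two-to-three-times rule of [Bux, 1988] come out as follows:

    RATE_BPS = 1e9      # assumed link rate: 1 Gbps
    RTT_S = 0.050       # assumed round-trip delay: 50 ms

    bdp_bytes = RATE_BPS * RTT_S / 8    # data in flight during one round trip
    print(f"bandwidth-delay product: {bdp_bytes / 1e6:.2f} MB")       # 6.25 MB
    print(f"selective-retransmission buffer (2-3x): "
          f"{2 * bdp_bytes / 1e6:.1f}-{3 * bdp_bytes / 1e6:.1f} MB")  # 12.5-18.8 MB

With go-back-n, roughly this whole window of successfully delivered data may be retransmitted after a single loss, which is exactly the wasted bandwidth described above.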
4.3 Standard Transport Protocols
The most commonly used transport protocols include the TCP/IP suite and OSI ISO TP4.
The OSI transport protocol [Tann, 1988] became an ISO international standard in
1984 after a number of years of intense activity and development. To handle different
types of data transfers and the wide variety of networks that might be available to provide
network services, five classes have been defined for the ISO TP protocol. These are
labeled classes 0, 1, 2, 3, and 4. Class 0 provides service for teletex terminals. Classes 2, 3,
and 4 are successively more complex, providing more functions to meet specified service
requirements or to overcome problems and errors that might be expected to arise as the
network connections become less reliable.
The Transmission Control Protocol (TCP) [Tann, 1988 and Jain, 1990a], originally
developed for use with the ARPAnet during the late 1960's and the 1970's, has been adopted
as the transport protocol standard by the U.S. Department of Defense (DoD). Over the
years it has become a de facto standard in much of the U.S. university community after
being incorporated into BSD Unix. The Internet Protocol (IP) was designed to support
the interconnection of networks using an Internet datagram service and it has been
designated as a DoD standard. Figure 4-1 shows the TCP/IP protocol suite and its
relationship to the OSI reference model.
[Figure content: the seven OSI layers (application, presentation, session, transport, network, data link, physical) shown side by side with the corresponding TCP/IP suite layers (application process, TCP, IP, communication network).]
Figure 4-1. The TCP/IP protocol suite and the OSI reference model.
4.3.1 Internet Protocol (IP)
The ARPANET network layer provides a connectionless and unreliable delivery system.
It is based on the idea of Internet datagrams that are transparently transported from the
source host to the destination host, possibly traversing several networks. The IP layer is
unreliable because it does not guarantee that IP datagrams ever get delivered or that they
are delivered correctly. Reliable transmission can be provided using the services of upper
layers. The IP protocol breaks up messages from the transport layer into datagrams or
packets of up to 64 Kilobytes (kB). Each packet is then transmitted through the Internet,
possibly being fragmented into smaller units. When all the pieces arrive at the
destination machine, they are reassembled by the transport layer to reconstruct the
original message. An IP packet consists of a header part and a data part. The header has a
20 byte fixed part and a variable length optional part. The header format is shown in
Figure 4-2.
Version (4 bits) | IHL (4 bits) | Type of service (8 bits) | Total Length (16 bits)
Identification (16 bits) | Flags (3 bits) | Fragment Offset (13 bits)
Time to Live (8 bits) | Protocol (8 bits) | Header Checksum (16 bits)
Source Address (32 bits)
Destination Address (32 bits)
Options (variable) | Padding
Figure 4-2. The Internet Protocol (IP) header
The fields of the IP packet are described below:
Version: This field indicates the protocol version used with the packet.
IHL: The Internet header length (in 32-bit words), needed since the header length is not fixed.
Type of service: packets can request special processing with various combinations of
reliability and speed. For example, for video data, high throughput with low delay is
far more important than receiving correct data with low throughput, and for text file
transfer, correct transmission is more important than speedy delivery.
Total length: This 16-bit field indicates the length of the packet, including both header and data.
It allows packets to be up to 65,535 bytes in size.
Identification: When a packet is fragmented, this field allows the destination host to
reassemble fragments as in the original packet. All the fragments of a packet contain
the same identification value.
Flags: This is a 3-bit field used to allow or prevent gateways from fragmenting the packet.
Fragment offset: When a gateway fragments a packet, the gateway sets the flags field
for each fragment except the last fragment and updates the fragment offset field to
indicate the fragment position in the packet. All fragments except the last one must be
a multiple of 8 bytes.
Time to live (TTL): An 8-bit TTL field is a counter to limit packet lifetimes. Since
different packets can reach the destination host through different routes, this field can
restrict packets from looping in the network.
Protocol: This field identifies the type of data and is used to demultiplex the packet to
higher level protocol software.
Header checksum: This field provides error detection for only the header portion in the
packet. It is useful since the header may change in fragments. It uses 16-bit arithmetic
to compute the one's complement of the sum of the header.
Source address and Destination address: Every protocol suite defines some type of
addressing that identifies networks and hosts. As illustrated in Figure 4-3, an Internet
address occupies 32 bits and indicates both a network ID and a host ID. It is
represented as four decimal digits separated by dots (e.g., 128.32.1.11). There are
four different formats.
Class A: 0    | netid (7 bits)  | hostid (24 bits)
Class B: 10   | netid (14 bits) | hostid (16 bits)
Class C: 110  | netid (21 bits) | hostid (8 bits)
Class D: 1110 | multicast address (28 bits)
Figure 4-3. Internet address formats.
If a network has a lot of hosts, then class A addresses can be used because class A assigns 24
bits to the host ID and can thus accommodate up to 16 million hosts. Class C
addresses allow many more networks (about 2 million) but fewer hosts per network.
Class D addresses are used for multicast. Any organization with an Internet address
of any class can subdivide the available host ID space to provide subnetworks (see Figure
4-4). This adds a subnetworking level to the internet address hierarchy: a
subnetid and a hostid are carved out of the hostid of the higher level of the hierarchy.
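The class of an address can be read off its leading bits, or equivalently the value of its first octet, as the short sketch below shows; the dotted-decimal example addresses are arbitrary.

    def address_class(dotted: str) -> str:
        # Determine the class of an IPv4 address from its leading bits.
        first_octet = int(dotted.split(".")[0])
        if first_octet < 128:       # leading bit  0    -> class A
            return "A"
        elif first_octet < 192:     # leading bits 10   -> class B
            return "B"
        elif first_octet < 224:     # leading bits 110  -> class C
            return "C"
        elif first_octet < 240:     # leading bits 1110 -> class D (multicast)
            return "D"
        return "E"                  # leading bits 1111 -> reserved

    for addr in ("10.1.2.3", "128.32.1.11", "192.168.7.9", "224.0.0.1"):
        print(addr, "-> class", address_class(addr))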
Standard class B:  10 | netid (14 bits) | hostid (16 bits)
Subnetted class B: 10 | netid (14 bits) | subnetid (8 bits) | hostid (8 bits)
Figure 4-4. Class B Internet address with subnetting
4.3.2 Transmission Control Protocol (TCP)
The transmission control protocol provides a connection-oriented, reliable, full-duplex,
byte-stream transport service to applications. A TCP transport entity accepts arbitrarily
long messages from upper layer (user processes), breaks them up into pieces (segments)
not exceeding 64 Kbytes, and sends each piece as a separate datagram. Since the IP layer
provides an unreliable, connectionless delivery service, TCP has the logic necessary to
provide a reliable, virtual circuit for a user process. TCP manages connections,
sequencing of data, end-to-end reliability, and flow control. Each segment contains a
sequence number that identifies the data carried in the segment. Upon receipt of a
segment, the receiving TCP returns to the sender an acknowledgment. If the sender
receives an acknowledgment within a time-out period, the sender transmits the next
segment. If not, the sender retransmits the segment assuming that the data was lost. TCP
has only one Transport Protocol Data Unit (TPDU) header format, which is at least 20
bytes long and is shown in Figure 4-5.
Figure 4-5. TCP data unit format: source port, destination port, sequence number, piggyback acknowledgment, TCP header length, flag bits (URG, ACK, EOM, RST, SYN, FIN), window, checksum, urgent pointer, options (0 or more 32-bit words), and data.
The fields of the TPDU are described below:
Figure 4-6. Network ID, host ID, and port number in an Ethernet internetwork: the network ID identifies an Ethernet segment, the host ID identifies a host on that segment, and the port number identifies a process on that host.
Source port and Destination port: These two fields identify the end points of the
connection; each end point is a 16-bit port number assigned by its host. Figure 4-6 shows
the relation between network ID, host ID and port number, while Figure 4-7 shows the
encapsulation of TCP data into an Ethernet frame.
Figure 4-7. Encapsulation in each layer for TCP data on an Ethernet: the application data and the 16-bit TCP source and destination ports travel in the TCP header, the TCP segment is carried in an IP datagram (protocol field = TCP, 32-bit source and destination addresses), and the IP datagram is carried in an Ethernet frame (frame type = IP, 48-bit source and destination addresses) between the 14-byte Ethernet header and the 4-byte Ethernet trailer.
Sequence number: This field gives the sequential position of the first byte of a segment.
Acknowledgment (ACK): This field is used for positive acknowledgment and flow
control; it informs the sender how much data has been received correctly, and how
much more the receiver can accept. The acknowledgment number is the sequence
number of the next byte the receiver expects, implying that all earlier bytes have arrived correctly.
TCP header length: this field gives the length of the TCP header in 32-bit words, analogous to the header length field in the IP header.
Flag fields (URG, ACK, EOM, RST, SYN, FIN): The URG flag is set if the urgent pointer
is in use; SYN indicates connection establishment; ACK indicates whether the
piggyback acknowledgment field is in use (ACK = 1) or not (ACK = 0); FIN
indicates connection termination; RST is used to reset a connection that has
become ambiguous due to delayed duplicate SYNs or host crashes; and EOM
indicates End of Message.
Window: Flow control is handled using a variable-size sliding window. The window
field contains the number of bytes the receiver is able to accept. If the receiving TCP
does not have enough buffer space, it sets the window field to zero, which stops
transmission until the sender receives a non-zero window value.
Checksum: It is a 16-bit field to verify the contents of the segment header and data. TCP
uses 16-bit arithmetic and takes one's complement of the one's complement sum.
Urgent pointer: It allows an application process to direct the receiving application
process to accept urgent data. If the urgent pointer marks data, the receiving TCP tells the
application to go into urgent mode and retrieve the urgent data.
4.3.3 User Datagram Protocol (UDP)
Not every application needs the reliable connection-oriented service provided by TCP.
Some applications only require IP datagram delivery service and thus exchange messages
over the network with a minimal protocol overhead. UDP is an unreliable connectionless
transport protocol using IP to carry messages between source and destination hosts.
Unreliable merely means that there are no techniques (acknowledgment, retransmission
of lost packets) in the protocol for verifying whether or not the data reached the
destination correctly.
There are a number of reasons for choosing UDP as a data transport protocol. If the size
of the data to be transmitted is small, the overhead of creating connections and ensuring
reliability may be greater than the cost of simply retransmitting the entire data set when
necessary. Like TCP, UDP defines source and destination ports and a checksum. The
length field gives the length in bytes of the packet's header and data. Figure 4.8 shows
the UDP packet format.
Figure 4.8. UDP packet format: source port, destination port, length, and checksum, followed by data.
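As a concrete illustration of this minimal, connectionless service, the sketch below exchanges one datagram over the loopback interface with the standard socket API; the port number is arbitrary, and delivery is only assured here because both ends are on the same machine.

import socket

PORT = 50007   # arbitrary port chosen for this example

# Receiver: no connection setup and no acknowledgments; just bind and read datagrams.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", PORT))

# Sender: each sendto() becomes one UDP packet (8-byte header) inside one IP datagram.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"hello over UDP", ("127.0.0.1", PORT))

data, addr = receiver.recvfrom(2048)
print(data, addr)

sender.close()
receiver.close()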
4.4 Problems with Current Standard Transport Protocols
The current standard protocols were designed in the 1970s, when transmission rates were
low (in the Kbps range) and links were unreliable. There is an extensive debate in the research
community to determine whether or not these standard protocols are suitable for high
speed networks and can provide distributed applications the required bandwidth and
quality of service. Many studies have shown that protocol processing, data copying and
operating system overheads are the main bottlenecks that prevent applications from
exploiting the full bandwidth of high speed networks. In this subsection, we highlight the
problems of the current implementation of standard transport protocol functions when
used in high speed networks.
4.4.1 Connection Management
Existing standard protocols may not be flexible enough to meet the requirements of a
wide range of distributed applications. For example, the TCP protocol does not provide a
mechanism for fast call setup or multicast transmission, two important features for
distributed computing applications. Such applications typically need short-lived
connections, and the overhead associated with explicit connection setup is prohibitively
expensive. Therefore, the option of implicit connection setup, in which data is transmitted
along with the connection setup packet, is a desirable feature; it is used in newer transport
protocols such as the Versatile Message Transaction Protocol (VMTP) and the Xpress
Transfer Protocol (XTP). Furthermore, the response or acknowledgment packets can also
carry response data; in applications such as database queries and requests for file
downloads, it is desirable that the response both contain data and acknowledge the
connection setup packet or the receipt of the request.
Packet Format
A characteristic of older transport protocols is that their packet formats were designed to
minimize the number of transmitted bits. These protocols were designed when network
transmission rates were in the Kbps range, several orders of magnitude slower than the
rates of emerging high speed networks. The result is bit-packed packet formats that
require extensive decoding: fields often have variable sizes to avoid transmitting
unnecessary bits, and may change location in different packet types. This design, while
conserving bits, leads to unacceptable delays in acquiring, processing and storing packets
in high speed networks; in an ATM network operating at the OC-3 transmission rate
(155 Mbps), the computer has only a few microseconds to receive, process and store
each incoming ATM cell.
Consequently, the packet structure for high-speed protocols is critical. All fields within a
packet must be of fixed length and should fall on byte or word (usually multiple byte)
boundaries, even if this requires padding fields with null data. This leads to simpler
decoding and faster software implementation. In addition, by placing header information
in the proper order, parallel processing can reduce significantly the time needed to
process incoming packets. Furthermore, efficient transmission requires the packet size to
be large. The packet size in current transport protocols is believed to be too small for high
speed networks. It is desirable to make the packet size large enough to exploit the high
speed transmission rate and accommodate applications with bulk data transfer
requirement.
Buffer Management
Buffer management processing is critical to achieve high performance transport protocol
implementation. The main responsibilities of buffer management are writing data to a
buffer as it is received, forwarding data to buffers for retransmission and reordering out
of sequence packets. One study has shown that 50% of TCP processing time is used for
network to memory copying [Port, 1991]. In layered network architectures, data is copied
between the buffers of the different layers, which introduces a significant overhead. One
solution, proposed in [Wood, 1989], is buffer cut-through: one buffer is shared between
the different layers and only the address of the packet is moved between them, which
reduces the number of data copies. For bulk data transfers with large packet sizes, this
approach provides a significant performance improvement.
Another important issue is determining the buffer size that must be maintained at both the
sender and receiver ends. For the sender to transfer data reliably, it needs to keep in its
buffer all the packets that have been sent but not yet acknowledged. Furthermore, to
avoid the stop-and-wait problem in high speed networks, the buffer size should be large
enough to allow the sender to transmit packets in flight before stopping and waiting for
acknowledgment. At the receiver side, it is important to have enough buffer space to store
packets received out of sequence. One can estimate the maximum amount of buffering
required to be enough to store all packets that can be transmitted during one-round trip
delay over the longest possible path [Part, 1990].
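The buffer-sizing rule above amounts to the bandwidth-delay product. The sketch below, with illustrative numbers, computes the amount of data that can be in flight during one round-trip delay and hence the minimum buffering needed to avoid stop-and-wait behavior.

def required_buffer_bytes(bandwidth_bps: float, rtt_seconds: float) -> int:
    # Bandwidth-delay product: the data a sender can have in flight
    # (sent but not yet acknowledged) during one round-trip time.
    return int(bandwidth_bps * rtt_seconds / 8)

# Illustrative values: an OC-3 link (155 Mbps) with a 60 ms round-trip delay
print(required_buffer_bytes(155e6, 0.060))   # about 1.16 MB of buffering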
4.4.2. Flow Control
The goal of flow control in a transport protocol is to match the data transmission rate with
the receiver’s data consumption rate. The flow control algorithms used in the current
transport protocols might not be suitable for high-speed networks [Port, 1991]. For
example, window-based flow control adjusts the flow of data by bounding the number of
packets that can be sent without any acknowledgement from the receiver. If the window
size is chosen too small or inappropriately, the transmitter will send short bursts of
data, separated by pauses while it waits for permission to transmit again. This is
referred to as lock-step transmission and should be avoided especially in high speed
networks. In high speed networks, one needs to open a large window to achieve high
throughput over long delay channels. But opening a large window has little impact on the
flow control because windows only convey how much data can be buffered, rather than
how fast the transmission should be. Moreover, the window mechanism ties the flow
control and error control, and therefore becomes vulnerable in the face of data loss [Clar,
1989].
4.4.3. Acknowledgment
Most current transport protocol implementations use cumulative acknowledgment. In
this scheme, acknowledging the successful receipt of a packet with sequence number N
indicates that all packets up to N have been successfully received. In
high speed networks with long propagation delays, thousands of packets could have been
transmitted before the transmitter receives an acknowledgement. If the first packet is
received in error, all the subsequent thousands of packets need to be retransmitted even
though they were received without error. This severely degrades the effective
transmission rate that can be achieved in high speed networks. Selective acknowledgment
has been suggested to make the acknowledgment mechanism more efficient. However, it
has more overhead as will be discussed in the next subsection. The concept of blocking
has also been suggested as a means to reduce the overhead associated with
acknowledgment [Netr, 1990].
4.4.4. Error Handling
Error recovery in most existing protocols relies extensively on timers, which are
inefficient and expensive to maintain. When packets are lost, most reliable transport
protocols use timers to trigger resynchronization of the transmission state. Variation in
Round Trip Delays (RTD)
requires that timers be set to longer than one RTD. Finding a good balance between a
timer value that is too short and one that is too long can be very difficult. The
performance loss due to timers is particularly high when the round trip time is long. The
retransmission of undelivered or erroneous data, which can include an entire window of
data or only the erroneous data, can be triggered either by transmitter timeouts while
waiting for acknowledgment or by the receiver issuing negative acknowledgments. The
receiver can use a go-back-N or a selective acknowledgment scheme for its
acknowledgments.
The Go-back-N algorithm is easy to implement, because once a packet is received in
error or out of sequence, all successive packets will be discarded. This approach reduces
the amount of information to be maintained and buffered. A selective acknowledgment
scheme has more overhead since it needs to store all the packets received out of order. In
high speed networks, the transmission error rate is low (on the order of 10^-9) and most of
the packet loss is due to network and receiver overruns. While Go-back-N is simple to
implement, it does not solve the problem of receiver overruns [Port, 1991]. Selective
acknowledgment schemes seem more suitable, and their bookkeeping overhead can be
reduced by acknowledging blocks instead of packets. If any packet in a block is delivered
incorrectly, the entire block is retransmitted. As the window (maximum number of
blocks to send) size increases, the throughput increases.
One of the efficient error detection/correction methods is to transmit complete state
information frequently and periodically between the sender and receiver regardless of the
state of the protocol. This simplifies the protocol processing, by removing some of the
elaborate error recovery procedures, and makes it easy to parallelize the protocol
processing leading to higher performance [Netr, 1990]. Furthermore, periodic exchange
of state information with blocking makes the throughput independent of the round trip
delays while reducing the processing time. In this case, only one timer is needed, at the
receiver side, and it must be adjusted only after each block time, not for each packet.
However, in interactive applications, this might not be the case because these applications
transmit a small amount of data, and latency is the key performance parameter [Port,
1991].
Although selective acknowledgement is more difficult to implement, the reduction in
retransmitted packets can offer higher retained throughput under errored or congested
conditions, and assist in alleviating network congestion. One study has shown that
selective retransmission can provide approximately twice the bandwidth of Go-back-N
algorithms under errored conditions. Results also showed that selective retransmission
combined with rate control provides a high-performance method of communication
between two devices of disparate capacity [Port, 1991].
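The sketch below is a deliberately crude, illustrative model of the difference: it counts retransmitted packets only, approximating go-back-N by resending a full window per loss and selective repeat by resending only the lost packet; it ignores timing, congestion, and window dynamics.

import random

def retransmissions(num_packets, loss_prob, selective, window=64, seed=1):
    # Count retransmitted packets under a simple independent-loss model.
    rng = random.Random(seed)
    resent = 0
    for _ in range(num_packets):
        if rng.random() < loss_prob:
            resent += 1 if selective else window
    return resent

for name, selective in (("go-back-N", False), ("selective repeat", True)):
    print(name, retransmissions(100_000, 1e-3, selective))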
4.5 High Speed Transport Protocols
The applications of high performance distributed systems (distributed computing, real-time
interactive multimedia, digital imaging repositories, and compute-intensive
visualizations) have high bandwidth and low latency requirements that are difficult to
meet using the current standard transport protocols. Researchers have extensively studied
these limitations and proposed a wide range of solutions for high speed networks. These
techniques can be broadly
grouped into software-based and hardware-based techniques (see Figure 4.9).
Figure 4.9. Classification of high speed transport protocols: software techniques (improving existing protocols, or new protocols with static, adaptive, or programmable structures) and hardware techniques (VLSI implementation, parallel processing, and host interface).
The software-based approach can be further divided into techniques that aim at improving
the implementation of existing protocols, using upcalls and better buffer management, and
techniques that assume the current protocols cannot cope with high speed networks and
their applications and therefore propose new protocols with either static or
dynamic structures (see Figure 4.9). The hardware-based approach aims at using
hardware techniques to improve protocol performance by implementing the whole
protocol using VLSI chips (e.g., XTP protocol), by applying parallel processing
techniques to reduce the processing times of protocol functions, or by off-loading all or
some of the protocol processing functions to the host network interface hardware. In this
subsection, we will briefly review the approaches used to implement high speed protocols
and discuss a few representative protocols.
4.5.1 Software-based Approach
4.5.1.1 Improve Existing Protocol
Careful implementation of the communication protocol is very critical to improve
performance. It has been argued that the implementation of the current transport
protocols and not their designs is the major source of processing overhead [Clar, 1989].
The layering approach of standard protocols suffers from redundant execution of protocol
functions at more than one layer and from excessive data copying overhead. Some
researchers suggested improving the performance of standard protocols by changing their
layered implementation approach or by reducing the overhead associated with redundant
functions and data copying. Clark proposed the upcalls approach to describe the
structure and the processing of a communication protocol; upcalls reduce the number of
context switches by making the sequence of protocol execution more logical. Other
researchers attempt to reduce data copying by providing better buffer
management schemes. Woodside et al. [Wood, 1989] proposed Buffer Cut-Through
(BCT) to share a buffer among the network layers. In the conventional method, each
layer has input and output buffers shared only with the layer above it and the layer below
it, as shown in Figure 4.10(a). In BCT, a single data buffer is shared among all layers, as
shown in Figure 4.10(b). When layer i receives a signal from layer i+1 indicating a new
packet, layer i accesses the shared buffer to read the packet, processes it, and passes the
address of the packet along with a signal to the next layer below (a minimal illustration of
this idea appears after Figure 4.10).
Figure 4.10. (a) Separate buffers between adjacent layers; (b) the Buffer Cut-Through approach, in which a single buffer is shared by all layers.
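A minimal sketch of the idea behind buffer cut-through follows: packets live in one shared pool, and the layers hand each other a handle into that pool rather than copying the payload. The class and function names are illustrative and are not taken from [Wood, 1989].

class SharedBufferPool:
    # One buffer pool shared by all layers; only packet handles move between layers.
    def __init__(self):
        self.slots = {}
        self.next_id = 0

    def put(self, payload: bytearray) -> int:
        handle = self.next_id
        self.slots[handle] = payload
        self.next_id += 1
        return handle                      # layers pass this handle around, not the data

    def get(self, handle: int) -> bytearray:
        return self.slots[handle]

def transport_layer_send(pool, handle):
    pool.get(handle)[:0] = b"T"            # prepend a (toy) transport header in place
    network_layer_send(pool, handle)       # pass the handle down; no copy of the data

def network_layer_send(pool, handle):
    pool.get(handle)[:0] = b"N"            # prepend a (toy) network header in place
    print("to driver:", bytes(pool.get(handle)))

pool = SharedBufferPool()
handle = pool.put(bytearray(b"application data"))
transport_layer_send(pool, handle)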
Another example is the Integrated Layer Processing (ILP) [Clar, 1990; Brau, 1995; Abbo,
1993]. In ILP, the data manipulations of the different layers are combined into a single
pass over the data, instead of loading and storing the data for every manipulation required
at each layer. This reduces the overhead associated with data copying significantly.
X-Kernel [Hutc, 1991; Mall, 1992] is a communication-oriented operating system that
provides support for efficient implementation of standard protocols. X-Kernel uses
upcalls and an improved buffer management scheme to drastically reduce the amount of
data copying required to process protocol functions. In this approach, a protocol is
defined as a specification of abstraction through which a collection of entities exchanges
messages. The X-Kernel provides three primitive communication objects: protocols,
sessions, and messages. A protocol object corresponds to a network protocol (e.g.,
TCP/IP, UDP/IP, TP4), where the relationships between protocols are defined at the
kernel configuration time. A session object is a dynamically created instance of a
protocol object that contains the data structures representing the network connection state
information. Messages are active objects that move through the session and protocol
objects in the kernel. To reduce the context switching overhead, a process executing in
the user mode is allowed to change to the kernel mode (this corresponds to making a
system call) and a process executing in the kernel mode is allowed to invoke a user-level
function (this corresponds to an upcall). In addition, X-Kernel provides a set of efficient
routines that include buffer management routines, mapping and binding routines, and
event management routines.
4.5.1.2 New Transport Protocols
The layered approach to implement communication protocols has an inherent
disadvantage for high speed communication networks. This occurs because of the
replication of functions in different layers (e.g., error control functions are performed
both by data link and transport protocol layers), high overhead for control messages, and
inability to efficiently apply parallel processing techniques to reduce protocol processing
time. Furthermore, the layered approach produces an optimal implementation for each
layer rather than producing an overall optimal implementation of all the layers. Many
new protocols were introduced to address the limitations of standard protocols. These
approaches tend to be optimized for certain classes of applications. In what follows, we
discuss three representative research protocols: the Network Block Transfer (NETBLT)
protocol, the Versatile Message Transaction Protocol (VMTP), and the Xpress Transport
Protocol (XTP).
Network Block Transfer (NETBLT)
NETBLT is a transport protocol [Clar, 1987] intended for the efficient transfer of bulk
data. The main features of NETBLT are as follows:
Connection Management: NETBLT is a connection-oriented protocol. The sending and
receiving ends are synchronized on a buffer level. The per-buffer interaction is more
efficient than per-packet, especially in bulk data transfer applications over networks with
long delay channels. During data transfer, the transmitting unit breaks the buffer into a
sequence of packets. When the whole buffer is completely received, the receiving station
acknowledges it so that the transmitting node can move on to transmit the next buffer and
so on until it completes transmitting all the data. To reduce the overhead in synchronizing
the transmitting and receiving states, NETBLT maintains a timer at the receiving side.
The estimated time to transmit a buffer is used to initialize the control timer at the
receiving station. The timer is started when the first packet of a buffer is received and is
cleared when the whole buffer has been received.
Acknowledgement: Timer-based acknowledgement techniques are costly to maintain, and
appropriate timeout intervals are difficult to determine. NETBLT therefore minimizes the
use of timers and the overhead associated with them, and places its timer at the receiver
end. Furthermore, NETBLT uses selective acknowledgement to synchronize the states of
the transmitter and receiver; in this scheme, NETBLT acknowledges the receipt of buffers
(large data blocks) rather than individual small packets, which reduces the
acknowledgement overhead.
Flow Control: NETBLT adopts rate-based flow control. Unlike window-based flow
control, rate control works independently of the network round-trip delay and of error
recovery. The sender transmits data at the currently acceptable transmission rate, as
determined by the capacity of the network and the receiver to handle incoming traffic; if
the network is congested, the rate is decreased. There is also no separate error-recovery
path outside the standard flow-control mechanism, since NETBLT places retransmitted
data in the same queue as new data, so all data leave the queue at the current transmission
rate. Moreover, rate control reduces reliance on timers and allows timeouts to be estimated
more accurately, since the retransmission timer is based on the current transmission rate
rather than on the round-trip delay (a minimal sketch of rate-based pacing appears at the
end of this protocol description).
Error Handling: NETBLT reduces the recovery time by placing the retransmission
timer at the receiver side. The receiver timer can easily determine the appropriate timeout
and which packets need to be retransmitted; the receiver can accurately estimate the
timeout period because it knows the transmission rate and the expected number of
packets based on the size of the buffer allocated at the receiver side [Clark, 1988]. The
sender and receiver synchronize their connection states and recover from errors via three
control messages (GO, RESEND, and OK) combined with a selective repeat mechanism.
Although control messages are transmitted more reliably than data, a control message will
occasionally get lost; consequently, NETBLT maintains a control-message retransmit
timer based on the network round-trip delay.
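The following sketch illustrates rate-based pacing of the kind NETBLT relies on: the sender spaces transmissions so the long-run rate stays at or below a negotiated value, independently of the round-trip delay. The function names and rate value are illustrative, not part of the NETBLT specification.

import time

def paced_send(packets, rate_bytes_per_sec, send):
    # Transmit packets so that the average rate never exceeds rate_bytes_per_sec.
    # Retransmitted packets would simply be appended to `packets` and leave the
    # queue at the same rate as new data.
    next_time = time.monotonic()
    for pkt in packets:
        now = time.monotonic()
        if now < next_time:
            time.sleep(next_time - now)                  # wait for this packet's slot
        send(pkt)
        next_time += len(pkt) / rate_bytes_per_sec       # schedule the next packet

# Illustrative use: 1 KB packets at 1 MB/s, "sending" by printing a timestamp
paced_send([b"x" * 1024] * 5, 1_000_000,
           lambda p: print(round(time.monotonic(), 4), len(p)))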
Versatile Message Transaction Protocol (VMTP)
VMTP is a transport protocol designed to support remote procedure calls, group
communications, multicast and real-time communications [Cher, 1989; Cher, 1986]. The
main features of this protocol can be outlined as follows:
Connection Management: VMTP is a request-response protocol designed to support
transaction-based communication. Since most of its data units or transactions are small,
VMTP uses implicit connection setup: the first packet both establishes the connection and
carries data. VMTP provides a streaming mode that can be used for file transfer,
conversation support for higher-level modules, stable addressing, and message
transactions. The advantages of using transactions are higher-level conversation support,
minimal packet exchange, ease of use, and shared communication code.
Acknowledgment: VMTP uses selective acknowledgement to avoid the overhead of
retransmitting correctly received packets. It reduces the cost of processing
acknowledgement packets by using bit masking: the bit mask provides a simple,
fixed-length way of specifying which packets were received as well as indicating the
position of a packet within the packet stream (a small illustration of the bit-mask idea
appears at the end of this protocol description).
Error Handling: VMTP employs different techniques to deal with duplicates, allowing
the most efficient technique (duplicate request packets, duplicate response packets,
multiple response to a group message transaction and idempotent transactions) to be used
in different situations. In contrast, TCP requires the use of a 3-way handshake for each
circuit setup and circuit tear-down to deal with delayed duplicates.
Flow Control: VMTP uses rate based flow control applied to a group of packets. In this
case, the transmitter sends a group of packets as one operation. Then, the receiver accepts
and acknowledges this group of packets as one unit before further data is exchanged. This
packet group approach simplifies the protocol and provides an efficient flow control
mechanism.
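The sketch below illustrates the bit-mask idea in isolation (it is not VMTP's actual packet layout): each bit of a fixed-length mask records whether the packet at that position in the group arrived, so a single acknowledgment describes the whole group.

def build_ack_mask(received_positions):
    # Receiver side: set bit i if the i-th packet of the group was received.
    mask = 0
    for i in received_positions:
        mask |= 1 << i
    return mask

def missing_positions(mask, group_size):
    # Sender side: read the fixed-length mask and retransmit only the gaps.
    return [i for i in range(group_size) if not (mask >> i) & 1]

mask = build_ack_mask({0, 1, 2, 4, 5, 7})          # packets 3 and 6 were lost
print(f"ack mask = {mask:#04x}, retransmit {missing_positions(mask, 8)}")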
Xpress Transport Protocol (XTP)
XTP is a communication protocol designed to support a wide range of applications
ranging from distributed computing applications to real-time multimedia applications
[Prot, 1989; Ches, 1987a; Ches, 1987b]. XTP provides all the functionality supported by
standard transport protocols (TCP, UDP and ISO TP4) plus new services such as
multicasting, multicast group management, priority, quality of service support, rate and
burst control, and selectable error and flow control mechanisms. XTP architecture
supports efficient VLSI and parallel processing of the protocol functions.
Connection Management: XTP supports three types of connection management:
1) connection-oriented service, 2) connectionless service, and 3) fast negative
acknowledgement connection.
The XTP connection oriented service provides efficient reliable packet transmission. For
example, the TCP and ISO TP4 reliable packet transmission requires exchanging six
packets (two for setup and acknowledgement of the connection, two to send and
acknowledge data transmission, and two to close and acknowledge the release of the
connection). XTP uses three packets instead of six because of the use of implicit
connection establishment and release mechanisms.
The XTP connectionless service is similar to the UDP service that is a best effort delivery
service; the receiver does not acknowledge receiving the packet and the transmitter never
knows whether or not the transmitted packet is delivered properly to its destination. The
third option, the fast negative acknowledgement connection, can lead to fast error
recovery in special scenarios; for example, a receiver that recognizes that packets are
arriving out of sequence can notify the transmitter immediately rather than waiting for a
timeout to trigger the report.
Flow Control: XTP Flow control allows the receiver to inform the sender about the state
of the receiving buffers. XTP supports three mechanisms to implement flow control: 1)
Window based flow control that is used for normal data transmission; 2) Reservation
mode in order to guarantee that data will not be lost due to buffer starvation at the
receiver side; and 3) No flow control that disables the flow control mechanism. The last
mechanism might be useful for multimedia applications.
XTP supports rate-based as well as window-based flow control. XTP uses rate control to
restrict the size and time spacing of bursts of data from the sender. That is, within any
small time period, the number of bytes that the sender transmits must not exceed the
ability of the receiver (or intermediate routers) to process (decipher and queue) the data.
There are two parameters (RATE and BURST) with which the receiver can tune the data
transmission rate to an acceptable level. Before the first control packet arrives at the
transmitter, the transmitter sends packets at a default rate. The RATE parameter limits the amount of
data that can be transmitted per time unit, while the BURST parameter limits the size of
the data that can be sent.
Error Handling: XTP supports several mechanisms for error control such as go-back-n
and selective retransmission. TCP responds to errors by using a go-back-n algorithm that
may work in a local area network but degrades performance significantly in high speed
and/or high latency networks (e.g., satellite links). XTP supports selective
retransmission, in addition to go-back-n, in which the receiver acknowledges spans of
correctly received data and the sender retransmits the packets in the gaps.
To speed up the processing of packet checksums, the checksum parameters are placed in
the common trailer in XTP (unlike TCP whose checksum parameters are placed in front
of the information segment).
4.5.1.3 Adaptable Transport Protocols
The existing standard transport protocols tend to be statically configured; that is, they
define one algorithm to implement each protocol mechanism (flow control, error control,
acknowledgement, and connection management). However, the next generation of
network applications will have widely varying Quality of Service (QOS) requirements.
For example, data applications require transmission with no error and have no time
constraints, while real-time multimedia applications require high bandwidth coupled with
moderate delay, jitter, and some tolerance of transmission errors. Furthermore, the
characteristics of networks change dynamically during the execution of applications.
These factors suggest an increasingly important requirement on the design of next
generation high speed transport protocols to be flexible and adaptive to the changes in
network characteristics as well as to application requirements. In adaptive transport
protocol approach, the configuration of the protocol can be modified to meet the
requirements of the applications. For example, bulk data transfer applications run
efficiently if the protocol implements an explicit connection establishment mechanism,
whereas transaction-based applications call for implicit connection management, because
an explicit connection establishment would impose an intolerable overhead. Many
techniques have been proposed to implement adaptive communication protocols [Dosh,
1992; Hosc, 1993; Hutc, 1989]. There are two methods to make a communication
protocol adaptive: 1) the various protocol mechanisms can be optionally configured and
incorporated during connection establishment by using genuine dynamic linking [Wils,
1991]; and 2) change the protocol mechanisms during the operation so protocols can
optimize their operations for any instantaneous changes in network conditions and
application QOS requirements.
Many researchers have proposed protocols that are functionally decomposed into a set of
modules (or building blocks) that can be configured dynamically to meet the QOS
requirements of a given application. In [Eyas, 1996; Eyas, 2002 ], a framework in which
communication protocol configurations can be dynamically changed to meet applications
requirements and network characteristics is presented. In this framework, a
communication protocol is represented as a set of protocol functions, where each protocol
function can be implemented using one (or more) of its corresponding protocol
mechanisms. For example, the flow control function can be implemented using a
window-based or a rate-based mechanism. Figure 4.11 describes the process of
constructing communication protocol configurations using this framework. The user
either selects one of the pre-defined protocol configurations (e.g., TCP/IP) or customizes
a protocol configuration. All the protocol mechanisms are stored in the Protocol Function
DataBase (PFDB). Using the provided set of user-interface primitives, the user can
program the required protocol configuration by specifying the appropriate set of protocol
mechanisms.
E. Al-Hajery and S. Hariri, “Application-Oriented Communication Protocols for High-Speed
Networks,” International Journal of Computers and Applications, Vol. 9, No. 2, 2002, pp. 90-101.
Figure 4.11. Adaptable Communication Protocol Framework: user service parameters select either a predefined or a tailored protocol; the Protocol Generation Unit draws on the Protocol Function Data Base (PFDB) and a network monitor to produce the protocol configuration specifications, and the Protocol Implementation Unit maps them, together with hardware specifications, onto the hardware platform.
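A minimal sketch of how a framework of this kind might compose a protocol configuration from a database of mechanisms is given below; the mechanism names mirror the text, but the data structures and the configure() call are illustrative, not the actual PFDB interface.

# Illustrative Protocol Function DataBase: each protocol function maps to the
# set of mechanisms that can implement it.
PFDB = {
    "connection":      {"explicit", "implicit"},
    "flow_control":    {"window", "rate"},
    "acknowledgement": {"cumulative", "selective", "block"},
    "error_control":   {"go_back_n", "selective_retransmit"},
}

# Illustrative predefined configurations; a user may also tailor a new one.
PREDEFINED = {
    "bulk_transfer": {"connection": "explicit", "flow_control": "rate",
                      "acknowledgement": "block", "error_control": "selective_retransmit"},
    "transaction":   {"connection": "implicit", "flow_control": "rate",
                      "acknowledgement": "selective", "error_control": "selective_retransmit"},
}

def configure(choices):
    # Validate a tailored configuration against the PFDB before generating it.
    for function, mechanism in choices.items():
        if mechanism not in PFDB.get(function, set()):
            raise ValueError(f"unknown mechanism {mechanism!r} for {function!r}")
    return choices

print(configure(PREDEFINED["transaction"]))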
Furthermore, a flexible hardware implementation is used to realize efficiently the
communication protocols configured with this framework. Figure 4.12 illustrates an
example of three adaptable protocols, each configured to meet the requirements of one
class of applications. In Protocol 1, the configuration contains explicit connection
establishment and release, Negative Acknowledgment (NACK) error reporting, rate-based
flow control, and periodic
synchronization. In a similar manner, Protocols 2 and 3 are configured with different
types of protocol mechanisms.
Figure 4.12. A network-based implementation of the adaptable protocol framework: each protocol configuration selects one mechanism per function at the transmitter and receiver, such as explicit (E) or implicit (I) connection management, NACK or PACK acknowledgment, go-back-n error control, rate-based or window-based flow control, and periodic or transaction-based synchronization.
4.5.2 Hardware-based Approach
The main advantage of this approach is that it results in fast protocol implementations; its
main limitation is cost. The hardware-based approach can be further classified into three
types: the programmable network approach, in which protocol functions are implemented
by the network devices themselves, the Parallel
Processing approach in which multiple processors are used to implement the protocol
functions, and high–speed network interface approach in which some or all of the
protocol functions are off-loaded from the host to run on that interface hardware.
4.5.2.1 Programmable Network Methodologies
There are two main techniques that have been aggressively pursued to achieve network
programmability: 1) Open Interface, an approach spearheaded by the Opensig
community, and 2) Active Network (AN), an approach established by DARPA, which
funded several projects that address programmability at the network, middleware, and
application levels [CAMP, 2001; CLAV, 1998].
Open Interface:
This approach models the communication hardware using a set of open programmable
network interfaces that provide open access to switches and routers through well-defined
Application Programming Interfaces (APIs). These open interfaces provide access to the
node resources and allow third party software to manipulate or reprogram them to
support the required services. Consequently, new services can be created on top of what
is provided by the underlying hardware system as shown in Figure 4.13.
Figure 4.13. The Open Interface approach: an open interface exposes the functionality of a Network Element (NE) to the outside world, separating the control algorithms from the node resources.
Recently, the IEEE Project 1520 [DENA, 2001] has been launched to standardize
programmable interfaces for network control and signaling for a wide range of network
nodes, ranging from ATM switches and IP routers to mobile telecommunications network
nodes. The P1520 Reference Model (RM) is shown in Figure 4.14 along with the
functions of each of its four layers. The interfaces support a layered design approach,
where each layer offers services to the layer above while using the components below to
build those services. Each level comprises a number of entities
in the form of algorithms or objects representing logical or physical resources depending
on the level’s scope and functionality. The P1520 model only defines the interfaces
leaving the actual implementation and protocol specific details to the vendor/
manufacturer.
Figure 4.14. IEEE P1520 Reference Model: the V interface gives users access to algorithms for value-added communication services created by network operators, users, and third parties (Value-added Services Level); the U interface covers algorithms for routing, connection management, and directory services (Network Generic Services Level); the L interface exposes the virtual network device, a software representation of the node (Virtual Network Device Level); and the CCM interface reaches the physical elements, i.e., the hardware and name space (PE Level).
The reference model supports the following interfaces:
1) V interface – it provides access to the value-added services level
2) U interfaces – it deals with generic network services
3) L interface – it defines the API to directly access and manipulate local device
network resource states
4) Connection Control and Management (CCM) interface- it is a collection of
protocols (e.g., GSMP) that enable the exchange of state and control information
between a device and an external agent.
Active Network (AN)
This approach adopts the dynamic deployment of new services at runtime. In this
approach, code mobility (referred to as active packets) represents the main mechanism
for program delivery, control and service construction as shown in Figure 4.15. Packets
are the means for delivering code to the remote nodes and form the basis for network
control. Packets may contain executable code or instructions for execution of a particular
program at the destination node. Depending on the application, the packets may carry
data along with the instructions, and the destination node processes the attached data
according to those instructions. Usually the first packet(s) of a flow carry only the
instructions and the subsequent packets carry the data; the node at the other end processes
the data packets based on the instructions from the first packet(s). Packets that carry
instructions are popularly referred to as mobile agents.
Active networks provide greater flexibility in deploying new services in a network, but
they also require stronger security and authentication. Before executing the code, the
remote node needs to authenticate the user and its privileges, and to check that the new
operation is feasible under the current node conditions and will not affect the existing
processes running on the node.
Figure 4.15. The Active Network approach: active packets deliver code to the active device, providing the flexibility to modify the behavior of the network element (e.g., active signalling) using its processing capacity and resources.
The active networks approach allows customization of network services at the packet
transport level, rather than through a programmable control interface. Active networks
provide maximum flexibility, but they also add considerable complexity to the
programming model.
4.5.2.2 Host-Network Interface
Every packet being transmitted or received goes through the host-network interface. The
design of host network interface plays an important role in maximizing the utilization of
the high bandwidth offered by high-speed networks. Recently, there has been an
intensified research effort to investigate the limitations of current host network interface
designs and propose new architectures to achieve high performance network interfaces
[Davi, 1994; Stee, 1992; Rama, 1992].
In a conventional network interface, shown in Figure 4.13, the system bus is used
extensively during data transfers. The host processor reads data from the user buffer and
writes it to the kernel buffer. Protocol processing is performed, and then the checksum is calculated
per-byte by reading from the kernel buffer. Finally, the data is moved to the network
interface buffer from the kernel buffer. In this host-network interface, every word being
transmitted or received will cross the system bus six times.
Figure 4.13. Bus-based host network interface: (1) the application writes data to the user buffer; (2, 3) the host processor moves the data from the user buffer to the kernel buffer; (4) the data is read again to calculate the checksum; (5, 6) the data is moved from the kernel buffer to the network buffer.
Figure 4.14. DMA-based host network interface: DMA moves data between the kernel buffer and the network buffer.
It is apparent that the traditional network interface involves a large number of data
transfers, which throttles the speed of data transmission. This was not a problem when the
approach was first introduced in the early 1970s, because the processor was three orders
of magnitude faster than the network. The situation is now reversed, and the network is
much faster than the processor at handling packets. In order to alleviate this overhead,
data copying should be minimized. Figure 4.14 shows an improved host-network
interface in which Direct Memory Access (DMA) is used to move data between the
kernel buffer and the network buffer. The checksum can be calculated on the fly by
additional hardware circuitry while the data is transferred. This makes the total number of
data transfers equal to four, in contrast to six with the previous interface. Also,
checksum calculation is done in parallel with data transfer. This introduces some
additional per-packet overheads such as setting up the DMA controller. However, for
large data transfers, this additional overhead is insignificant when compared to the
performance gain achieved in reducing the number of data transfers.
The performance of host-network interface can be improved further by sharing the
network buffer with the host computer as shown in Figure 4.15 [Bank, 1993]. In this
design, data is copied directly from the user buffer to the network buffer (we refer to this
design as the zero-copying host network interface), and the checksum is computed on the
fly while the data is copied; a small single-pass sketch of this idea appears after Figure
4.15. This design reduces the number of data transfers to only two. The network buffer
should be large enough to allow protocol processing and to keep a copy of the data until it
is properly transmitted and acknowledged.
Figure 4.15. Zero copying host network interface: DMA copies data directly from the user buffer into a buffer shared between the host and the network interface.
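The sketch below shows, in software, the effect of combining the copy with the checksum: the data is traversed once, being copied and summed in the same pass, instead of being copied and then read a second time. The function is illustrative; in the designs above this work is done by DMA hardware with a checksum circuit on the data path.

def copy_and_checksum(src: bytes, dst: bytearray) -> int:
    # Single pass over the data: copy each 16-bit word into the destination
    # buffer and accumulate the one's-complement sum at the same time.
    total = 0
    for i in range(0, len(src) - 1, 2):
        dst[i:i + 2] = src[i:i + 2]
        total += (src[i] << 8) | src[i + 1]
        total = (total & 0xFFFF) + (total >> 16)
    if len(src) % 2:                       # trailing odd byte, padded with zero
        dst[len(src) - 1] = src[-1]
        total += src[-1] << 8
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

payload = b"data handed from the user buffer to the network buffer"
network_buffer = bytearray(len(payload))
print(hex(copy_and_checksum(payload, network_buffer)), network_buffer == bytearray(payload))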
In order to reduce the processing burden on the host, transport and network protocols can
be offloaded from the host to a host-network interface processor, as shown in Figure 4.16
[Macl, 1991; Jain, 1990b; Kana, 1988]. There are several advantages of offloading
protocol processing to the host-network interface such as reducing the load on the host
computer, making the application processing on the host deterministic since protocol
functions are executed outside the host, and eliminating the per-packet interrupt overhead
that can overwhelm the processing of packet transmission and reception (e.g., the host
can be interrupted only when a block of packets is received).
Figure 4.16. Off-loading protocol processing to the host network interface: a general-purpose protocol processor with its own protocol buffer resides on the network interface, and DMA moves data between the user buffer, the protocol buffer, and the network buffer.
4.5.2.3 Parallel Processing of Protocol Functions
The use of parallelism in protocol implementation is a viable approach to enhance the
performance of protocol processing [Ito, 1993; Brau, 1993; Ruts, 1992]. Since protocol
processing and memory bandwidth are major bottlenecks in high performance
communication subsystems, the use of multiprocessor platforms is a desirable approach
to increase their performance. However, the protocol functions must be adequately
partitioned in order to utilize multiprocessor systems efficiently. There are different types
and levels of parallelism that can be used to implement protocol functions. These types
are typically classified according to the granularity of the parallelism unit (coarse grain,
medium grain, and fine grain) [Jain, 1990a]. A parallelism unit can comprise a complete
stack, a protocol entity, or a protocol function. Three types of parallelism can be
employed in processing protocol functions [Zitt, 1994]: 1) spatial parallelism which is
further divided into Single Instruction Multiple Data (SIMD)-like parallelism and
Multiple Instruction Single Data (MISD)-like parallelism, 2) temporal parallelism, and 3)
hybrid parallelism. Next, we describe the mechanisms used to implement each of these
types of parallelism.
SIMD-Like Parallelism
In this type of parallelism, identical operations are concurrently applied to different data
units (e.g., packets). Scheduling mechanisms (e.g., round-robin) may be employed to
allocate packets to different processing units. An SIMD organization requires only
minimal synchronization among the processing units; however, it does not decrease the
processing time for a single packet, it only increases the number of packets processed
during a given time interval. Packets can be scheduled on a per-connection basis, in which
case parallelism takes place among different connections, or on a per-packet basis,
independent of their connection association. The synchronization overhead of per-packet
scheduling is higher than that of per-connection scheduling.
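A minimal sketch of per-connection scheduling follows: packets are mapped to processing units by hashing their connection identifier, so all packets of one connection are handled by the same unit and cross-unit synchronization stays low. The field names and the number of units are illustrative.

NUM_UNITS = 4   # illustrative number of protocol-processing units

def unit_for(packet):
    # The 4-tuple identifies the connection; every packet of that connection
    # hashes to the same processing unit.
    connection_id = (packet["src"], packet["sport"], packet["dst"], packet["dport"])
    return hash(connection_id) % NUM_UNITS

packets = [
    {"src": "10.0.0.1", "sport": 4000, "dst": "10.0.0.9", "dport": 80, "seq": 1},
    {"src": "10.0.0.1", "sport": 4000, "dst": "10.0.0.9", "dport": 80, "seq": 2},
    {"src": "10.0.0.2", "sport": 5000, "dst": "10.0.0.9", "dport": 80, "seq": 1},
]
for p in packets:
    print("packet", (p["src"], p["sport"], p["seq"]), "-> unit", unit_for(p))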
Jain et al. [Jain, 1990b; Jain, 1992] have proposed a per-packet parallel implementation of
protocols, in which the ISO TP4 transport protocol is implemented on a multiprocessor
architecture, as shown in Figure 4.17. The main objective of this architecture is to handle
data transfer rates in excess of a gigabit per second. The Input and Output
Low Level Processors (ILLP and OLLP) handle I/O, CRC, framing, packet transfer into
and out of the memory, and low level protocol processing. The multiprocessor pool
(MPP) handles the transport protocol processing. The HIP (see Figure 4.17) is a high
speed DMA controller to transfer packets to and from host.
Figure 4.17. Multiprocessor implementation of the TP4 transport protocol: the Input Low Level Processor (ILLP) and Output Low Level Processor (OLLP) connect the incoming and outgoing lines to a processor pool (P1, P2, ..., Pn), and the Host Interface Processor (HIP) connects the pool to the host bus.
MISD-Like Parallelism
In this type of parallelism, different tasks are applied concurrently on the same data unit.
This type reduces the processing time required for a single packet since multiple tasks are
processed concurrently. However, a high degree of synchronization among the
processing units may be required. A protocol can be decomposed into different protocol
functions that can be applied concurrently to the same data unit; this is called per-function
parallelism and it incurs the highest synchronization
overhead. A parallel implementation of ISO TP4 has been realized on a transputer
network [Zitt, 1989]. A transputer is a 32-bit single chip microprocessor with four
bidirectional serial links that are used to build transputer networks. Communication
between processes is synchronized via the exchange of messages. The protocol is
decomposed into a set of functions that are also decomposed into a send and a receive
path. Protocol functions are then mapped to different transputers.
Temporal Parallelism
This type of parallelism operates on the layered model of the network architecture using
pipeline approach as shown in Figure 4.18. To achieve pipelining, the processing task has
to be subdivided into a sequence of subtasks, each mapped onto a different pipeline stage.
In this way, different pipeline stages process different data at the same point in time. Pipelining
does not decrease the processing time needed for a single packet. Since the performance
of a pipeline is limited by its slowest stage, stage balancing is an important issue to
increase the system throughput. The protocol stack can also be divided into a send and
receive pipelines. Control information that describes the connection parameters (e.g., next
expected acknowledgment and available window) is shared between the two pipelines.
Figure 4.18. Temporal parallelism based on the pipeline approach: a send pipeline and a receive pipeline each pass packets through the protocol layers (layer 1, layer 2, ..., layer n), with connection state shared between the two pipelines.
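A minimal sketch of the pipeline idea follows: each layer runs as its own stage connected by queues, so different packets are being processed by different layers at the same time. The stage functions are placeholders for real per-layer processing.

import queue
import threading

def stage(work, inbox, outbox):
    # One pipeline stage: take a packet, apply this layer's processing, pass it on.
    while True:
        pkt = inbox.get()
        if pkt is None:                 # shutdown marker propagates down the pipeline
            outbox.put(None)
            return
        outbox.put(work(pkt))

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(lambda p: "T|" + p, q1, q2)).start()   # transport stage
threading.Thread(target=stage, args=(lambda p: "N|" + p, q2, q3)).start()   # network stage

for i in range(3):
    q1.put(f"pkt{i}")                   # new packets enter while earlier ones are deeper in
q1.put(None)

while (out := q3.get()) is not None:
    print("to driver:", out)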
Hybrid Parallelism
This type of parallelism corresponds to a mix of more than one parallelism type on the
same architecture. For example, multiple processing units can process different packets
concurrently (using connection-level parallelism), while each processing unit is itself
organized as a pipeline.
Summary
New technology has made high speed networks capable of operating at terabits per second
(Tbps). This has enabled distributed interactive multimedia applications that transfer
video, voice, and data within a single application, with each media type having different
quality of service requirements. In this chapter, we highlighted the techniques proposed to
improve the implementation of standard transport protocols, to introduce new adaptive
protocols, or to implement protocols in specialized hardware. We discussed the main
functions that must be performed by any transport protocol and identified the main
problems that prevent the standard transport protocols (e.g., TCP, UDP) from fully
exploiting high speed networks and serving the emerging parallel and distributed
applications. We also classified the techniques used to develop high speed communication
protocols into two broad types: hardware-based and software-based approaches. In the
software approach, one can either improve existing protocols or introduce new protocols
with static or dynamic structures. In the hardware approach, protocol processing can be
offloaded either to special-purpose hardware or to the host-network interface. High
performance host network interface designs reduce the overhead associated with timers
and data copying.
Problems
1) Briefly describe the main transport protocol functions and the specific tasks each
performs.
2) What are the advantages and disadvantages of the TCP/IP protocols?
3) Briefly describe the problems of the current standard protocols and propose your
own solutions.
4) What are the solutions proposed in the book to solve the current protocols'
problems? Discuss and compare them with your answers to problem 3.
References
[Jain90a] N. Jain, M. Schwartz and T. R. Bashkow, "Transport Protocol Processing at
GBPS," Proceedings of the SIGCOMM Symposium on Communications Architecture
and Protocols, pp. 188-198, August 1990.
[Tane88] A. S. Tanenbaum, Computer Networks, 2nd Edition, Prentice-Hall, 1988.
[Port91] T. La Porta and M. Schwartz, "Architectures, Features, and Implementation of High-Speed Transport Protocols," IEEE Network Magazine, May 1991.
[Netr90] N. Netravali, W. D. Roome, and K. Sabnani, "Design and Implementation of a High-Speed Transport Protocol," IEEE Transactions on Communications, Nov. 1990.
[Clar87] D. D. Clark, M. L. Lambert, L. Zhang, " NETBLT: A high Throughput
Transport Protocol," Proceedings ACM SIGCOMM'87 Symp., vol. 17, no. 5, 1987.
[Clar89] D.Clark, V.Jacobson, J.Romkey, and H. Salwen, “An Analysis of TCP
Processing Overhead,” IEEE Communications Magazine, pp. 23-29, 1989
[Part90] C. Partridge, " How Slow is One Gigabit Per Second," in Symp.on Applied
Computing, 1990.
[Coul88] G. F. Coulouries, J. Dollimore, " Distributed Systems: Concepts and Design,"
Addison-Wesley Publishing Company Inc., 1988.
[Doer90] A. Doeringer, D. Dykeman, M. Kaiserswerth, B. Meister, H. Rudin, and R.
Williamson, “A Survey of Light-Weight Transport Protocols for High-Speed Networks, “
IEEE Transactions on Communications, Vol. 11, No. 11, November 1990, pp 2025-2038.
[Bux88] W. Bux, P. Kermani, and W. Kleinoeder, “Performance of an improved data link
control protocol,” Proceedings ICCC’88, October 1988.
[Bank93] D. Banks, and M. Prudence, “A High-Performance Network Architecture for a
PA-RISC Workstation”, IEEE Journal on Selected Areas in Commu. Vol 11, No.2,
pp191-202, 1993
[Ito93] M. Ito et al “A Multiprocessor Approach for Meeting the Processing Requirement
for OSI”, IEEE Journal on Selected Areas in Commun., Vol 11, No., 2, pp. 220-227,
1993
[Macl91] R. Maclean and S. Barvick, “An Outboard Processor for High Performance
Implementation of Transport Layer Protocols,” Proceedings of GLOBECOM’91, pp. 1728-1732
[Hutc91] N. Hutchinson and L. Peterson, “The x-Kernel: An Architecture for
Implementing Network Protocols,” IEEE Trans. on Software Eng., Vol. 17, No. 1, pp.
64-75, 1991
[Cher89] D. Cheriton and C. Williamson, “VMTP as the Transport Layer for High-performance Distributed Systems,” IEEE Commun. Magazine, June 1989
[Cher86] D.Cheriton, “VMTP: A Protocol for the Next Generation of Communication
Systems,” Proc. Of ACM SIGCOMM, 1986
[Wood89] C.M.Woodside, J.R.Montealegre, “The Effect of Buffering Strategies on
Protocol Execution Performance,” IEEE Trans. On Communications, Vol.37, No.6,
pp.545-553, 1989
[Davi94] B.Davie et al, “Host Interfaces for ATM Networks,” in High Performance
Networks: Frontiers and Experience, Edited by A. Tantawy, Kluwer Academic
Publishers, 1994
[Stee92] P.Steenkiste et al., “A Host Interface Architecture for High-Speed Networks,”
Proc. IFIP Conf. On High-Performance Networks, 1992
[Rama92] K.K.Ramakrishnan, “Performance Issues in the Design of Network Interfaces
for High-speed Networks,” Proc. IEEE Workshop on the Arch. and Implementation of
High Performance Commun. Subsystems, 1992
[Jain90b] N.Jain et al., “Transport Protocol Processing at GBPS Rates,” Proc. ACM
SIGCOMM, August 1990
[Kana88] H. Kanakia and D. Cheriton, “The VMP Network Adapter Board (NAB): High-performance Network Communication for Multiprocessors,” Proc. ACM SIGCOMM,
August 1988
[Brau93] T.Braun and C.Schmidt , “Implementation of a Parallel Transport Subsystem on
a Multiprocessor Architecture,” Proc. High-performance Distributed Comp. Symp, July
1993
[Ruts92] E.Rutsche and M.Kaiserswerth, “TCP/IP on the Parallel Protocol Engine,” Proc.
IFIP fourth International Conference on High Performance Networking, Dec. 1992
[Zitt94] M.Zitterbart, “Parallelism in Communication Subsystems,” in High Performance
Networks: Frontiers and Experience, Edited by A.Tantawy, Kluwer Academic
Publishers, 1994
[Jain92] N.Jain et al., “ A Parallel Processing Architecture for GBPS Throughput in
Transport Protocols,” Proc. International Conference on Communication, 1992
[Zitt89] M.Zitterbart, “High-Speed Protocol Implementations based on a Multiprocessor
Architecture,” in H.Rudin and R.Williamson(editors), Protocols for High-Speed
Networks, Elsevier Science Publishers, 1989
[Clar90]D.Clark and D.Tennenhouse, “Architecture Considerations for a New Generation
of Protocols,”, Proc. ACM SIGCOMM Symp., 1990
[Brau95] T.Braun and C. Diot, “Protocol Implementation Using Integrated Layer
Processing,” Proc. ACM SIGCOMM, 1995
[Abbo93] M.Abbott and L.Peterson, “Increasing Network Throughput by Integrating
Protocol Layers,” IEEE/ACM Trans. On Networking, Vol. 1, No. 5, pp.600-610, Oct.
1993
[Mall92] S.O’Malley and L. Peterson, “A Dynamic Network Architecture,” ACM Trans.
Computer Systems, Vol. 10, No. 2, pp.110-143, May 1992
[Eyas96] Eyas Al-Hajery, “Application-Oriented Communication Protocols for High-Speed Networks,” Doctoral Dissertation, Syracuse University, 1996
[Dosh92] B.T.Doshi and P.K.Johri, “Communication Protocols for High Speed Packet
Networks,” Computer Networks and ISDN Systems, No.24, pp.243-273, 1992
[Hosc93] P. Hoschka, “Towards Tailoring Protocols to Application Specific
Requirements,” INFOCOM’93, pp. 647-653, 1993
[Hutc89] N. C. Hutchinson, et al., “Tools for Implementing Network Protocols,” Software-Practice and Experience, 19(9), pp. 895-916, 1989
[Wils91] W. Wilson Ho, “Dld: A Dynamic Link/Unlink Editor,” Version 3.2.3, 1991.
Available by FTP as /pub/gnu/did-3.2.3.tar.Z from metro.ucc.su.oz.au
[Prot89] “XTP Protocol Definition Revision 3.4”, Protocol Engines, Incorporated, 1900
State Street, Suite D, Santa Barbare, Califorrnia, 1989
[Ches87a] G. Chesson, “The Protocol Entine Project,” UNIX Review, Vol.5, No. 9, Sept.
1987
[Ches87b] G. Chesson, “Protocol Engine Design,” USENIX Conference Proceedings,
Phoenix, Arizona, June 1987
[DENA 01]
Spryros Denazis, et al, Designing IP Router L-Interfaces, IP
Subworking Group IEEE P1520, Document No. P1520/TS/IP-005.
126
[CAMP 01]
Andrew Campbell, et al, A Survey of Programmable Networks,
COMET Group, Columbia University.
[CLAV 98]
K. Clavert, et al, Architectural Framework for Active Networks,
Active Networks Working Group Draft, July 1998.
127
High Performance Distributed Computing
Chapter 5
Distributed Operating Systems
Objective of this Chapter
The main objective of this chapter is to review the main design issues of distributed
operating systems and discuss representative distributed operating systems as case studies.
Further detailed description and analysis of distributed operating system designs and
issues can be found in textbooks that focus on distributed operating systems [Tane95,
Sing94,Chow97].
5.1 Introduction
An operating system is a system program that acts as an intermediary between a user of a
computing system and the computer hardware. It manages and protects the resources of
the computing system while presenting the user with a friendly user interface. Many
operating systems have been introduced in the past few decades. They can be grouped
into single user, multi-user, network, and distributed operating systems.
In a single-user environment, the operating system runs on a single computer with one
user. In this case, all the computer resources are allocated to one user (e.g., MS-DOS/Windows). The main functions include handling interrupts to and from hardware,
I/O management, memory management, and file and naming services. In a multi-user
environment, the users are most likely connected through terminals. The operating
system in this environment is quite a bit more complex. In addition to all the tasks
associated with a single user operating system, the operating system is responsible for the
scheduling of processes from different users in a fair manner, Inter-Process
Communication (IPC) and resource sharing.
A Network Operating System (NOS) is one that resides on a number of interconnected
computers and allows users to easily communicate with any other computer or user in the
system. The important factor here is that the user is aware of what machine he or she is
working on. The network operating system can be viewed as additional system
software that supports processing user requests and running applications on
remote machines with different operating systems.
A Distributed Operating System (DOS) runs on a collection of computers that are
interconnected by a computer network and it looks to its users like an ordinary
centralized operating system [tane85]. The main focus here is on providing as many types
of transparencies (access, location, name, control, data, execution, migration,
performance, and fault tolerance) as possible. In a distributed operating system
environment, the operating system gives the illusion of having a single time shared
computing system. Achieving the single system image feature is extremely difficult
because of many factors such as the lack of global clock and a system-wide state of the
resources, the geographic distribution of resources, the asynchronous and autonomous
interactions among resources, just to name a few.
What is a Network Operating System?
In a network operating system environment, as defined previously, the users are aware of
the existence of multiple computers, and can log in to remote machines and copy files
from one machine to another. Each computer runs its own local operating system and has
its own user [tane87]. Basically, the NOS is a traditional operating system with some
kind of network interface and some utility programs to achieve remote login and remote
file access. The main functions are provided by the local operating systems, and the NOS
is called by the local operating system to access the network and its resources. The major
problem with the NOS approach is that it is "force fit" over the local environment and must
provide access transparency to many varied local services and resources [fort85]. The
main functions to be provided by a NOS are [fort85]:
• Access to network resources
• Provide proper security means for the system resources
• Provide some kind of network transparency
• Control the costs of network usage
• A reliable service to all users
To implement such functions or capabilities, the NOS provides a set of low level
primitives that can be classified into four types:
• User communication primitives, commonly referred to as mail facilities, that
enable communication between users and/or processes
• Job migration primitives that allow processes or loads to be distributed in order to
balance system load and improve performance
• Data migration primitives that provide the capability to move data and programs
reliably from one machine to another
• System control primitives that provide the top level of control for the entire
network system. These primitives are responsible for knowing the configuration of the network as
well as reconfiguration and re-initialization of the entire system in the event of a
failure.
What is a Distributed Operating System (DOS)?
The basic goal of a distributed operating system is to provide full transparency by hiding
the fact that there are multiple computers and thus to unify different computers into a
single integrated computing and storage environment [dasg91]. This means that DOS
must hide the distribution aspect of the system from users and programmers. Designing a
DOS that provides all forms of transparency (access, location, concurrency, replication,
failure, migration, performance, and scaling transparency) is a challenging research
problem and most current DOS systems support a subset of these transparency forms.
The main property that distinguishes a distributed system from a traditional centralized
system is the lack of up-to-date global state and clock. It is not possible to have instant
information about the global state of the system because of the lack of a physical shared
memory, where the state of the system can be represented. Thus a new dimension is
added to the design of operating system algorithms that must address obtaining global
state information using only local information and approximate information about the
states of remote computers. Because of these difficulties, most of the operating systems
that control current distributed systems are just extensions of conventional operating
systems to allow transparent file access and sharing (e.g., SUN Network File System
(NFS)). Because these solutions are just short term, a more elegant design is preferable
to fully exploit all the potential benefits of distributed computing systems.
A distributed operating system that is built from scratch does not encounter the
problem of force-fitting itself onto an existing system, as is usually the case with a NOS. The
general structure of a DOS is shown in Figure 5.1. A DOS distributes its software evenly
across all the computers in the system, and its responsibilities can be outlined as follows
[fort85]:
• Global Inter-Process Communication (IPC)
• Global resource monitoring and allocation
• Global job or process scheduling
• Global command language
• Global error detection and recovery
• Global deadlock detection
• Global and local memory management
• Global and local I/O management
• Global access control and authentication
• Global debugging tools
• Global synchronization primitives
• Global exception handling
Figure 5.1. A general structure of a DOS (the DOS layers global user command, memory, CPU, file, and I/O management modules over the corresponding local managers on each computer, with a one-to-many mapping from each global module to its local counterparts).
DOS vs. NOS: The network operating system can be viewed as an extension of an
existing local operating system, where a distributed operating system appears to its users
as a traditional uniprocessor time shared operating system, even though it is actually
composed of multiple computers [tane92]. The transparency issue in DOS is the main
feature that creates the illusion of being a centralized time-shared computing system.
Furthermore, DOS is more complex than the NOS; a DOS requires more than just adding
a little code to a regular operating system. One of the factors that increases the
complexity of a DOS is the parallel execution of programs on multiple processors. This
greatly increases the complexity of the operating system's process scheduling algorithms.
5.2 Distributed Operating System Models
There are two main models to develop a DOS: object-based and message-based (some
authors refer to this model as process-based) models. The object-based model views the
entire system and its resources in terms of objects. These objects consist of a type
representation and a set of methods, or operations, that can be performed on these objects.
An active object is a process and a passive object is referred to as data. In this
environment, for processes to perform operations in the system, they must have
capabilities for the objects to be invoked. The management of these capabilities is the
responsibility of the DOS. For example, a file can be viewed as an object with methods
operating on that file such as read and write operations. For a process to use this object,
it must own or have the capabilities to access that object and run some or all of its
methods. The object-based distributed operating systems have been built either on top of
an existing operating system or built from scratch.
The message-based model relies, as its name implies, on the Inter-Process Communication
(IPC) protocols to implement all the functions provided by the DOS. Within each node
in the system, a message passing kernel is placed to support both local and remote
communications. Messages between processes are used to synchronize and control all
their activities [fort85]. The system services are requested via message passing whereas
in traditional non-DOS systems like UNIX, this is done via procedure calls. For example,
these messages can be used to invoke semaphores to implement mutual exclusion
synchronization. Message-based operating systems are attractive because the policies used
to implement the main distributed operating system services (e.g., memory management,
file service, CPU management, and I/O management) are independent of the Inter-Process Communication mechanism [dasg91].
5.3 Distributed Operating System Implementation Structure
The implementation structure of an operating system indicates how the various
components of any DOS are organized. The structures used to implement DOS include
monolithic kernel, micro-kernel and object-oriented approach.
5.3.1 Monolithic Structure
In this structure, the kernel is one large module that contains all the services that are
offered by the operating system as shown in Figure 5.2. The services are requested by
storing specific parameters in well-defined areas, e.g. in registers or in the stack, and then
executing a special trap instruction. This allows the control to be transferred from the
user space to the operating system space. The service is then executed in the operating
system mode until completion and after that the control is returned back to the user
process. This kind of structuring is not well suited for the design of distributed operating
systems since there are a wide range of computer types and resources (e.g. diskless
workstations, compute servers, multi-processor systems, file servers, name servers,
database servers); in such an environment, each computer or resource is suited for some
special task and thus it is not efficient to load the same operating system on each
computer or resource; a print server, for example, will heavily utilize only the DOS services related to file
and I/O printing functions. Consequently, loading the entire DOS software onto each
computer or resource in the system will lead to wasting a critical resource (e.g. memory)
unnecessarily and thus degrading the overall performance of the system.
Figure 5.2. A Monolithic Operating System Structure (all services S1 ... S4 reside in one large kernel together with the kernel code and data; server programs are dynamically loaded on top of the kernel).
5.3.2 Microkernel Structure
In this approach, the system is structured as a collection of processes that are largely
independent of each other as shown in Figure 5.3. The heart of the operating system, the
micro-kernel, resides on each node and provides the essential functions or services
required by all other services or applications. The main micro-kernel services or
functions include communication, process management, and memory management. Thus
the micro-kernel software can be made small, efficient, and less error-prone than a
monolithic kernel. The micro-kernel structure supports the design
technique that aims at separating the operating system policy and mechanisms. The
operating system policy can be changed independently, without affecting the underlying
mechanism (e.g., the policy of granting an entity rights to a user might change depending
on the needs, whereas the underlying mechanism of implementing access control will
remain unaffected).
Figure 5.3. A Micro-kernel Operating System Structure (only the kernel code and data reside in the micro-kernel; services S1 ... S4 run as dynamically loaded server programs outside the kernel).
5.3.3 Object-Oriented Structure
In this structure, the services and functions are implemented as objects rather than being
implemented as independent processes as it is in the micro-kernel approach. Each object
encapsulates a data structure and defines a set of operations that can be carried out on that
data structure. Each object is given a type that defines the properties of the object. The
encapsulated data structure can be accessed and modified only by performing the defined
operations, referred to as methods, on that object. As in the micro-kernel approach, the
services can be built and maintained independently of other services. This structure also supports the
separation of policies from the implementation mechanisms.
5.4 Distributed Operating System Design Issues
A Distributed Operating System designer must address additional issues such as inter-process communication, resource management, naming service, file service, protection
and security, and fault tolerance. In this section, we discuss the issues and options
available in designing these services.
5.4.1 Inter-Process Communications (IPC)
In a distributed system environment processes need to communicate without the use of
shared memory. Message passing and Remote Procedure Call (RPC) are widely used to
provide Inter-Process Communication (IPC) services in distributed systems.
Message Passing
The typical implementation of the message passing model in distributed operating
systems is the client-server model [tane85]. In this model, the requesting client specifies
the desired server or service and initiates the transmission of the request across the
network. The server receives the message, responds with the status of the request, and
provides an appropriate buffer to accept the request message. The server then performs
the requested service and returns the results in a reply message. Three fundamental
design issues must be addressed in designing an IPC scheme:
1) Are requests blocking or non-blocking?
2) Is the communication reliable or unreliable?
3) Are messages buffered or unbuffered?
These issues are handled by system primitives. The selection of primitives depends
mainly on the types of transparencies to be supported by the distributed system and the
targeted applications.
In non-blocking send and receive primitives, the send primitive will return control to the
initiating process once the request has been queued in an outgoing message queue. When
the send message has been transmitted, the sending process is interrupted. This indicates
that the buffer used to store the send is now available. At the receive side the receiving
process indicates it is ready to receive the message and then goes to sleep, suspending
processing until it is awakened by the arriving message. This method is more
flexible than the blocking method. The message traffic is asynchronous and messages
may be queued for a long time. However, this complicates programming and debugging.
If a problem occurs, the asynchronous behavior makes the process of locating and fixing
the problem very difficult and time consuming. For this reason, many designers use
blocking send and receive primitives to implement IPC in distributed operating systems.
Blocking sends return control to the sending process only after the message is sent.
Unreliable blocking occurs if control is returned to the sending process immediately after
the message is sent, providing no guarantee that the request message has reached the
destination successfully. Reliable blocking occurs when control is returned to the sending
process only after receiving an acknowledgement message from the receiving process.
Similar to the blocking send, the blocking receive does not return control until the
message has been received and placed in the receive buffer. If a reliable receive is
implemented, the receiving process will automatically acknowledge the receipt of the
message.
Most distributed operating systems support both blocking and non-blocking send and
receive primitives. The common method is to use a non-blocking send with a blocking
receive. In this case, the assumption is that the transmission of the message is reliable and there is
no need to block the sending process once the message is copied into the transmitter
buffer. The receiver, on the other hand, needs to get the message to perform its work
[chow97]. However, there are situations where the receiver is handling requests from
several senders. It is therefore desirable to use non-blocking receive in such cases.
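
To make the non-blocking-send/blocking-receive combination concrete, the following minimal C sketch uses UDP datagram sockets, where sendto() returns as soon as the message has been copied into the kernel buffer (a non-blocking-style send) and recvfrom() suspends the receiver until a message arrives. The port number and message handling shown here are arbitrary choices for illustration, not part of any particular distributed operating system:

    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>

    /* Sender: sendto() copies the datagram into the kernel buffer and returns
       immediately, so the sending process is not blocked on the receiver. */
    void send_request(const char *server_ip, int port, const char *msg)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in dst = {0};
        dst.sin_family = AF_INET;
        dst.sin_port = htons(port);
        inet_pton(AF_INET, server_ip, &dst.sin_addr);
        sendto(s, msg, strlen(msg), 0, (struct sockaddr *)&dst, sizeof(dst));
        /* ... the client continues its own work here ... */
        close(s);
    }

    /* Receiver: recvfrom() suspends the process until a datagram arrives. */
    void serve_requests(int port)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in me = {0};
        me.sin_family = AF_INET;
        me.sin_addr.s_addr = htonl(INADDR_ANY);
        me.sin_port = htons(port);
        bind(s, (struct sockaddr *)&me, sizeof(me));
        for (;;) {
            char buf[256];
            ssize_t n = recvfrom(s, buf, sizeof(buf), 0, NULL, NULL);
            if (n > 0) {
                /* perform the requested service and send a reply here */
            }
        }
    }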
Another important design issue is whether or not to provide message buffering. If no
buffering of messages is used, both the sender and the receiver must be synchronized
before the message is sent. This approach requires establishing the connection explicitly
in order to eliminate the need to buffer the message and thus the message size can be
large. When a buffer is used, the operating system allows the sender to store the message
in the kernel buffer. Using non-blocking sends, it is possible to have multiple outstanding
send messages.
In general it is a good practice to minimize the number of communication primitives
allowed in the distributed operating system design. Most distributed operating systems
adopt a request/response communication model that requires only three communication
primitives (e.g., Amoeba). The first primitive is used to send a request to a server and to
wait on the reply from the server. The second primitive is to receive the requests that are
sent by the clients. The third primitive is used by the server to send the reply message to
the client after the request has been processed.
Remote Procedure Call
The use of remote procedure calls for inter-process communication is widely accepted.
The use of procedures to transfer control from one procedure to another is simple and
well understood in high-level language programming. This mechanism has been extended
to be used in distributed operating systems by providing that the called procedure will, in
most cases, activate and execute on another computer. The transition from a local process
to a remote procedure is performed transparently to the user and is the responsibility of
the operating system. Although this method appears simple, a few important issues must
be addressed. The first one is the passing of parameters. Parameters could be passed
either by value or reference. Passing parameters by value is easy and involves including
the parameters in the RPC message. Passing parameters by reference is more complicated
as we need unique global pointers to locate the RPC parameters. Passing parameters is
even more complicated if heterogeneous computers are used as each system may use
different data representations. One solution is to convert the parameters and data to a
standard data representation (XDR) before the message is sent. Generally speaking, the
main design issues of an RPC facility can be summarized in the following points [Pata91]:
• RPC Semantics: This defines precisely the semantics of a call in the presence of
computer and communication failures.
• Parameter Passing: This describes the techniques used to pass parameters between
caller and callee procedures. This defines the semantics of address-containing
arguments in the absence of shared address space (if passing parameters by reference
is used).
• Language Integration: This defines how the remote procedure calls will be integrated
into existing or future programming systems.
• Binding: This defines how a caller procedure determines the location and the identity
of the callee.
• Transfer Protocol: This defines the type of communication protocol to be used to
transfer control and data between the caller and the callee procedures.
• Security and Integrity: This defines how the RPC facility will provide data integrity
and security in an open communication network.
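
As an illustration of the parameter passing issue discussed above, the following C sketch marshals an (integer, string) argument pair by value into a request buffer, converting the integer fields to a machine-independent representation (network byte order is used here as a stand-in for a full XDR encoding). The function name and field layout are invented for this example and do not follow any particular RPC package:

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>

    /* Marshal a 32-bit integer and a string, both passed by value, into a
       request buffer.  Integers are converted to network byte order so that
       heterogeneous machines interpret them identically.
       Returns the number of bytes written, or -1 if the buffer is too small. */
    int marshal_read_args(uint8_t *buf, size_t buflen, int32_t fd, const char *path)
    {
        uint32_t plen = (uint32_t)strlen(path);
        if (buflen < 8 + plen)
            return -1;

        uint32_t net_fd   = htonl((uint32_t)fd);
        uint32_t net_plen = htonl(plen);
        memcpy(buf,     &net_fd,   4);     /* first parameter                  */
        memcpy(buf + 4, &net_plen, 4);     /* string length prefix             */
        memcpy(buf + 8, path, plen);       /* string contents, copied by value */
        return (int)(8 + plen);
    }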
5.4.2 Naming Service
Distributed system components include a wide range of objects or resources (logical and
physical) such as processes, files, directories, mailboxes, processors, and I/O devices.
Each of these objects or resources must have a unique name that can be presented to the
operating system when that object or resource is needed [Tane85]. The naming service
can be viewed as a mapping function between two domains that can be completed in one
or more steps. This mapping function does not have to be unique. For example, if a user
requests a service and several servers can perform the given service, then each server
might possess the same name. Naming is handled in a centralized operating system by
maintaining a table or a database that provides the name to object conversion or mapping.
Distributed operating systems may implement the naming service using either a
centralized, hierarchical or distributed approach.
Centralized Name Server: In this approach, a single name server accepts names in one
domain and maps them to another name understood by the system. In a UNIX
environment, this represents the mapping of an ASCII file name into its I-node number.
A server or a process must first register its name in the database to publicly advertise the
availability of the service or process. This approach is simple and effective for small
systems. If the system is very large either in the number of available objects or in the
physical distance covered, then the single system database is not sufficient. This approach
is also not practical because it introduces a single point of failure, making the whole system
unavailable when the name server crashes.
Hierarchical Name Server: Another method of implementing the name server is to
divide the system into several logical domains. Logical domains maintain their own
mapping table. The name servers can be organized using a global hierarchical naming
tree. This approach is similar to the mapping technique used in a telephone network in
which a country code is followed by an area code followed by an exchange code all
preceding the user's phone number. An object can be located by finding which
domain or sub-domain it resides in. The domain or sub-domain will perform the local
mapping and locate the requested object. Another method is to locate objects by using a
set of name-pointer pairs that point to either the physical location of the object or to the next name
server that might contain the physical location of the object. In this manner, several mapping
tables might be searched to locate an object.
Distributed Name Server: A final method that may be used to implement the name
server is to allow each computer or resource in the distributed system to implement the
name service function. Specifically, each machine on the network will be responsible for
managing its own names. Whenever a given object is to be found, the machine requesting
a name will broadcast that name on the network if it is not in its mapping table. Each
machine on the network will then search its local mapping table. If a match exists, a reply
is sent to the requesting machine. If no match is found, no reply is sent by that machine.
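
The broadcast lookup used by a distributed name server can be sketched as follows. This is only an illustration: the wire format (the query is simply the object name, and a machine holding the mapping replies with it), the port number, and the two-second timeout are invented for the example:

    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <netinet/in.h>
    #include <unistd.h>

    /* Broadcast a name query on the local network and wait briefly for a reply.
       Machines that do not hold a mapping for the name stay silent.           */
    int resolve_by_broadcast(const char *name, char *reply, size_t rlen)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        int on = 1;
        setsockopt(s, SOL_SOCKET, SO_BROADCAST, &on, sizeof(on));

        struct sockaddr_in all = {0};
        all.sin_family = AF_INET;
        all.sin_port = htons(9999);                      /* arbitrary port     */
        all.sin_addr.s_addr = htonl(INADDR_BROADCAST);
        sendto(s, name, strlen(name), 0, (struct sockaddr *)&all, sizeof(all));

        struct timeval tv = {2, 0};                      /* give up after 2 s  */
        setsockopt(s, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

        ssize_t n = recvfrom(s, reply, rlen - 1, 0, NULL, NULL);
        close(s);
        if (n <= 0)
            return -1;                                   /* no machine knows it */
        reply[n] = '\0';
        return 0;
    }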
5.4.3 Resource Management
Resource management concerns making both local and remote resources available to a
user in an efficient and transparent manner. The location of the resources is hidden from
the user and remote and local resources are accessed in the same manner. The major
difference in resource management in a distributed system as compared to that of a
centralized system is that global state information is not readily available and is difficult
to obtain and maintain [tane85]. In a centralized computing system, the scheduler has full
knowledge of the processes running in the system and global system status is stored in a
large central database. The resource manager can optimize process scheduling to meet
the system objective. In a distributed system, there is no centralized table or global state.
If it did exist, it would be difficult or impossible to maintain, because of the lack of
accurate system status and load. Process scheduling in distributed systems is a
challenging task compared to that of a centralized computing system.
In spite of the resource management difficulties in distributed systems, a resource
manager of some type must exist to allocate processors to users, schedule processes to
processors, balance the load of the system processors, and detect deadlock situations
whenever they occur in the system. The main tasks of resource management include
processor allocation, process scheduling, load balancing, and load scheduling.
Processor Allocation
The processing units are the most important resource to be managed. In a processor pool
model a large number of processors are available for use by any process. When processes
are submitted to the pool, the following information is required: process type, CPU and
I/O requirements, and memory requirement. The resource manager then determines the
number of processors to be allocated to the process. One can organize the system
processors into a logical hierarchy that consists of multiple masters, each master
controlling a cluster of worker processors [tane85]. In this hierarchy, the master keeps
track of how many of its processors are busy. When a job is submitted, the masters may
collectively decide how the process may be split up among the workers. When the
number of masters becomes large, a node or a committee of nodes is designated to
oversee the processor allocation for the entire system. This eliminates the possibility of a
single point failure. When a master fails, one of the worker processors is designated or
promoted to perform the responsibilities of the failed master. Using a logical hierarchy,
the next step is to schedule the processes among the processors.
Scheduling
Process scheduling is important in multiprogramming environment to maximize the
utilization of system resources and improve performance. Process scheduling is
extremely important in distributed systems because of the possibility of idle processors.
Distributed applications that consist of several concurrent tasks need to interact and
synchronize their activities during execution. In such cases, the process or task scheduler
needs to allocate the concurrent tasks to run on different processors at the same time. For
example, consider tasks T1 and T2 (Figure 5.4) that communicate with one another and
are loaded on separate processors.
Figure 5.4. An Example of Process Scheduling (two processors; time slices T through 5T; the X marks show the slices in which tasks T1 and T2 are scheduled to run on each processor).
In this example, T1 and T2 begin their execution at time slots T and 2T, respectively. In
this case, Task T2 will not receive the message sent by Task T1 during time-slice T until
the next time-slice 2T, when it begins execution. The best timing for the entire request-reply scenario in this case would be 2T. If, however, these two tasks are scheduled to run
during time slice 4T on both processors, the best timing for the request-reply scenario is
one T instead of 2T. This type of process scheduling across the distributed system
processors, which takes into consideration the dependencies among processes, is a
challenging research problem.
Another approach for process scheduling is to group processes based on their
communications requirements. The scheduling algorithm will then ensure that processes
belonging to one group are context switched and executed simultaneously. This requires
efficient synchronization mechanisms to notify the processors when tasks need to be
context switched.
Load Balancing and Load Scheduling
Load balancing improves the performance of a distributed system by moving data,
computation, or processes so that heavily loaded processors can offload some of their
work to lightly loaded ones. Load balancing has a more stringent requirement than load
scheduling because it strives to keep the loads on all computers roughly the same
(balance the load). We will not distinguish between these two terms in our analysis of
this issue. Load balancing seems intuitive and is almost taken for granted when first
designing a distributed system. However, load balancing and process migration incur
high overhead and should be done only if the performance gain is larger than the required
communication overhead. Load balancing techniques can be achieved by migrating data,
computation, or entire processes.
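
A minimal sketch of the cost/benefit test behind such a migration decision is shown below. The structure fields and the estimates they hold are hypothetical; a real system would obtain them from its load monitoring service, but the point is simply that migration pays off only when the expected gain exceeds the transfer overhead:

    /* Rough estimates supplied by a (hypothetical) load monitoring service. */
    struct migration_estimate {
        double local_finish_time;   /* expected completion time if it stays    */
        double remote_finish_time;  /* expected completion time if it moves    */
        double state_size_bytes;    /* process image plus open-file state      */
        double network_bandwidth;   /* bytes per second between the two nodes  */
    };

    /* Migrate only when the expected gain outweighs the one-time cost of
       shipping the process state across the network.                          */
    int should_migrate(const struct migration_estimate *e)
    {
        double transfer_cost = e->state_size_bytes / e->network_bandwidth;
        double gain = e->local_finish_time
                      - (e->remote_finish_time + transfer_cost);
        return gain > 0.0;
    }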
Data Migration: In data migration, the distributed operating system brings the required
data to the computation site. The data in question may either be the contents of a file or a
portion of the physical memory. If file access is requested, the distributed file system is
called to transfer the contents of a file. If the request is for contents of the physical
memory, the data is moved using either message passing or distributed shared memory
primitives.
Computation Migration: In computation migration, the computation migrates to
another location. A good example of computation migration is the RPC mechanism,
where the computations of the requested procedure are performed on a remote computer.
There are several cases in which computation migration becomes more efficient or
provides more security. For example, a routine to find the size of a file should be executed
at the computer where the file is stored rather than transferring the file to the executing
site. Similarly, it is safer to execute routines that manipulate critical system data
structures at the site where these data structures reside rather than transmitting them over
a network where an intruder may tap the message.
Process Migration: In process migration, entire processes are transferred from one
computer and executed on another computer. Process migration leads to better utilization
of resources especially when the process is moved from a heavily loaded computer to a
lightly loaded one. A process may also be relocated due to the unavailability of a
component critical for its execution (e.g. a math coprocessor).
5.4.4 File Service
The file system is the part of the operating system that is responsible for storing and
retrieving stored data. The file system is decomposed into three important functions: disk
service, file service, and directory service [tane85]. The file system characteristics
depend heavily on how these functions are implemented. One extreme approach is to
implement all the functions as one program running on one computer. In this case, the
file system is efficient but inflexible. The other extreme is to implement each function
independently so we can support different disk and file types. However, this approach is
inefficient because these modules communicate with each other using inter-process
communication services.
A common file system for a distributed system is typically provided by file servers. File
servers are high performance machines with high storage capacity that offer file system
service to other machines (e.g., diskless workstations). The advantages of using common
file servers include lower system costs (one server can serve many computers), data
sharing, and simplified administration of the file system. The file service can be
distributed across several servers to provide a distributed file service (DFS). Ideally, DFS
should look to users as a conventional unified file system. The multiplicity of data and
the geographic dispersion of data should be transparent to the users. However, DFS
introduces new problems that distributed operating systems must address such as
concurrent access, transparency, and file service availability.
Concurrency control algorithms aim at making parallel access to a shared file system
equivalent to a sequential access of that file system. Most of the concurrency control
algorithms used in database research have also been proposed to solve the concurrency
problem in file systems. Various degrees of transparency such as location and access
transparency are desirable in a DFS. By providing location transparency we hide the
physical location of the file from the user. The file name should be location independent.
Additionally, users should be able to access remote files using the same set of operations
that are used to access local files (access transparency). Other important issues that a
distributed file system should address include performance, availability, fault tolerance,
and security.
5.4.5 Fault Tolerance
Distributed systems are potentially more fault tolerant than a non-distributed system
because of inherent redundancy in resources (e.g., processors, I/O systems, network
devices). Fault tolerance enables a computing system to successfully continue operations
in spite of system component failures (hardware or software resources). Fault-intolerant
systems crash if any system failure occurs. There are two approaches to making a
distributed system fault tolerant. One is based on redundancy and the second is based
on atomic transactions [tane85].
Redundancy Techniques: Redundancy is in general the most widely used technique to
implement fault-tolerance. The inherent redundancy in a distributed system makes it
potentially more fault tolerant than a non-distributed system. Fault tolerance techniques
involve detecting fault(s) once they occur, locating and isolating the faulty components,
and then recovering from the fault(s). In general, fault detection and recovery are the
most important and difficult to achieve [anan91].
System failures may either be hardware or software errors. Hardware errors are usually
Boolean in nature and cause software using this hardware to stop working. Software
errors consist of programming errors and specification errors. Specification errors occur
when a software system successfully meets its specification but fails because the program
was incorrectly specified. Programming errors occur when the program fails its
specification because of human errors in the programming or in loading that program.
Most research that addresses software errors has focused on programming errors because
specification errors are difficult to characterize and detect.
One method to tolerate hardware failures is to replicate process execution on more than
one processor. Consequently, the hardware failure of one processor does not eliminate
the functionality of the process because its mirror process will continue its execution on a
fault-free redundant processor. If replicated processes signal the failure event of one
process, the system will then allocate the failed process to another processor in order to
maintain the same level of fault tolerance. However, pure process redundancy cannot
tolerate programming errors. If the process fails due to a programming error, the other
replicated processes will produce the same error. Version redundancy has been proposed
to eliminate programming errors. Typically, version
redundancy is used in applications in which the operation of the system is critical and
debugging and maintenance is not possible. For example, in deep space exploration crafts
the system is supposed to work without error for long periods of time. Formal
specifications are given to more than one programmer who independently develops code
to satisfy the formal specification of the application. During application execution the
redundant versions of the application run concurrently and voting is used to mask any
errors. If one version experiences a programming error it is very unlikely that the other
versions will have the same error. Therefore, the programming error is detected and
tolerated.
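
The voting step of version redundancy can be sketched in a few lines of C. The sketch assumes three independently developed versions of the same computation (N-version programming with N = 3), each mapping an input to an integer result; if no two versions agree, the error is reported rather than masked:

    typedef int (*version_fn)(int input);   /* one independently written version */

    /* Run all three versions and return the majority result through *out.
       Returns 0 when a majority exists and -1 when all three disagree.        */
    int vote3(version_fn v1, version_fn v2, version_fn v3, int input, int *out)
    {
        int r1 = v1(input), r2 = v2(input), r3 = v3(input);

        if (r1 == r2 || r1 == r3) { *out = r1; return 0; }   /* majority found */
        if (r2 == r3)             { *out = r2; return 0; }
        return -1;                                           /* no majority    */
    }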
Atomic Transactions: An atomic transaction is one that runs to completion or not at all.
If an atomic transaction fails, the system is restored to its initial state. To achieve atomic
operations the system must rely on reliable services such as careful read and write
operations, stable storage, and stable processor [tane85].
Careful Disk Operations: At the disk level, the common operations of WRITE and
READ simply store a new block of data and retrieve a previously stored block of data
respectively. Built on these common primitives, a data abstraction of
CAREFUL_WRITE and CAREFUL_READ can be implemented. When a
CAREFUL_WRITE operation calls the WRITE service of the disk, a block of data is
written to the disk and then a READ service is called to immediately read back the
written block to ensure it was not written to a bad sector on the disk. If, after a
predetermined number of attempts, the READ continues to fail, the disk block is declared
bad. Even after writing correctly to the disk, a block can go bad. This can be detected by
checking the parity check field of any read block during the CAREFUL_READ
operation.
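
A minimal C sketch of the CAREFUL_WRITE abstraction follows. The low-level disk_write() and disk_read() primitives are assumed to be supplied by the disk driver (they are only declared here), and the block size and retry count are arbitrary:

    #include <string.h>

    #define BLOCK_SIZE  4096
    #define MAX_RETRIES 3

    /* Assumed low-level primitives; each returns 0 on success. */
    int disk_write(int block_no, const char data[BLOCK_SIZE]);
    int disk_read(int block_no, char data[BLOCK_SIZE]);

    /* CAREFUL_WRITE: write the block, then immediately read it back and compare.
       After MAX_RETRIES failed attempts the block is declared bad.             */
    int careful_write(int block_no, const char data[BLOCK_SIZE])
    {
        char verify[BLOCK_SIZE];

        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            if (disk_write(block_no, data) != 0)
                continue;                              /* write failed, retry  */
            if (disk_read(block_no, verify) == 0 &&
                memcmp(data, verify, BLOCK_SIZE) == 0)
                return 0;                              /* block verified GOOD  */
        }
        return -1;                                     /* block declared BAD   */
    }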
Stable Storage: On top of the CAREFUL_WRITE and CAREFUL_READ abstractions,
the idea of stable storage can be implemented. The stable storage operations mirror all
data on more than one disk so as to minimize the amount of data lost in the event of a
disk failure. When an application attempts to write to stable storage, the stable storage
abstraction first attempts a CAREFUL_WRITE to what is known as the primary disk. If
the operation completes without error, a CAREFUL_WRITE is attempted on the
secondary disk. In the event of a disk crash the blocks on both the primary and secondary
disks are compared. If the corresponding disk blocks are the same and GOOD, nothing
further needs to be done with these blocks. On the other hand, if one is BAD and the
other is GOOD, the BAD block is replaced by the data from the GOOD block on the
other disk. If the disk blocks are both GOOD but the data is not identical, the data from
the primary disk is written over the data from the secondary disk. The reason for the
latter is that the crash must have occurred between a CAREFUL_WRITE to the primary
disk and a CAREFUL_WRITE to the secondary disk.
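
Building on the careful operations, a sketch of the stable-storage write and its crash-recovery comparison is shown below. The per-disk helpers careful_write_on() and careful_read_on() are assumed rather than implemented, and the recovery rules follow the description above: repair a BAD copy from the GOOD one, and let the primary win when both copies are GOOD but differ:

    #include <string.h>

    #define BLOCK_SIZE 4096
    #define PRIMARY    0
    #define SECONDARY  1

    /* Assumed per-disk versions of the careful operations above; each returns
       0 when the block is GOOD.                                               */
    int careful_write_on(int disk, int block_no, const char data[BLOCK_SIZE]);
    int careful_read_on(int disk, int block_no, char data[BLOCK_SIZE]);

    /* Stable write: update the primary copy first, then the secondary copy.   */
    int stable_write(int block_no, const char data[BLOCK_SIZE])
    {
        if (careful_write_on(PRIMARY, block_no, data) != 0)
            return -1;                      /* secondary copy left untouched    */
        return careful_write_on(SECONDARY, block_no, data);
    }

    /* Crash recovery for one block pair, following the rules described above. */
    void recover_block(int block_no)
    {
        char p[BLOCK_SIZE], s[BLOCK_SIZE];
        int p_good = (careful_read_on(PRIMARY,   block_no, p) == 0);
        int s_good = (careful_read_on(SECONDARY, block_no, s) == 0);

        if (p_good && s_good) {
            if (memcmp(p, s, BLOCK_SIZE) != 0)        /* both GOOD but different */
                careful_write_on(SECONDARY, block_no, p);   /* primary wins      */
        } else if (p_good) {
            careful_write_on(SECONDARY, block_no, p); /* repair the BAD secondary */
        } else if (s_good) {
            careful_write_on(PRIMARY, block_no, s);   /* repair the BAD primary   */
        }
    }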
Stable Processor: Using stable storage, processes may checkpoint themselves
periodically. In the event of processor failure, the process running on the faulty processor
can restore its last check-point state from stable storage. Given the existence of stable
storage and fault tolerant processors, atomic transactions can be implemented. When a
process wishes to make changes to a shared database, the changes are recorded in stable
storage on an intention list. When all of the changes have been made, the process issues a
commit request. The intention list is then written to memory. Using this method, all the
intentions of a process are stored in stable storage and thus may be recovered by simply
examining the contents of outstanding intention lists in stable storage when a processor is
brought on line.
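
The intention-list idea can be sketched as follows. The helpers log_append(), log_commit(), and apply_change() are hypothetical names assumed to be built on the stable-storage operations above; the sketch only shows the ordering that makes the update atomic:

    /* One intended change: install new_data at block_no of the shared store. */
    struct change {
        int         block_no;
        const char *new_data;
    };

    int log_append(const struct change *c);   /* record one intention durably  */
    int log_commit(void);                     /* write the commit record       */
    int apply_change(const struct change *c); /* install the change in place   */

    int commit_transaction(const struct change *changes, int n)
    {
        for (int i = 0; i < n; i++)           /* 1. record every intention     */
            if (log_append(&changes[i]) != 0)
                return -1;                     /* abort: nothing was applied   */

        if (log_commit() != 0)                 /* 2. the commit point          */
            return -1;

        for (int i = 0; i < n; i++)            /* 3. apply the intentions; a crash
                                                  here is safe because recovery
                                                  replays the logged intentions */
            apply_change(&changes[i]);
        return 0;
    }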
5.4.6 Security and Protection Service
The distributed operating system is responsible for the security and the protection of the
overall system and its resources. Security and protection are necessary to avoid
unintentional or malicious attempts to harm the integrity of the system. A wild pointer in
a program may unintentionally overwrite part of a critical data structure. Any system may
face a threat from a misguided user attempting to break into the system. Two issues that
must be dealt with in the design of security measures for a system are authentication and
authorization.
Authentication: Authentication is making sure that an entity is what it claims to be. For
example, a person knowing another one's password on the system can log in as that
person. The operating system has no way of knowing whether the right person or an
imposter has logged in. Password protection is sufficient in general, but high security
systems might resort to physical identification, voice identification, cross examination, or
user profiling to authenticate a user. Authentication is especially important in distributed
systems as a person may tap the network and pose as a client or a server. Encryption of
transmitted data is a technique used to deter such attempts.
Authorization: On the other hand, authorization is granting a process the right to
perform an action or access a resource based on the privileges it has. Privileges of a user
may be expressed either as an Access Control List (ACL) or as a Capabilities List (C-list). Only a process having an access right or a capability for an object may access that
object. The ACLs and C-lists are protected against illegal tampering using a combination
of hardware and software techniques. In object based distributed operating systems, the
object name can be in the form of a capability. The capability is a data structure that
uniquely identifies an object in the system as well as the object manager, the set of
operations that can be performed on that object, and provides the required information to
control and protect access to that object.
5.4.7 Other Services
Various other services have been implemented by distributed operating systems such as
time service, gateway service, print service, mail service, and boot service just to name a
few. New services can also be considered depending on the user requirements and system
configuration.
5.5 Distributed Operating System Case Studies
In general, distributed operating systems can be classified into two categories: message
passing and object-oriented distributed operating systems. Examples of message-passing distributed systems include LOCUS, MACH, V, and Sprite. Examples of object-oriented
distributed systems include Amoeba, Clouds, Alpha, Eden, X-kernel, WebOS, 2K and the
WOS. The operating systems to be discussed in this section include LOCUS, Amoeba, V,
MACH, X-kernel, WebOS, 2K and WOS. These distributed operating systems were
selected as a cross-section of current distributed operating systems with certain
interesting implementation aspects. Each distributed operating system is briefly
overviewed by highlighting its goals, advantages, system description, and
implementation. Additionally, we discuss some of the design issues that characterize
these distributed operating systems.
5.5.1 LOCUS Distributed Operating System
Goals and Advantages
The Locus distributed operating system was developed at UCLA in the early 1980’s
[walk83]. Locus was developed to provide a distributed and highly reliable version of a
Unix compatible operating system [borg92]. This system supports a high degree of
network transparency (namely location and concurrency transparency), a transparent file replication system,
and an extensive fault tolerance system. The main advantages of Locus are its reliability
and the support of automatic replication of stored data. This degree of replication is
determined dynamically by the user. Even when the network is partitioned, the Locus
system remains operational because of the use of a majority consensus system. Each
partitioned sub-network maintains connectivity and consistency among its members
using special protocols (partition, merge, and synchronization protocols).
System Description Overview
Locus is a message-based distributed operating system implemented as a monolithic
operating system. Locus is a modification and extension of the Unix operating system
and load modules of Unix systems are executable without recompiling [borg92]. The
communication between Locus kernels is based on point-to-point connections that are
implemented using special communication protocols. Locus supports a single global
virtual address space. File replication is achieved using three physical containers for each
logical file group and replica access is controlled using a synchronization protocol. Locus
supports remote execution and allows pipes and signal primitives to work across a
network. Locus provides robust fault tolerance capabilities that enable the system to
continue operation during network or node failures and has algorithms to make the
system consistent by merging the states from different partitions.
Resource Management
LOCUS provides transparent support for remote processes by providing facilities to
create a process on a remote computer, initialize the process appropriately, support
execution of processes on a cluster of computers, support the inter-process
communication as if it were on a single computer, and support error handling. The site
selection mechanism allows for executing software as either a local or remote site with
no software change required. The user of a process determines its execution location
through information associated with the calling process or shell commands. The Unix
commands fork and exec are used to implement local and remote processes. For increased
performance a run call has been added which has the same effect as a fork followed by an
exec.
Run avoids the need to copy the parent process image and includes
parameterization to allow the user to set up the environment for the new process,
regardless of whether it is local or remote.
In Unix, pipes and signal mechanisms rely on shared memory for their implementations.
In order for Locus to maintain the semantics of these primitives when they are
implemented and run on networked computers, it is required to support shared memory.
Locus implements the shared memory mechanism by using tokens.
File System
The Locus file system presents a fully transparent single naming hierarchy to both users
and applications. The Locus file system is viewed as a superset of the Unix file system.
It has extended the Unix system in three areas. First, the single tree structure covers all
the system objects on all the computers in the system. This means that you cannot
determine the location of a file from its name. Location transparency allows data and
programs to be accessed or executed from any location in the system with the same set of
commands regardless of whether they are local or remote. Second, Locus supports file
replication in a transparent manner; the Locus system is responsible for keeping all
copies up to date, assuring all requests are served by the most recent available version,
and supporting operations on partitioned networks. Third, the Locus file system must
support different forms of error and failure management.
One unique feature in Locus file system is its support of file replication and mechanisms
provided to achieve transparent access, read, and write operations on the replicated files.
The need for file replication stems from several sources. First, it improves the user
access performance. This is most evident when reading files. However, updates also
show improvement even with the associated cost of maintaining consistency across the
network. Second, the existence of a local copy significantly improves file access
performance as remote file access is expensive even on high speed networks.
Furthermore, replication is essential in supporting various characteristics of a distributed
system such as fault tolerance and system data structures. However, file replication does
come with a cost. When a file is updated and many copies of that file exist in the system,
the system must ensure that all local copies are current. Additionally, the system
must determine which copy to use when that file needs to be accessed. If the replicated
files are not consistent due to network partitions and hardware failures, a version number
is used to ensure a process accesses the most up-to-date copy of a file. A good data
structure to be replicated is the local file directory since it has a high ratio of read to write
accesses.
Locus defines three roles that a site may take during file access (Figure 5.5):
• Using Site (US). This is the site from which the file access originates and to which
the file pages are sent.
• Storage Site (SS). This is the site at which a copy of the requested file is stored.
• Current Synchronization Site (CSS). This site forces a global synchronization
policy for the file and selects SS’s for each open request. The CSS must know which
sites store the file and what the most current version of the file is. The CSS is
determined by examining the logical mount table.
Any individual site can operate in any combination of these logical sites.
Figure 5.5. LOCUS Logical File Sites (the US sends an open request (1) to the CSS, the CSS asks a candidate storage site to act as the SS (2), the SS responds to the CSS (3), and the CSS responds to the US (4)).
File access is achieved by several primary calls: open, create, read, write, commit, close,
and unlink. The following sequence of operations is required to access any file:
1. US makes an OPEN request to the CSS site
2. CSS makes a request for a storage site SS
3. SS site responds to the CSS request message
4. CSS selects a storage site (SS) and then informs the using site (US)
After a file has been opened, the user can read the file by issuing a read call. These read
requests are serviced using kernel buffers and are distinguished as either local or remote.
In the case of a local file access, the requested data from external storage is stored in the
operating system buffer and then copied into the address space of the user buffer. If the
request was for a remote file, the operating system at the using site (US) allocates a
buffer and queues a remote request to be sent over the network to the SS. A network read
has the following sequence:
1. US requests a page in a file
2. SS responds by providing the requested page
The close system call uses the opposite sequence of operations. The write request follows
the same steps taken for the open request once we replace the read requests with write
requests to the selected storage site. It is important to note that file modifications are done
atomically using a shadowing technique. That is, the operating system always possesses a
complete copy of the original file or a completely changed file, never a partially modified
one.
Fault Tolerance
Locus supports fault tolerance operations based on redundancy. Error detection and
handling are integrated with a process description data structure. When one of the
interacting processes experiences a failure (e.g., child and parent processes), an error
signal is generated to notify the other fault-free process to modify its process data
structure. The second important issue that must be addressed is recovery. The basic
recovery strategy in Locus is to keep all the copies of a file within a single partition
current and consistent. Later, these partitions are merged using merge protocols.
Conflicts among partitions are resolved through automatic or manual means.
The Locus system supports reconfiguration. This enables the system to reconfigure the
network and system resources as necessary in response to faults. The reconfiguration
protocols used in Locus assume a fully connected network. Thus for reconfiguration
Locus utilizes a partition protocol and a merge protocol. Partitions within Locus are an
equivalence class requiring all members to agree on the state of the network. Locus uses
an iterative intersection process to find the maximum partition. As partitions are
established a merge is executed to join several partitions.
Other Services
The Locus directory system is used to implement the name service, interprocess
communication, and remote device access. Locus has no added security and protection
features other than those supported in the Unix operating system.
5.5.2 Amoeba Distributed Operating System
Goals and Advantages
The Amoeba Distributed Operating System was developed at the Free University and
Center for Mathematics and Computer Science in Amsterdam [mull90]. Amoeba’s main
goal was to build a capability based, object-oriented distributed operating system to
support efficient and transparent distributed computing across a computer network. The
main advantages of Amoeba are its transparency, scalability, and the support of high
performance parallel and distributed computing over computer clusters.
System Description Overview
The Amoeba system is an object-oriented capability-based distributed operating system.
The Amoeba micro-kernel runs on each machine in the system and handles
communication, I/O, and low-level memory and process management. Other operating
system functions are provided by servers running as user programs. The Amoeba
hardware consists of four principal components (Figure 5.6): workstations, the processor
pool, specialized servers, and the gateway. The workstations run computation intensive
and interactive tasks such as window management software and CAD/CAM applications.
The processor pool represents the main computing power in the Amoeba system and is
formed from a group of processors and CPUs. This pool can be dynamically allocated as
needed to the user and system processes and returned to the pool once their assigned
tasks are completed. For instance, this pool can run computation intensive tasks that do
not require frequent interaction. The third component consists of specialized servers that
run special programs or services such as file servers, database servers, and boot servers.
The fourth component consists of gateways that are used to provide communication
services to interconnect geographically dispersed Amoeba systems.
Figure 5.6. Amoeba Distributed Operating System Architecture (the processor pool, workstations and terminals, specialized servers including supercomputer servers, multicomputers, and processor arrays, and a gateway connecting the LAN to a WAN).
The Amoeba software architecture is based on the client/server model. Clients submit
their request to a corresponding server process to perform operations on objects [mull90].
Each object is identified and protected by a capability that has 128 bits, and is composed
of four fields, as shown in Figure 5.7.
Figure 5.7. Capability in Amoeba (Service Port: 48 bits; Object Number: 24 bits; Right Field: 8 bits; Check Field: 48 bits).
The service port field is a 48-bit sparse address identifying the server process that
manages the object. The object number field is used by the server to identify the object
requested by the given capability. The 8-bit right field defines the type of operations that
can be performed on the object. The 48-bit check field protects the capability against
forging and tampering. The rights in the capability are protected by encrypting them with
a random number, storing the result in the check field. A server can check a capability by
performing the encryption operation, using the random number stored in the server's
tables, and comparing the result with the check field in the provided capability.
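As a rough illustration of this check, the sketch below lays out the four capability fields and recomputes the check value on the server side. The structure is unpacked for readability, and one_way is only a toy placeholder for the actual encryption of the rights with the server's random number; none of these names come from the Amoeba sources.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative, unpacked view of a 128-bit Amoeba capability; the real
       kernel packs these fields into 48 + 24 + 8 + 48 bits. */
    struct amoeba_cap {
        uint64_t service_port;   /* 48 bits: sparse address of the server  */
        uint32_t object_num;     /* 24 bits: object index at the server    */
        uint8_t  rights;         /*  8 bits: permitted operations          */
        uint64_t check;          /* 48 bits: protects against forging      */
    };

    /* Toy stand-in for encrypting the rights with the server's secret
       random number; a placeholder, not the real function. */
    static uint64_t one_way(uint8_t rights, uint64_t secret)
    {
        uint64_t x = ((uint64_t)rights * 0x9E3779B97F4A7C15ULL) ^ secret;
        x ^= x >> 29;
        return x & 0xFFFFFFFFFFFFULL;            /* keep 48 bits */
    }

    /* The server recomputes the check value from the presented rights and
       its stored random number and compares it with the check field. */
    static bool check_capability(const struct amoeba_cap *cap, uint64_t secret)
    {
        return one_way(cap->rights, secret) == cap->check;
    }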
The Amoeba kernel handles memory and process management, supports processes with
multiple threads, and provides inter-process communications. All other functions are
implemented as user programs. For example, the directory service is implemented in the
user space and provides a higher level of naming hierarchy that maintains a mapping of
ASCII names onto capabilities.
Inter-Process Communication
All communications in Amoeba follows the request/response model. In this scheme the
client makes a request to the server. The server performs the requested operation and then
sends the reply to the client. This communication model is built on an efficient remote
procedure call system that consists of three basic system calls:
1. get_request(req-header, req-buffer, req-size)
2. do_operation(req-header,req-buffer,req-size,rep-header, rep-buffer,rep-size)
3. send_reply(rep-header, rep-buffer, rep-size)
When a server is ready to accept requests from clients, it executes a get_request system
call that forces it to block. In this call, it specifies the port on which it is willing to
receive requests. When a request arrives, the server unblocks, performs the work using
the parameters, and sends back the reply. The reply is sent using the send_reply system
call. The client makes a request by issuing a do_operation. The caller is blocked until a
reply is received, at which time the reply parameters are populated and a status returned.
The returned status of a do_operation can be one of the following:
1. The request was delivered and has been executed.
2. The request was not delivered or executed.
3. The status is unknown.
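A minimal sketch of how these three primitives might be used is given below. The primitives are assumed to be supplied by the Amoeba library; the header type, its fields, the FILE_PORT constant, and the command value are hypothetical simplifications.

    /* Sketch of Amoeba's request/response style. */
    typedef struct { long port; long command; long status; } header_t;

    extern int get_request(header_t *req, char *req_buf, int req_size);
    extern int send_reply(header_t *rep, char *rep_buf, int rep_size);
    extern int do_operation(header_t *req, char *req_buf, int req_size,
                            header_t *rep, char *rep_buf, int rep_size);

    #define FILE_PORT 0x1234L          /* hypothetical service port */

    void server_loop(void)
    {
        header_t req, rep;
        char buf[1024];

        for (;;) {
            req.port = FILE_PORT;
            get_request(&req, buf, sizeof(buf));   /* blocks for a request */
            /* ... perform the requested operation using the parameters ... */
            rep.status = 0;
            send_reply(&rep, buf, sizeof(buf));    /* unblocks the client  */
        }
    }

    int client_call(char *data, int len)
    {
        header_t req = { FILE_PORT, 1, 0 }, rep;
        /* Blocks until the reply arrives or the status is unknown. */
        return do_operation(&req, data, len, &rep, data, len);
    }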
To enable system programmers and users to use a richer set of primitives for their
applications, a user-oriented interface has been defined. This has led to the development
of the Amoeba Interface Language (AIL), which can be used to describe operations that
manipulate objects and to support a multiple-inheritance mechanism. Stub routines are used
in Amoeba to hide the marshaling and message passing from the users. The AIL compiler
produces stub routines automatically in the C language.
Resource Management
In many applications, processes need a method to create child processes. In Unix a child
process is created using the fork primitive. An exact copy of the original process is
created. This process runs housekeeping activities and then issues an exec primitive to
overwrite its core image with a new program. In a distributed system, this model is not
attractive. The idea of first building an exact copy of the process, possibly remote, and
then throwing it away again shortly thereafter is inefficient. Amoeba uses a different
strategy. The key concepts are segments and process descriptors. A segment is a
contiguous chunk of memory that can contain code or data. Each segment has a
capability that permits its holder to perform operations on it, such as reading or writing.
A process descriptor is a data structure that provides information about a process. It
provides the process state, which is either running or stunned. A stunned process is a
process that is being debugged or migrated; that is, the process exists but does not execute
any instructions.
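The sketch below suggests what a process descriptor might carry, reusing the capability structure sketched earlier; the field names and the fixed-size segment array are illustrative assumptions rather than the actual Amoeba kernel layout.

    enum proc_state { PROC_RUNNING, PROC_STUNNED };

    struct segment_desc {
        struct amoeba_cap cap;      /* capability for the (possibly remote) segment */
        unsigned long     vaddr;    /* where the segment maps in the address space  */
        unsigned long     length;
    };

    struct process_desc {
        enum proc_state     state;        /* running or stunned                     */
        struct amoeba_cap   handler;      /* receives the state when the process is
                                             stunned, e.g. a debugger or migrator   */
        int                 nsegments;
        struct segment_desc segments[16]; /* code and data segments                 */
        /* register contents and thread state omitted for brevity                   */
    };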
When the process descriptor arrives at the machine where the process will run, the
memory server extracts the capabilities for the remote segments and fetches the code and
the data segments from where they reside. This is done by using the capabilities to
perform READ operations in the usual way; the contents of each segment are copied into
newly created local segments.
In this manner, the physical location of the machines involved becomes irrelevant. Once
all the segments have been filled in, the process can be constructed and initiated. A
capability for the process is then returned to the initiator. This capability can be used to
kill the process or it can be passed to a debugger to stun it, read and write its memory,
and so on.
To migrate a process, it must first be stunned. Once stunned, the kernel sends its state to a
handler. The handler is identified using a capability present in the process's state. The
handler then passes the process descriptor on to the new host. The new host fetches its
memory from the old host through a series of file reads. Then the process is started and
the capability returned to the handler. Finally, the handler sends a kill in its reply to the
old host. Processes trying to communicate with a migrating process get a "Process
Stunned" reply while it is stunned, and a "Process Not Here" reply when the migration is
complete. It is the responsibility of the requesting process to find the location of the
process it is attempting to contact.
File System
The Amoeba file system consists of two parts: the file service and the directory service.
The file service provides a mechanism for users to store and retrieve files, whereas the
directory service provides users with facilities for managing and manipulating
capabilities.
The file system is implemented using the bullet service. The bullet service does not store
files as collections of fixed-size disk blocks. It stores all files contiguously both on disk
and in the bullet server's memory. When the bullet server is booted, the entire i-node
table is read into memory in a single disk operation and kept there while the server is
running. When a file operation is requested, the object number field in the capability is
extracted, which is an index into the table. The file entry gives the disk address as well as
the cache address of the contiguous file as shown in Figure 5.5. No disk access is needed
to fetch the i-node and at most one disk access is needed to fetch the file itself, if it is not
in the cache. This simple design trades space for high performance.
[Figure: the bullet server's in-memory file table; each entry points to the contiguous data of a file (File 1, File 2, File 3).]
Figure 5.5 Amoeba File System
The Bullet service supports only three basic operations on files: read file, create file, and
delete file. There is no write operation, which makes files immutable. Once a file is
created, it cannot be modified; it can only be deleted and a new file created in its place.
When a file is created, the user provides all
contiguously has several advantages:
• File retrieval is carried out in one disk read
• Simplified file management and administration
• Simplified file replication and caching because of the elimination of inconsistency problems
The bullet service does not provide a high level naming service. To access a file, a
process must provide the relevant capability. Since working with 128-bit binary numbers
is not convenient for users, a directory service has been designed and implemented to
manage names and capabilities. The directory in Amoeba has a hierarchical structure
which facilitates the implementation of partially shared name spaces. Directories in
Amoeba are also treated as objects and users need capabilities to access them. Thus a
directory capability is a capability for many other capabilities. Essentially, a directory is a
map from ASCII strings onto capabilities. A process can present a string, such as a file
name, to the directory server, and the server returns the capability for that file. Using this
capability, the process can access the file.
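The two-step access pattern can be sketched as follows; dir_lookup, bullet_size, and bullet_read are hypothetical client-side helpers standing in for the real library calls, and the capability structure is the one sketched earlier.

    extern int  dir_lookup(const char *name, struct amoeba_cap *cap_out);
    extern long bullet_size(const struct amoeba_cap *cap);
    extern int  bullet_read(const struct amoeba_cap *cap,
                            long offset, char *buf, long len);

    int read_whole_file(const char *name, char *buf, long maxlen)
    {
        struct amoeba_cap cap;
        long size;

        if (dir_lookup(name, &cap) != 0)        /* ASCII name -> capability */
            return -1;
        size = bullet_size(&cap);
        if (size < 0 || size > maxlen)
            return -1;
        /* At most one disk access on the server if the file is not cached. */
        return bullet_read(&cap, 0, buf, size);
    }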
Security and Protection
In Amoeba, objects are protected using capabilities and server access is controlled by
ports. Access to objects such as files, directories, or I/O devices can be granted or denied
to users by specifying these access rights in the capabilities. The capabilities themselves
must be protected against illegal tampering by users. To hide the contents of capabilities
from unauthorized users they are encrypted using a key chosen randomly from a large
address space. This key is stored in the check field of the capability itself. The
repositories of the capabilities, the directories, can also be encrypted so that bugs in the
server or the operating system do not reveal confidential information.
The Amoeba system also provides a method for secure communications. A function box,
or F-box, is put between each computer and the network. This F-box may be
implemented in hardware as a VLSI chip on the network interface board, or as software
built into the operating system. The second approach can be used in trusted computers.
The F-box computes a simple one-way function, P = F(G): given F and G, P can easily be
found by applying F, but given P and F it is computationally intractable to determine G
(see Figure 5.6). Thus, to protect the port on which a server listens, Amoeba makes the
server port P publicly known, whereas G is kept secret.
When a server performs the operation get_request (G), the F-box computes P = F(G) and
waits for messages to arrive on P. On the other hand, when a client issues a
do_operation(P), the F-box does not carry out any transformation.
[Figure: the F-box computes P = F(G); the put-address P is publicly known while the get-address G is kept secret.]
Figure 5.6 Amoeba security based on a one-way function
If a user tries to impersonate a server by issuing a get_request (P) (G is a secret), the user
will be listening to port F(P), which is not the server's port and is useless. Similarly,
when a server sends a reply to a client, the client listens on port G' = F(P'), where P' is
contained in the message sent by the server. Thus both servers and clients can be
protected against impersonation using this simple scheme. Further, the F-box may be
used for authenticating digital signatures. The original signature S is known only to the
sender of the message, whereas S' = F(S) could be known publicly.
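The port derivation can be sketched as below. The fbox_f body is only a toy placeholder (a real F-box would use a cryptographically strong one-way function), and both function names are assumptions introduced here for illustration.

    #include <stdint.h>

    typedef uint64_t port_t;

    /* Placeholder for the one-way function F: easy to compute, assumed
       infeasible to invert.  This toy body is NOT a real one-way function. */
    static port_t fbox_f(port_t g)
    {
        g ^= g >> 33;
        g *= 0xFF51AFD7ED558CCDULL;
        g ^= g >> 33;
        return g;
    }

    /* When a server calls get_request(G), the F-box actually listens on
       P = F(G); a caller who knows only the public P cannot recover G. */
    static port_t fbox_listen_port(port_t g_secret)
    {
        return fbox_f(g_secret);
    }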
Other Services
The Amoeba system provides limited fault tolerance. The file system can maintain
multiple copies of files to improve reliability and fault tolerance. The directory service is
crucial for the functioning of the system since it is the place where all processes look for
the capabilities they need. The directory service replicates all its internal tables at
multiple sources to improve the service reliability and fault tolerance.
5.5.3 The V Distributed Operating System
Goals and Advantages
The V distributed operating system was developed at Stanford University to research
issues of designing a distributed system using a cluster of workstations connected by a
high performance network [cher88]. The V system provides remote inter-process
communication whose performance is comparable to that of local inter-process
communication. The V system uses a special communication protocol, the Versatile Message Transaction Protocol
(VMTP), which is optimized to support RPCs and client-server interactions. The V
system provides efficient mechanisms to implement shared memory and group
communications.
System Description Overview
The V distributed operating system is a microkernel based system with remote execution
and migration facilities. The V system provides resource and information services of
conventional time-shared single computers to a cluster of workstations and servers. The
V kernel is based on three principles:
1) high-performance communication is the most critical component in a distributed
system
2) uniform general purpose protocols are needed to support flexibility in building
general purpose distributed systems
3) a small micro-kernel is needed to implement the basic protocols and services (process
management, memory and communications).
The V system has an efficient communication protocol to deliver efficient data transport,
naming, I/O, atomic transactions, remote execution, migration and synchronization. The
V microkernel handles inter-process communication, process management, and I/O
management. All other services are implemented as service modules running in the user
mode. The other distributed system services are implemented at the process level in a
machine and network independent manner.
System services are accessed by application programs using procedural interfaces. When
a procedure is invoked, it attempts to execute the function in its own address space if
possible. If unsuccessful, it uses the uniform inter-process communication protocol (VIPC) to contact the appropriate service module, which implements the requested procedure.
To support a global system state, distributed shared memory has been implemented by
caching the shared state of the system at different nodes as virtual memory pages.
Inter-Process Communication:
In the V system, the kernel and VMTP are optimized to achieve efficient inter-process
communications as outlined below:
1. Supporting Short Messages: Most of the RPC traffic and 50% of V’s messages can
fit in 32 byte messages. Consequently, the V system has been optimized at the kernel
interface and network transmission level to transfer fixed-size (32 bytes) short
messages very efficiently.
2. Efficient Communication Protocol: The V system uses a transport protocol called
the Versatile Message Transaction Protocol (VMTP) that has been optimized for
request-response behavior. There is no explicit connection setup and tear down in
order to improve the performance of request/response communication services.
VMTP optimizes error handling and flow control by using the reply message as an
acknowledgment to the client request. VMTP also supports datagrams, multicast,
priority, and security.
3. Efficient Message Processing: The cost of communication processing at the kernel
level has been minimized by having process descriptors contain a template VMTP
header with some of its fields initialized at process creation. This leads to reducing the
time required to prepare the packets for network transmission.
The inter-process communication mechanisms provided are based on request/response
message and shared memory models. In the request/response model, a process may
communicate with another by sending and receiving fixed-size messages. A sending
process blocks until a reply has been received. This implementation utilizes
message passing with blocking semantics and corresponds to the RPC model. In the
message model the server is implemented as a dedicated server that receives and acts on
incoming messages. The client request is queued if the server process is found to be busy
serving other requests. The message model is preferred when the requests to a server
need to be serialized. On the other hand, the RPC mechanism is preferred when the
requests need to be handled concurrently by the server. In both cases, the receiver or the
server performs a receive to process the next message and then invokes the appropriate
procedures to handle the message and send back a reply.
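A sketch of this blocking exchange is given below. The 32-byte message size follows the description above, but the primitive names (v_receive, v_reply, v_send) and types are illustrative rather than the exact V kernel interface.

    typedef struct { char data[32]; } v_msg_t;    /* fixed-size V message      */
    typedef unsigned long vprocess_t;             /* process or group identifier */

    extern vprocess_t v_receive(v_msg_t *msg);              /* blocks for a request */
    extern void       v_reply(vprocess_t client, v_msg_t *msg);
    extern int        v_send(vprocess_t server, v_msg_t *msg); /* blocks until reply */

    void v_server_loop(void)
    {
        v_msg_t msg;
        for (;;) {
            vprocess_t client = v_receive(&msg);  /* dequeue the next queued request */
            /* ... act on the 32-byte request, possibly using CopyFrom/CopyTo
                   to move larger data segments ...                                  */
            v_reply(client, &msg);                /* unblocks the sending client     */
        }
    }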
The V system also supports distributed shared memory. In this form of inter-process
communication a process can pass a segment in its team space (address space on the
computer) to another process. After sending the message, the sender is blocked, while the
receiver can read or write its segment using the primitives, CopyFrom and CopyTo,
provided by the kernel. The sender is unblocked only when it receives a reply from the
other process, once its task is completed. This model utilizes the data segment provided
by VMTP.
The V inter-process communication also supports group communication. This is
necessary due to the number of processes working together in distributed systems. V
supports the notion of a process group - a set of processes having a common group
identifier. Any process can send messages to and receive messages from process groups.
Sending a message to a group requires a group identifier instead of a process identifier as the
parameter. The multicast feature is exploited by different V services. For instance, it is
used by the file service for replicated file updates and by the scheduler for collecting and
dispersing load information.
Resource Management:
The key resources that the kernel manages are processes, memory, and devices. Other
shared devices are managed by user-level servers. For example, printers and files are
managed by the print server and the file server respectively.
Process Management: The V kernel simplifies process management by reducing the
tasks required to create and terminate a process and migrating some of the kernel process
management tasks to user-level programs. The V system makes process initiation
independent of address space creation and initialization. This makes process creation
simply a matter of allocating and initializing a new process descriptor. Process termination is
also simplified because there are few resources at the kernel level to reclaim. Client
kernels do not inform server modules when processes are killed or terminate. Each server
module in the V system is responsible for periodically checking whether its client
processes still exist; if a client does not, the server reclaims its resources. In the file server
module, a "garbage collector" process closes files associated with dead processes.
The V kernel scheduling policy has been designed to be simple and small. In an N node
distributed system, the scheduling policy is to have all N processors run the N highest
priority processes at any given time. This requires exchanging information with other
computers in the system to implement distributed scheduling by maintaining statistics
about the load on its host. A user can run programs on the most lightly loaded machine,
thus utilizing CPU cycles on idle workstations. This is carried out transparently by the
scheduler. Users can effectively utilize most of the CPU time of the workstations in the
cluster, eliminating the need for a dedicated processor pool. Remote processes, called
guest processes, are run at a guest priority to minimize their effect on the workstation
user's processes. A user can even offload guest processes by migrating them to other
nodes. A process can be suspended by setting its priority to a low value. A facility is also
provided for freezing and unfreezing a process, which is used to control modifications to
its address space during migration.
Memory Management: The V kernel memory manager supports demand paging by
associating regions of an address space with open files. The kernel provides caching and
consistency mechanisms for these regions. A reference to an address in a region is
interpreted as a reference to the corresponding file data. A page fault is generated when a
reference is made to an address in an unbound region. On a page fault, the kernel either binds the
corresponding block in the file to the region or makes the process do a READ request to
the server that is managing the file. An address space is created simply by allocating a
descriptor and binding an open file to that address.
Device Management: Conventionally, I/O facility is considered as the primary means of
communication between programs and their environment. In the V system, inter-process
communication services are used for this purpose. In V, I/O is just a higher level protocol
used with IPC to communicate a larger grain of information. All I/O operations follow a
uniform protocol called the UIO protocol. The UIO protocol is block oriented, allowing
blocks of data to be read from and written to a device. The kernel provides device
support for disks, mice, and network interfaces through the device server. The device
server is an independent machine that implements the UIO system interface. Process
level servers for other devices are built upon the basic interfaces provided by the kernel.
Naming Service:
The V naming scheme is implemented as a three level model that consists of character
string names, object identifiers, and entity identifiers. In this model, there is no specific
name server. Each object manager implements the naming service for the set of objects it
manages. This requires the maintenance of a unique global naming scheme. This is
achieved by picking a unique global name prefix for each manager and adding it to the
name handling process group. A client can locate the object manager for a given string
by doing a multicast QueryName operation. To avoid the multicast operation each time
an operation has to be performed on an object, each program maintains a cache of name
prefix to manager bindings. This cache is initialized at the time of process initialization to
avoid delays during execution.
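The lookup path might be sketched as below, where query_name_multicast stands in for the multicast QueryName operation; caching the full name as a prefix is a simplification of the real prefix-matching scheme, and all names here are illustrative.

    #include <string.h>

    struct prefix_entry { char prefix[64]; unsigned long manager_id; };

    #define CACHE_SIZE 64
    static struct prefix_entry cache[CACHE_SIZE];
    static int cache_len;

    extern unsigned long query_name_multicast(const char *name);

    unsigned long resolve_manager(const char *name)
    {
        int i;
        for (i = 0; i < cache_len; i++)
            if (strncmp(name, cache[i].prefix, strlen(cache[i].prefix)) == 0)
                return cache[i].manager_id;          /* cache hit              */

        /* Cache miss: multicast to the name-handling process group. */
        unsigned long mgr = query_name_multicast(name);
        if (cache_len < CACHE_SIZE) {
            strncpy(cache[cache_len].prefix, name, sizeof(cache[0].prefix) - 1);
            cache[cache_len].manager_id = mgr;
            cache_len++;
        }
        return mgr;
    }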
As both the directory for the objects and the objects themselves are implemented by the
same server, consistency between the two increases greatly. The directory is replicated to
the same extent as the object itself. This eliminates the situation in which objects are
inaccessible because of the failure of the name server. Also, new services implementing
their own directories can be easily incorporated.
To avoid the overhead of character string name lookup, an object identifier is used to
refer to the object in subsequent operations. An object identifier consists of two fields
with the first one specifying the identification (ID) of the manager and the second one
specifying the ID of the object. The transport level identifier of the server can be used as
its ID to speed up its location. This model also allows the efficient allocation of local
identifiers, as the manager does not have to communicate with a name server to
implement it. The identifier just has to be unique on that server and it becomes globally
unique when it is prefixed with the manager ID.
Finally, entity identifiers are fixed-length binary values that identify transport level
endpoints. The entity identifiers identify processes and serve as process and group
identifiers. Their most important property is that they are host-address independent. Thus
processes can migrate without affecting their entity identifiers. To achieve this
independence they must be implemented globally. This gives rise to allocation problems
as it requires cooperation among all instantiations of the kernel. Entity identifiers are
mapped to host addresses using a mechanism similar to that used for mapping character
string names. A cache of these mappings and the multicast facility are used by the kernel
to improve performance.
File Service:
The file system in V is implemented as a user-level program running on a central file
system server outside the operating system. The V system implements the mapped-file
I/O mechanism to provide fast and efficient access to files. A file OPEN request is
mapped onto a locally cached open file. Files are accessed using the standard UIO
interface. In the UIO model, I/O operations are carried out by creating UIO objects.
These objects are equivalent to open files in other systems. Operations like read, write,
query, or modify are performed on these objects. The UIO interface implements block-oriented access instead of conventional byte streams to facilitate efficient handling of file
blocks, network packets, and database records. Read and write requests on files are
carried out locally if the data is present in the cache. Otherwise, a READ request is sent
out to the server managing the original file and the addressed block is brought in and
cached.
Using the UIO operations the kernel can take advantage of the workstations’ large
physical memories. Caching files increases performance, as a local block access is even
faster than getting a block from backing store. This technique is extremely beneficial in
the case of diskless workstations, facilitating their usage and reducing the overall cost of
the system. The V file server uses large buffers of 8 Kbytes for efficiently transferring
large files over the network with minimal overhead. Files in V use a 1 Kbyte allocation
unit. Even though they are divided into 1K blocks, a contiguous allocation scheme is
used so that most files are stored contiguously on disk.
Other Services: The V system provides limited security and protection in that it assumes
that the kernel, servers, and the messages across the network cannot be tampered with.
Thus network security against intruders is not supported in the V system. The protection
scheme used is similar to the one provided by UNIX. Each user has an account name and
an associated password. A user is also given a user number. This user number is
associated with the messages sent by processes created by the user. Security and
protection is achieved using an authentication server that matches the name and the
password against an encrypted copy stored on it. On a match the server returns a success,
upon which the kernel associates the user number with that process. Thus any message
sent or received by the process from this point on would contain the user number in it.
The V system does not provide any specific reliability or fault tolerance techniques.
Software reliability is achieved by keeping the kernel small and the software modular.
5.5.4 MACH Distributed Operating System
Goals and Advantages
The MACH operating system is designed to integrate the functionality of both distributed
and parallel computing in an operating system that is binary compatible with Unix BSD.
This enables Unix users to migrate to a distributed operating system environment without
giving up convenient services. The main objectives of MACH are the ability to support
(emulate) other operating systems, support all types of communication services (shared
memory and message passing), provide transparent access to network resources, exploit
parallelism in systems and applications, and provide portability to a large collection of
machines [tane95].
System Description Overview
Mach is a micro-kernel based message-passing operating system. It has been designed to
provide a base for building new operating systems and emulating existing ones (e.g.,
UNIX, MS-Windows, etc.) as shown in Figure 5.7. The emulated operating systems run
on top of the kernel as servers or applications in a transparent manner. MACH is capable
of supporting more than one operating system environment simultaneously (e.g., Unix
and MS-DOS).
[Figure: user processes run in user space on top of an emulation layer containing 4.3 BSD, System V, HP/UX, and other emulators, which runs on the Mach microkernel in kernel space.]
Figure 5.7 MACH Distributed Operating System
The Mach kernel provides basic services to manage processes, processors, memory,
inter-process communications, and I/O services. The Mach kernel supports five main
abstractions [tane95]: processes, threads, memory objects, ports, and messages. A process
in Mach consists primarily of an address space and a collection of threads that execute in
that address space, as shown in Figure 5.8.
[Figure: a Mach process containing an address space and its threads, together with the process port, bootstrap port, exception port, and registered ports, plus per-process kernel state such as the suspend counter, scheduling parameters, emulation address, and statistics.]
Figure 5.8 Mach: Process Management
In Mach, processes are passive and are used for collecting all the resources related to a
group of cooperating threads into convenient containers. The active entities in Mach are
the threads that execute instructions and manipulate their registers and address spaces.
Each thread belongs to exactly one process. A process cannot do anything unless it has
one or more threads. A thread contains the processor state and the contents of a machine's
registers. All threads within a process share the virtual memory address space and
communications privileges associated with their process. The UNIX abstraction of a
process is simulated in Mach by combining a process and a single thread. However,
Mach goes beyond this abstraction by allowing multiple threads within a process to execute in
parallel on separate processors. Mach threads are heavyweight threads because they are
managed by the kernel.
Mach adopts the memory object concept to implement its virtual memory system. A
memory object is a data structure that can be mapped into a process’ address space. Inter-process communication is based on message passing implemented using ports. Ports are
kernel mailboxes that support unidirectional communication.
Resource Management
The main resources to be managed include processes, processors, and memory. The
Mach management of these resources is presented next.
Process Management:
A process in Mach consists of an address space and a set of threads running in the
process address space. The threads are the active components while processes are passive
and act as containers to hold all the required resources for its thread’s execution.
Processes use ports for communication. Mach provides several ports to communicate
with the kernel such as process port, bootstrap port, and exception port. The kernel
services available to processes are requested by using process ports rather than making a
system call. The bootstrap port is used for initialization when a process starts up in order
to learn the names of kernel ports that provide basic services. The exception port is used
by the system to report errors to the process. Mach provides a small number of primitives
to manage processes such as create, terminate, suspend, resume, priority, assign, info,
and threads.
Thread Management:
Threads are managed by the kernel. All the threads belonging to one process share the
same address space and all the resources associated with that process. In a uniprocessor
system, threads are time-shared; in a multiprocessor system they run concurrently. The
primary data structure used by Mach to schedule threads for execution is
the run queue. The run queue is a priority queue of threads implemented by an array of
doubly linked queues. A hint is maintained to indicate the probable location of the
highest priority thread. Each run queue also contains a mutual exclusion lock and a count
of threads currently queued. When a new thread is needed for execution, each processor
consults the appropriate run queue. The kernel maintains a local run queue for each
processor and a shared global run queue. Mach is self-scheduling in that instead of
having threads assigned by a centralized dispatcher, individual processors consult the run
queues when they need a new thread to run. A processor examines the local run queue
first to give local threads absolute preference over remote threads. If the local queue is
empty, the processor examines the global run queue. In either case, it dequeues and runs
the highest priority thread. If both queues are empty, the processor becomes idle.
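The selection rule can be sketched as follows; the run-queue structure and dequeue_highest are simplified stand-ins for the kernel's actual data structures.

    #include <stddef.h>

    struct thread;

    struct run_queue {
        /* the real queue also holds a mutual exclusion lock */
        int            count;       /* number of queued threads          */
        int            hint;        /* probable highest-priority level   */
        struct thread *levels[32];  /* array of doubly linked queues     */
    };

    extern struct thread *dequeue_highest(struct run_queue *q);

    struct thread *choose_next_thread(struct run_queue *local,
                                      struct run_queue *global)
    {
        if (local->count > 0)
            return dequeue_highest(local);   /* local threads preferred  */
        if (global->count > 0)
            return dequeue_highest(global);
        return NULL;                         /* nothing to run: go idle  */
    }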
Processor allocation:
Mach aims at supporting a multitude of applications, languages, and programming
models on a wide range of computer architectures. Consequently, the processor
allocation approach must be flexible and portable to many different platforms. The
processor allocation approach adds two new objects to the Mach kernel interface, the
processor and the processor set. Processor objects correspond to and manipulate physical
processors. Processor sets are independent entities to which processors, threads, and
processes can be assigned. A processor executes only threads that are assigned to its
processor set, and every processor and thread is always assigned to exactly one processor
set. If a processor set has no assigned processors, then threads assigned to it are suspended.
Assignments are initialized by an inheritance mechanism. Each process is also assigned
to a processor set but this assignment is used only to initialize the assignment of threads
created in that process. In turn, each process inherits its initial assignment from its parent
upon creation and the first process in the system is initially assigned to the default
processor set. In the absence of explicit assignments, every thread and process in the
system inherits the first process's assignment to the default processor set. All processors
are initially assigned to the default processor set and at least one processor must always
be assigned to it so that internal kernel threads and important daemons can remain active.
Mach processor allocation approach is implemented by dividing the responsibility for
processor allocation among the three components: application, server, and kernel as
shown in Figure 5.9.
Applications control the assignment of processes and threads to processor sets. The
server controls the assignment of processors to processor sets. The kernel does whatever
the application and server requests. In this scheme, the physical processors allocated to
the processor sets of an application can be chosen to match the application requirements.
Assigning threads to processor sets gives the application complete control over which
threads run on which processors. Furthermore, isolating scheduling policy in a server
simplifies changes for different hardware architectures and site-specific usage policies.
Figure 5.9 Processor allocation components
Memory management:
Each Mach process can use up to 4 gigabytes of virtual memory for the execution of its
threads. This space is not only used for the memory objects but also for messages and
memory-mapped files. When a process allocates regions of virtual memory, the regions
must be aligned on page boundaries. The process can create memory objects for use by
its threads and these can actually be mapped to the space of another process. Spawning
new processes is more efficient because memory does not need to be copied to the child.
The child needs only to touch the necessary portions of its parent's address space. When
spawning a child process it is possible to mark the pages to be copied or protected.
Each memory object that is mapped in a process' address space must have an external
memory manager that controls it. Each class of memory objects is handled by a different
memory manager. Each memory manager can implement its own semantics, can
determine where to store pages that are not in memory, and can provide its own rules
about what happens to objects once they have been mapped out. To map an object into a
process' address space the process sends a message to a memory manager asking it to
perform the mapping. Three ports are needed to achieve the requested mapping: object
port, control port, and name port. The object port is used by the kernel to inform the
memory manager when page faults occur and other events relating to the object. The
control port is created to enable memory managers to interact and respond to kernel
requests. The name port is used to identify the object.
The Mach external memory manager concept lends itself well to implementing a page
based distributed shared memory. When a thread references a page that it does not
possess, it creates a page fault. Eventually the page is located and shipped to the faulting
machine where it is loaded so the thread can continue execution. Since Mach already has
memory managers for different classes of objects, it becomes natural to introduce a new
memory object, the shared page. Shared pages are explicitly managed by one or more
memory managers. One possibility is to have a single memory manager that handles all
shared pages. Another is to have a different memory manager for each shared page or
collection of shared pages. This allows the load to be distributed. The shared page is
always either readable or writeable. If it is readable it may be replicated on multiple
machines. If it is writeable, only one copy exists. The distributed shared memory server
always knows the state of the shared page as well as which machine or machines it
resides on.
Inter-Process Communication
The basis of all communication in Mach is a kernel data structure called a port. A port is
essentially a protected mailbox. When a thread in one process wants to communicate
with a thread in another process the sending thread writes the message to the port and the
receiving thread takes it out. Each port is protected to ensure that only authorized
processes can send to and receive from it. Ports support unidirectional communication
and provide reliable, sequenced message streams. If a thread sends a message to a port,
the system guarantees that it will be delivered. Ports may be grouped into port sets for
convenience. A port may belong to only one port set.
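For concreteness, the sketch below shows port allocation and a simple send and receive using the Mach 3 style mach_msg interface found in modern Mach-derived kernels; it illustrates the port and message model described here and is not necessarily identical to the original Mach primitives of the time.

    #include <mach/mach.h>
    #include <mach/message.h>
    #include <string.h>

    typedef struct {
        mach_msg_header_t header;
        int               payload;
    } send_msg_t;

    typedef struct {
        mach_msg_header_t  header;
        int                payload;
        mach_msg_trailer_t trailer;   /* room for the kernel-appended trailer */
    } recv_msg_t;

    /* Create a mailbox: a port with a receive right held by this task. */
    mach_port_t make_port(void)
    {
        mach_port_t port = MACH_PORT_NULL;
        mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &port);
        return port;
    }

    /* Send one simple (port-free) message to a port we hold a send right for. */
    kern_return_t send_one(mach_port_t dest, int value)
    {
        send_msg_t msg;
        memset(&msg, 0, sizeof(msg));
        msg.header.msgh_bits        = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0);
        msg.header.msgh_size        = sizeof(msg);
        msg.header.msgh_remote_port = dest;
        msg.header.msgh_local_port  = MACH_PORT_NULL;   /* no reply expected */
        msg.header.msgh_id          = 100;
        msg.payload                 = value;
        return mach_msg(&msg.header, MACH_SEND_MSG, sizeof(msg), 0,
                        MACH_PORT_NULL, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
    }

    /* Block until a message arrives on the given port. */
    kern_return_t receive_one(mach_port_t port, recv_msg_t *msg)
    {
        return mach_msg(&msg->header, MACH_RCV_MSG, 0, sizeof(*msg),
                        port, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
    }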
A message is a string of data prefixed by a header. The header describes the message and
its destination. The body of the message may be as large as the entire address space of a
process. There are simple messages that do not contain any references to ports and non-simple messages that can reference other ports (conceptually similar to indirect
addressing). Messages are the primary way that processes communicate with each other
and the kernel. They can even be sent between processes on different computers.
Messages are not actually stored in the port itself but rather in another kernel data
structure: the message queue. The port contains a count of the number of messages
currently present in the message queue and the maximum number of messages permitted.
Since messages are actually mapped to the virtual memory resources of processes,
interprocess communication is far more efficient than UNIX implementations where
messages are copied from one process to the limited memory space of the kernel and then
to the process receiving the message. In Mach, the message actually resides in the
memory space shared by the communicating processes. Memory-mapped files facilitate
program development by simplifying memory and file operations to a single set of
operations for both. However, Mach still supports the standard UNIX file read, write,
and seek system calls.
Mach also supports communication between processes running on different CPUs by
using Network Message Servers (NMS). The NMS is a multithreaded process that
performs a variety of functions, including interfacing with local threads, forwarding
messages over the network, translating data types, providing a network-wide lookup
service, and providing authentication services.
5.5.5 X-Kernel Distributed Operating System
Goals and Advantages
The X-Kernel is an experimental operating system kernel developed at the University of
Arizona, Tucson [hutc89]. X-Kernel can be viewed as a toolkit that provides the tools and
building blocks to develop and experiment with distributed operating systems configured
to meet a certain class of applications. The main advantages of X-Kernel are its
configurability and its ability to provide an efficient environment for experimenting with
new operating systems and different communication protocols.
System Description Overview
X-Kernel is a microkernel based operating system that may be configured to support
experimentation in inter-process communication and distributed programming. The
motivation of this approach is twofold:
1) no single communication paradigm is appropriate for all applications
2) the X-Kernel framework may be used to obtain realistic performance measurements.
The X-Kernel consists of a kernel that supports memory management, lightweight
processes, and development of different communications protocols. The exact
configuration of the kernel services and the communication protocols used determines an
instance of the X-Kernel operating system. Therefore, we do not consider X-Kernel to be
a single operating system, but rather a toolkit for constructing and experimenting with
different operating systems and protocols [BORG92].
Resource Management
An important aspect of any distributed system involves its ability to pass control and data
efficiently between the kernel and user programs. In X-Kernel, transfers between user
applications and kernel space have been made efficient by having the kernel execute in
the same address space as the user process (see Figure 5.10).
[Figure: the kernel and user code share one virtual address space; the user code/data area and the kernel code/data area are shared, while each process has its own private user stack (USP) and kernel stack (KSP).]
Figure 5.10 X-Kernel and user address space
The user process can access the kernel processes efficiently because they run in the same
address space. This access requires approximately 20 microseconds on a SUN 3/75
[hutc89]. The kernel process can also access user data after properly setting the user
stack and its arguments. A kernel-to-user data access takes around 245 microseconds on a
SUN 3/75.
Inter-Process Communication:
There are three communication objects used by the kernel to handle communications:
protocols, sessions, and messages. A different protocol object is used for each protocol
type being implemented. A session object is created by a protocol object, interprets the
messages belonging to that protocol, and contains the data representing the protocol state.
The message objects are transmitted by
the protocol and session objects. There are multiple address spaces in the X-Kernel
which consist of the kernel area, user area, and stack. If more than one process exists in a
given address space the kernel area is shared but each process has its own private stack.
To communicate between processes in different address spaces or different machines the
protocol objects are used to send messages. Processes in the same address space are
synchronized using kernel semaphores. A process can execute in either user mode or
kernel mode. In kernel mode the process has access to the user and kernel areas. In user
mode, the process only has access to user information.
X-Kernel provides several routines to construct and configure a wide variety of protocols.
These routines include buffer manager, map manager, and event manager. The buffer
manager uses the heap to allocate buffers to move messages. The map manager is used
to map an identifier from a message header to capabilities used by the kernel objects.
The event manager allows a protocol to invoke procedures with a timed event.
A protocol object can create session objects and demultiplexes incoming messages to
them. Three operations that a protocol object supports are open, open_enable, and
open_done (Figure 5.11).
Figure 5.11: Protocol (a) and Session (b) objects
The open operation creates a session object on behalf of a user process, whereas
open_enable and open_done are invoked when a message arrives from the network. A
protocol object also provides a demux operation that delivers a message arriving from the
network to one of the sessions created by that protocol object. Sessions have
two operations: push and pop. Push is used by a higher session to send a message to a
lower session. The pop operation is used by the demux operation of a protocol object to
send a message to its session. As a message is passed down through the sessions, header
information is added and the message may be split into several messages; as it travels up
from the device toward the user level, several messages may be combined into a larger
one.
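The object style described above can be sketched with function-pointer structures, as below; the structures and names are simplified stand-ins for the real x-kernel interfaces.

    struct msg;      /* message object handled by the buffer manager */
    struct session;

    struct protocol {
        struct session *(*open)(struct protocol *self, void *participant);
        int             (*open_enable)(struct protocol *self, void *participant);
        int             (*demux)(struct protocol *self, struct msg *m);
    };

    struct session {
        struct session *below;                             /* next lower session */
        int (*push)(struct session *self, struct msg *m);  /* down: add a header */
        int (*pop)(struct session *self, struct msg *m);   /* up: strip a header */
        void *state;                                       /* protocol-specific data */
    };

    /* A higher-level session sends a message down the stack. */
    int send_down(struct session *s, struct msg *m)
    {
        return s->push(s, m);          /* push adds this layer's header */
    }

    /* An arriving message is demultiplexed by the protocol object, which
       hands it to the right session's pop operation. */
    int deliver_up(struct protocol *p, struct msg *m)
    {
        return p->demux(p, m);
    }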
The X-Kernel has been designed to successfully implement different protocols, and this
kernel can be used to research new protocols. Figure 5.12 demonstrates a collection of
protocol objects that are supported by the X-Kernel. It is clear from this figure that the
TCP protocol can be accessed directly by user programs. The use of the object model to
represent any protocol makes the interface clean; any protocol can access any other
protocol. For the user to access a protocol, it makes itself a protocol and then accesses the
target protocol.
Figure 5.12 X-Kernel protocol suite
Naming and File Service:
The X-Kernel file system allows its users to access any X-Kernel resource regardless of
the user location. It provides a uniform interface by implementing a logical file system
that can contain several different physical file systems. Furthermore, the logical file
system concept allows the file system to be tailored to user's requirements instead of
being tied to a machine architecture. Figure 5.13 shows a private file system tree
structure that contains several partitions (proj1, proj2, twmrc, journal, conf and original).
Each partition can be implemented using different physical file systems.
[Figure: a private file system tree whose partitions (proj1, proj2, twmrc, journal, conf, and original) contain directories such as src, bin, doc, and paper.]
Figure 5.13 File protocol configuration
The file system is logical in that it provides only directory service and relies on an
existing physical file system for the storage protocols. The Logical File System (LFS)
maps a file name to the location it can be found. The file system has two unique features.
First, each user application defines its own private file system created from the existing
physical file system. The separation of directory functions from the storage functions is
achieved through the use of two protocols: The Private Name Space (PNS) protocol that
implements the directory function, and the Uniform File Access (UFA) protocol that
implements the storage access to a given file system.
[Figure: the Logical File System (LFS) is composed of the Private Name Space (PNS) and Uniform File Access (UFA) protocols, which sit on top of storage systems such as NFS, AFS, and FTP.]
Figure 5.14 Private file hierarchy
In a manner similar to the X-Kernel logical file system, the X-Kernel command interpreter
provides uniform access to a heterogeneous collection of network services. This
interpreter differs from the services offered by typical distributed operating systems in
that it provides access not only to the resources available in the local network but also to
resources available throughout a wide area network.
5.5.6 WebOS (Operating System for Wide Area Applications)
Goals and Advantages
This operating system was developed at the University of California, Berkeley and aims
to provide a common set of operating system services to wide-area applications. The
system offers all the basic services such as global naming, cache coherent file system,
resource management and security. It simplifies the development of dynamically
reconfiguring distributed applications.
System Description Overview
The WebOS components together provide the wide-area analogue to local area operating
system services, simplifying the use of geographically remote resources. Since most of
the services are geographically distributed, client applications should be able to identify
the server that can give the best performance. In WebOS, global naming includes
mapping a single service name to multiple servers, a mechanism for load balancing
among the available servers, and maintaining enough state to fail over if a server
becomes unavailable. The above functions are accomplished with the use of Smart
Clients that extend service-specific functionality to client machines.
Wide scale sharing and replication are implemented through a cache coherent wide area
file system. WebFS is an integral part of this system. The performance, interface and
caching are comparable to the existing distributed file systems.
WebOS defines a model of trust providing both security guarantees and an interface for
authenticating the identity of principals. Fine-grained control of capabilities is provided
for remote process execution on behalf of principals. The system is responsible for
authenticating the identity of the user who requested the remote process execution, and
the execution should be as natural and productive as local operation.
Resource Management
A resource manager on each WebOS machine is responsible for job requests from remote
sites. Prior to execution, the resource manager authenticates the remote principal's identity and
determines if access rights are available. The resource manager creates a virtual machine
for process execution so that running processes do not interfere with one another.
Processes will be granted variable access to local resources through the virtual machine
depending on the privileges of the user responsible for creating the process. The local
administrator sets configuration scripts on a per-principal basis, and these scripts
determine the access rights to the local file system, network, and devices.
WebOS also uses the virtual machine abstraction as the basis for local resource allocation.
A process's runtime priority is set using the System V priocntl system call, and setrlimit is
used to set the maximum amount of memory and CPU usage allowed.
Naming
WebOS provides a useful abstraction for location independent dynamic naming. Client
applications can identify representatives of geographically distributed and dynamically
reconfiguring services using this abstraction and choose the appropriate server based on
load conditions and end-to-end availability.
In order to provide the functionalities mentioned above, first each service name is
mapped to a list of replicated representatives providing the service. Then the server
expected to give the best performance is selected; the choice is dynamic and non-binding.
Enough state is maintained to recover from failure due to unavailability of a service
provider.
Naming in WebOS is in the context of HTTP service accessed through URLs. Ideally,
users refer to a particular service with a single name and the system translates the name
to the IP address of the replica that will provide the best service. The selection decision is
based on factors such as load on each server, traffic conditions of the network and client
location. The name translation is performed by loading application- and server-specific
code into end clients. Extensions of server functionality can be dynamically loaded onto
the client machine; these extensions are distributed as Java applets.
Two cooperating threads make up the architecture of a Smart Client. A customizable
graphical interface thread implements the user view of the service and a director thread
performs load balancing among the representative servers and maintains state to handle
failures. The interface and director threads are extensible according to the service. The
load balancing algorithm uses static state information such as available servers, server
capacity, server network connectivity, server location and client location as well as load
information piggy-backed with some percentage of server responses. The client then
chooses a server based on static information biased by server load. Inactive clients must
initially use only static information, as the load information would have become stale.
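A director thread's choice might look like the sketch below; the fields and the scoring formula are illustrative assumptions, not the actual Smart Client algorithm.

    struct server_info {
        const char *address;
        double      capacity;        /* static: relative server capacity      */
        double      distance;        /* static: network distance to client    */
        double      load;            /* dynamic: last piggy-backed load value */
        int         reachable;
    };

    const char *choose_server(const struct server_info *servers, int n)
    {
        const char *best = 0;
        double best_score = -1.0;
        int i;

        for (i = 0; i < n; i++) {
            if (!servers[i].reachable)
                continue;
            /* Higher capacity and lower distance/load give a higher score. */
            double score = servers[i].capacity /
                           ((1.0 + servers[i].distance) * (1.0 + servers[i].load));
            if (score > best_score) {
                best_score = score;
                best = servers[i].address;
            }
        }
        return best;   /* null if no representative is reachable */
    }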
File Service
WebOS provides a global cache coherent, consistent and secure file system abstraction
that greatly simplifies the task of application programmers. The applications in a wide
area network are diverse and may require different variations in the abstraction. Some
applications may require strong cache consistency, while others are more concerned with
reducing overhead and delay. WebFS allows applications to influence
the implementation of certain key abstractions depending on their demands. A list of
user-extensible properties is associated with each file to extend basic properties such as
owner and permissions, cache consistency policy, prefetching and cache replacement
policy and encryption policy. WebFS uses a URL-based namespace, and the WebFS
daemon uses HTTP for access to standard web sites. This provides backward
compatibility with existing distributed applications.
Figure 5.15 Graphical Illustration of the WebFS Architecture
WebFS is built at the UNIX vnode layer with tight integration to the virtual memory
system for fast cached accesses. The WebFS system architecture consists of two parts: a
user-level daemon and a loadable vnode module. A read of a file in the WebFS
namespace proceeds as follows. Initially, the user-level daemon spawns a thread that
makes a system call intercepted by the WebFS vnode layer; the layer puts that thread to
sleep until work becomes available for it. When an application makes a read system call
on a WebFS file, the operating system translates the call into a vnode read operation. The
vnode operation checks whether the required page is cached in virtual memory. If the
page is not found in virtual memory, one of the sleeping threads is woken up, and the
user-level daemon is then responsible for retrieving the page by contacting a remote
HTTP or WebFS daemon (see Figure 5.15).
Once the required page is found, the WebFS daemon makes a WebFS system call. The
retrieved page is cached for fast access in the future. Presence of multiple threads ensures
concurrent file access. The advantages offered by this system include improved
performance due to caching, global access to the HTTP namespace through the vnode
layer, and easy modification of the user-level daemon.
The cache consistency protocol for traditional file access in WebFS is “Last Writer
Wins” (Figure 5.16). An IP multicast-based update/invalidate protocol is used for widely
shared, frequently updated data files. Multicast support can also be useful in the context
of Web browsing and wide-scale shared access in general. Use of multicast to deliver
updates can improve client latency while simultaneously reducing server load.
Figure 5.16 Implementation of “Last Writer Wins”
Security
A wide area security system provides fine-grained transfer of rights between principals in
different administrative domains. The security abstraction of WebOS transparently
enables such rights transfer. The security system is called CRISIS and it implements the
transfer of rights with the help of lightweight and revocable capabilities called transfer
certificates. These are signed statements granting a subset of the signing principal's
privileges to a target principal. All CRISIS certificates must be signed and countersigned
by authorities trusted by both the service provider and the consumer. Stealing keys is
extremely difficult as it involves subverting two separate authorities. Transfer certificates
can be revoked before timeout in the case of stolen keys (see Figure 5.17).
Each CRISIS node runs a security manager, which controls the access to local resources
and maps privileges to security domains. A security domain is created for each login
session containing the privileges of the principal who successfully logged in. CRISIS
associates names with a specific subset of principals' privileges and these are called roles.
A user creates a role by generating an identity certificate containing a new public/private
key pair and a transfer certificate that describes the subset of the principal's privileges
transferred to the role.
For authorization purposes, CRISIS maintains access control lists (ACL) to associate
principals and groups with the privileges granted to them. File ACLs contain permissions
given to principals for read, write or execute. Process execution ACLs contain the list of
principals permitted to run jobs on the given node. A reference monitor verifies all the
certificates for expiry and signatures and reduces all certificates to the identity of single
principals. The reference monitor checks the reduced list of principals against the
contents of the object ACL, granting authorization if a match is found.
Figure 5.17 Interaction of CRISIS with Different Components
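The sketch below suggests the shape of a transfer certificate and the reference monitor's check; the field names are assumptions, and the verification of signatures and counter-signatures is abstracted away.

    #include <time.h>
    #include <string.h>

    struct transfer_cert {
        const char *grantor;   /* principal whose privileges are delegated */
        const char *grantee;   /* role or principal receiving them         */
        unsigned    rights;    /* subset of the grantor's privileges       */
        time_t      expires;
        int         revoked;
        /* signature and counter-signature by trusted authorities omitted  */
    };

    struct acl_entry { const char *principal; unsigned rights; };

    /* The reference monitor reduces the certificate to a single principal
       and then checks the object's ACL for the requested rights. */
    int authorized(const struct transfer_cert *c, unsigned wanted,
                   const struct acl_entry *acl, int acl_len, time_t now)
    {
        int i;
        if (c->revoked || now > c->expires || (c->rights & wanted) != wanted)
            return 0;
        for (i = 0; i < acl_len; i++)
            if (strcmp(acl[i].principal, c->grantor) == 0 &&
                (acl[i].rights & wanted) == wanted)
                return 1;                 /* match found: grant access */
        return 0;
    }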
5.5.7 2K Operating System
Goals and Advantages
2K aims to offer distributed operating system services addressing the problem of
heterogeneity and dynamic adaptability. One of the important goals of this system is to
provide dynamic resource management for high-performance distributed applications. It
has a flexible architecture, enabling creation of dynamic execution environments and
management of dependencies among the various components.
System Description Overview
The 2K distributed operating system is being researched at the University of Illinois at
Urbana-Champaign. 2K provides an integrated environment to support dynamic
instantiation, heterogeneous CPUs, and distributed resource management. 2K operates as
configurable middleware and does not rely solely on the TCP/IP communications
protocol. Instead, it provides a dynamically configurable reflective ORB offering
CORBA compatibility. The heart of this project is an adaptable microkernel. This
microkernel provides "What you need is what you get" (WYNIWYG) support: only
those objects needed by an application are loaded. Resource management is also object
based. All elements are represented as CORBA objects, and each object has a network-wide
identity (similar to Sombrero). Upon dynamic configuration, the objects that constitute
a service are assembled by the microkernel. Applications have access to the system's
dynamic state once they have negotiated a connection. Each system node executes a
Local Resource Manager (LRM). Naming is CORBA compliant, and object access is
restricted to controlled CORBA interfaces.
Figure 5.18 2K Overall Architecture
Resource Management Service
The Resource Management Service is composed of a collection of CORBA servers. The
services offered by this system are:
• Maintaining information about the dynamic resource utilization in the distributed
system
• Locating the best candidate machine to execute a certain application or
component based on its QoS prerequisites
• Allocating local resources for particular applications or components
Local Resource Managers (LRMs) present in each node of the distributed system are
responsible for exporting the hardware resources of a particular node to the whole
network. The distributed system is divided into clusters, and a Global Resource Manager
(GRM) manages each cluster. LRMs send periodic updates to the GRM on the
availability of their resources (see Figure 5.19). The GRM performs QoS-aware load
distribution in its cluster based on the information obtained from the LRMs. Efforts are
underway to combine GRMs across clusters, which will provide hardware resource
sharing over the Internet. Although LRMs check the state of their local resources
frequently (e.g., every ten seconds), they only send this information to the GRM when
there has been a significant change in resource utilization since the last update or when a
certain time has passed since the last update was sent. In addition, when a machine leaves
the network, its LRM deregisters itself from the GRM database. If the GRM does not
receive an update from an LRM for a certain time interval, it assumes that the machine
with that LRM is inaccessible.
Figure 5.19 Resource Management Service
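The LRM update rule can be sketched as follows (a minimal sketch; the change threshold, timeout, and method names are illustrative assumptions, not taken from the 2K implementation):

    import time

    class LocalResourceManager:
        """Sketch of when an LRM pushes a state update to its GRM: on a significant
        change in utilization, or when too much time has passed since the last update."""

        def __init__(self, grm, change_threshold=0.10, max_silence=60.0):
            self.grm = grm
            self.change_threshold = change_threshold   # e.g. a 10% utilization change
            self.max_silence = max_silence             # e.g. force an update every 60 s
            self.last_sent_util = None
            self.last_sent_time = 0.0

        def check_resources(self, current_util):
            """Called periodically (e.g. every ten seconds) with utilization in [0, 1]."""
            now = time.time()
            changed = (self.last_sent_util is None or
                       abs(current_util - self.last_sent_util) >= self.change_threshold)
            silent_too_long = (now - self.last_sent_time) >= self.max_silence
            if changed or silent_too_long:
                self.grm.update(self, current_util)    # hypothetical GRM interface
                self.last_sent_util = current_util
                self.last_sent_time = now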
The LRMs are also responsible for tasks such as QoS-aware admission control, resource
negotiation, reservation, and scheduling of tasks in the individual nodes. These tasks are
accomplished with the help of a Dynamic Soft Real-Time Scheduler, which runs as a
user-level process in conventional operating systems. The system's low-level real-time
API is utilized to provide QoS guarantees to applications with soft real-time requirements.
2K uses a CORBA trader to supply resource discovery services. Both the LRM and the
GRM export an interface that lets clients execute applications (or components) in the
distributed system. When a client wishes to execute a new application, it sends a request
with the QoS specifications to the local LRM. The LRM checks whether the local
machine has enough resources to execute the application. If not, it forwards the request to
the GRM, which uses its information about resource utilization in the distributed
system to select a machine capable of executing that application. The request is then
forwarded to the LRM of the selected machine. That LRM tries to
allocate the resources locally; if it is successful, it sends a one-way ACK message to the
client LRM. If it is not possible to allocate the resources on that machine, it sends a
NACK back to the GRM, which then looks for another candidate machine. If the GRM
exhausts all the possibilities, it returns an empty offer to the client LRM.
When the system finally locates a machine with the proper resources, it creates a new
process to host the application. The Automatic Configuration Service fetches all the
necessary components from the Component Repository and dynamically loads them into
that process.
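The negotiation between the client LRM, the GRM, and the candidate LRMs might be sketched like this (hypothetical method names; the real service uses CORBA interfaces rather than direct Python calls):

    def execute_application(client_lrm, grm, app, qos):
        """Sketch of 2K-style QoS-aware placement: try locally, otherwise ask the GRM
        to probe candidate machines until one admits the request."""
        if client_lrm.can_admit(qos):                    # enough local resources?
            return client_lrm.run(app, qos)

        for candidate_lrm in grm.candidates(qos):        # GRM ranks machines by utilization
            if candidate_lrm.try_reserve(qos):           # remote admission control
                candidate_lrm.ack(client_lrm)            # one-way ACK to the client LRM
                return candidate_lrm.run(app, qos)
            grm.note_rejection(candidate_lrm)            # NACK: try the next candidate

        return None                                      # empty offer: no machine found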
Automatic Configuration Service
The Automatic Configuration Service aims to automate the process of software maintenance
(Figure 5.20). Its objective is to provide network-centrism and to implement the
"What You Need Is What You Get" (WYNIWYG) model. All the network resources, users,
software components, and devices in the network are represented as distributed objects.
Each entity has a network-wide identity, a network-wide profile, and dependencies on
other network entities. When a particular service is configured, the entities that constitute
that service are assembled dynamically.
A single network-wide account and a single network-wide profile for a particular user are
the highlights of the network-centric model. A user's profile is accessible throughout the
distributed system. The middleware is responsible for instantiating user environments
dynamically according to the user's profile, role, and the underlying platform.
In the What You Need Is What You Get (WYNIWYG) model, the system configures itself
automatically, and only the essential components required for the efficient execution of
the user's application are loaded. The components are downloaded from the network, so
only a small subset of system services is needed to bootstrap a node. This model offers
an advantage over existing operating systems and middleware, as it does not carry
along unnecessary modules that are not needed for the execution of the specified user
application.
Each application, system, or component has specific hardware and software requirements,
and the collection of these requirements is called the Prerequisite Specifications or, simply,
the Prerequisites. The Automatic Configuration Service must also take care of the
dynamic dependencies between the various components at runtime. CORBA
objects called Component Configurators store these dependencies as lists of CORBA
Interoperable Object References (IORs) linking to other Component Configurators,
forming a dependence graph of distributed components.
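A rough sketch of a Component Configurator as a node in such a dependence graph (plain Python classes standing in for the CORBA objects and IOR lists used by 2K):

    class ComponentConfigurator:
        """Sketch: each component records which components it depends on and which
        components depend on it; in 2K these links are lists of CORBA IORs."""

        def __init__(self, name):
            self.name = name
            self.depends_on = []      # components this one needs
            self.clients = []         # components that need this one

        def add_dependency(self, other):
            self.depends_on.append(other)
            other.clients.append(self)

        def prerequisites(self, seen=None):
            """Transitively collect everything that must be loaded before this component."""
            seen = seen if seen is not None else set()
            for dep in self.depends_on:
                if dep.name not in seen:
                    seen.add(dep.name)
                    dep.prerequisites(seen)
            return seen

    # Example graph: a video player needing a decoder, which needs a codec library.
    player, decoder, codec = (ComponentConfigurator(n) for n in ("player", "decoder", "codec"))
    player.add_dependency(decoder)
    decoder.add_dependency(codec)
    print(player.prerequisites())   # {'decoder', 'codec'} (set order may vary)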
Figure 5.20 Automatic Configuration Framework
Dynamic Security
CORBA interfaces are used for gaining access to the services offered by 2K. The OMG
standard Security Service incorporates authentication, access control, auditing, object
communication encryption, non-repudiation, and administration of security information.
The CORBA Security Service is implemented in 2K using the Cherubim security
framework to support dynamic security policies. Security policies vary depending on the
prevailing system conditions, and this dynamic reconfiguration is introduced with the help
of reflective ORBs. The implementation supports various access control models. The
flexibility provided by the security system in 2K is helpful to different kinds of
applications.
Reflective ORB
dynamicTAO, a CORBA-compliant reflective ORB, offers on-the-fly reconfiguration of
the ORB's internal engine and of the applications running on it. The dynamic dependencies
between ORB components and application components are represented using
ComponentConfigurators in dynamicTAO. The various subsystems that control security,
concurrency, and monitoring can be dynamically configured with the support of
dynamicTAO. An interface is provided for loading and unloading modules into the system
at runtime and for changing the ORB configuration state. Because dynamicTAO is
heavyweight and consumes a substantial amount of resources, it is inappropriate for
environments with resource constraints. A new architecture, known as LegORB, was
developed for the 2K system, enabling adaptability to resource constraints and to a wide
range of applications. LegORB occupies less memory and is suited to small devices
as well as high-performance workstations.
Web Operating System (WOS)
Goals and Advantages
WOS was designed to tackle the heterogeneity and dynamism present in the Web
and the Internet. WOS uses different versions of a generic service protocol rather than a
fixed set of operating system services. There are specialized versions of the generic
protocol, developed for parallel/distributed applications and for satisfying the
high-performance constraints of an application.
System Description Overview
To cope with the dynamic nature of the Internet, different versions of the Web Operating
System are built based on demand-driven configuration techniques. Service classes suited
to specific user needs are implemented. Communication with these service classes is
established with the help of specific instances of the generic service protocol (WOSP),
called versions of WOSP. A specific version of WOSP represents the service class it
supports. Adding a service class to a WOS node is independent of the presence of
other service classes and does not require any re-installation. The different protocol
versions present on a node constitute the local resources of that node. Service classes are
added or removed dynamically based on user demands. Resource information is stored in
distributed databases called warehouses. Each WOS node contains a local warehouse
housing information about local as well as remote resources. The entire set of WOS
nodes constitutes the WOSNet or WOSspace.
Resource Management
This system manages resources by adopting a decentralized approach to resource
discovery and allocation. Allocation also involves the software resources required for a
service. Users must be registered in the WOSNet, and a machine's hardware platform,
operating system, programs, and other resources must be declared for public access
before other constituents can use them. In order to reduce the overhead involved in
conducting searches and resource trading, statistical methods are used to define a standard
user space. This information includes the typical processes started by a user and the hosts
preferred by the user to execute the services.
The local system examines the requirements of a user job and determines whether it can be
executed with local resources. In the case of inadequate local resources, these requests
are sent to the user resource control unit. This unit looks up the standard user space to
fulfill the request, with due consideration for load sharing. If the standard user space
proves to be insufficient, a search is launched by the Search Evaluation Unit, which also
evaluates the results of the search.
Architecture
The services offered by WOS are accessed through a user interface, which also displays
the results of execution. The Host Machine Manager performs the task of handling service
requests, answering queries regarding resource availability, and taking care of service
execution. The User Manager, with the help of knowledge obtained from the local
warehouse, allocates and requests the resources needed by a service.
Communication Protocols
The discovery/location protocol (WOSRP) is responsible for the discovery of the
available versions of WOSP. It locates the nodes supporting a specific version and
connects to the WOS nodes that implement that version. The versions of WOSP
differ only in the semantics they convey, and hence a common syntax can be used for the
purpose of transmission. The WOSP parser is responsible for syntax conversion. The
WOSP Analyzer module is configured to support various versions of WOSP, and
information from the local warehouse is used to access a particular instance of the
Analyzer (Figure 5.21). Figure 5.22 represents the various functional layers.
Figure 5.21 Architecture of a WOS Node
Figure 5.22 Services for Parallel/Distributed (PD) Applications and High Performance Computing
A specialized service class serves PD applications; this specific version of WOSP is known
as the Resource Discovery WOS Protocol (RD-WOSP). The PD application is assumed
to be split into modules with separate performance and resource requirements for each
module. The execution of each module is assigned to a WOS node capable of providing
the necessary resources for that module. The PD application is thus mapped to a virtual
machine consisting of many WOS nodes executing the various modules of the
application. The services are invoked through the user interface or by calling the
appropriate routines. The arguments to the routines are the service requested and the
identity of the application requiring the service. The service routines available for
RD-WOSP are:
Discovery Service Routine: This routine locates the set of WOS nodes with the ability to
satisfy the resource requirements of the application.
Reservation Service Routine: This routine returns a value based on whether the
reservation was granted or not.
Setup Service Routine: The value returned by this routine is true if the setup of a module
was successful and false if unsuccessful.
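The three routines can be summarized as an interface sketch (the signatures below are illustrative only; WOSP is a wire protocol, not a Python API):

    from typing import List

    class RDWOSPClient:
        """Illustrative interface for the RD-WOSP discovery/reservation/setup routines."""

        def discover(self, app_id: str, requirements: dict) -> List[str]:
            """Locate the set of WOS nodes able to satisfy the module's requirements."""
            raise NotImplementedError

        def reserve(self, app_id: str, node: str, requirements: dict) -> bool:
            """Return whether the reservation was granted on the chosen node."""
            raise NotImplementedError

        def setup(self, app_id: str, node: str, module: str) -> bool:
            """Return True if the module was set up successfully, False otherwise."""
            raise NotImplementedError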
The high-performance constraints of an application are met with the help of a service
class called High Performance WOSP (HP-WOSP). Like RD-WOSP, it also offers
services for discovery, reservation, and setup. HP-WOSP is essentially an extension of
RD-WOSP. An HP application is decomposed into a granularity tree with its root
representing the entire application and the leaves representing atomic sequential
processes. Each vertex of the tree can be treated as a module and has its own high-performance
constraints in terms of bandwidth, latency, and CPU requirements. The HP-
WOSP discovery service identifies a set of WOS nodes possessing the required resources
to execute a subset of the granularity tree vertices. Factors such as load balancing and
network congestion are considered while selecting the appropriate nodes. Thus, the Web
Operating System acts as a meta-computing tool for supporting PD and High
Performance applications.
Trends in Distributed Operating Systems
In this section, we explore the future of distributed operating systems. We do this by
examining distributed operating systems currently being researched.
SOMBRERO
The first system we look at is Sombrero, being researched at Arizona State University.
The underlying concept is a Very Large Single Address Space distributed operating
system. In the past, address spaces were limited to those that could be addressed by 32-bit
registers (4 GB). With newer computers, we now have the ability to manipulate 64-bit
addresses. In a 64-bit address space, we could generate a 32-bit (4 GB) address space
object once a second for 136 years and still not run out of unique addresses. With this
much available address space, a system-wide set of virtual addresses is now possible. A
virtual address is permanently and uniquely bound to every object. This address spans
all levels of storage and is directly manipulated by the CPU; no address translation is
required. All physical storage devices are viewed as caches for the contents of virtual
objects. By utilizing a single address space, overhead is reduced for inter-process
communication and file system maintenance and access. Furthermore, we eliminate the
need for multiple virtual address spaces. All processor activities are distributed and
shared by default. This provides for transparent inter-process communication and unified
resource management. Threads may be migrated with no changes to their address space.
Security is provided by restricting access on a per-object basis.
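The 136-year figure follows from simple arithmetic (a worked check, not additional material from the source):

    2^64 / 2^32 = 2^32 disjoint 4 GB (32-bit) regions
    2^32 seconds = 4,294,967,296 s, which is roughly 136 years

so allocating one such region per second exhausts the space only after about 136 years.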
Network Hardware IS the O.S.
The next system is being researched at the University of Madrid. Their unique approach
is to develop an adaptable and flexible distributed system where the network hardware is
considered the operating system. They begin with a minimal adaptable distributed
microkernel. Their goal is to "build distributed-microkernel based operating systems
instead of microkernel based distributed systems." The entire network is considered
exported and multiplexed hardware instead of a set of isolated entities. "Normal"
microkernels multiplex local resources only, while these adaptable microkernels multiplex
local and remote resources. The only abstraction is the shuttle, which is a program counter
and its associated stack pointer. A shuttle can be migrated between processors.
Communication is handled by portals, a distributed interrupt line that behaves like active
messaging. Portals are unbuffered. The user determines whether communication is
synchronous, asynchronous, or RPC. While this system does not utilize a single address
space, physical addresses may refer to remote memory locations. This is allowed by
using a distributed software translation look-aside buffer (TLB).
Virtually Owned Computers
The last operating system is the Virtually Owned Computer being researched at the
University of Texas at Austin. Each user in this system owns an imaginary computer, a
virtual computer. The virtual computer is only a description of the resources the user is
entitled to and may not correspond to any real physical hardware components. The virtual
computer consists of a CPU and a scheduling algorithm. Each user is promised a given
quality of service or expected level of performance. Services received by the user are
independent of the actual execution location.
Summary
In this chapter, we reviewed the main issues that should be considered for designing and
evaluating distributed operating systems. These design issues are: system model and
architecture, inter-process communications, resource management, name and file services,
security, and fault tolerance. We have also discussed how some of these issues are
implemented in representative distributed operating systems such as Locus, Amoeba, V,
Mach, and X-kernel. The future of distributed computing appears to include large single
address spaces, unique global identifiers for all objects, and distributed adaptable microkernels.
Distributed operating systems have also been characterized with respect to their ability to
provide parallel and distributed computing over a Network of Workstations (NOW). In
such an environment, one can compare distributed systems in terms of their ability to
provide remote execution, parallel processing, design approach, compatibility, and fault
tolerance [keet95]. Table 5.1 shows a comparison between previous distributed
operating systems when they are used in a NOW environment [keet95].
Table 5.1 Previous Work in Distributed Operating Systems (NOW Retreat) [keet95]. The
table compares GLUNix, Accent, Amber, Amoeba, Butler, Charlotte, Clouds, Condor,
Demos/MP, Eden, NetNOS, Locus, Mach, NEST, Newcastle, Piranha, Plan 9, Sidle,
Spawn, Sprite, V, and VaxCluster in terms of remote execution (RX, TR, Mi), parallel
jobs (PJ, JC, GS, DR), design (IR, DC), compatibility (UL, EA, HP), and fault
tolerance (FT, CK).
References
[Dasg91] Dasgupta, LeBlanc, Ahamad, and Ramachandran, "The Clouds Distributed
Operating System," Computer, November 1991, p. 34.
[Fort85] Fortier, Paul J., Design and Analysis of Distributed Real-Time Systems,
McGraw-Hill Book Company, 1985, p. 103.
[Anan91] Ananda, A. L. and Srinivasan, B., "Distributed Operating Systems," in
Distributed Computing Systems: Concepts and Structures, 1991, pp. 133-135.
[Mull87] Mullender, S. J., "Distributed Operating Systems," Computer Standards &
Interfaces, Vol. 6, 1987, pp. 37-44.
[Walk83] Walker, B., Popek, G., English, R., Kline, C., and Thiel, G., "The LOCUS
Distributed Operating System," Proceedings of the 1983 SIGOPS Conference, 1983,
pp. 49-70.
[Mull90] Mullender, S. J., van Rossum, G., Tanenbaum, A. S., and van Renesse, R.,
"Amoeba: A Distributed Operating System for the 1990s," Computer, January 1990,
pp. 44-51.
[Rashid86] Rashid, Richard, "Threads of a New System," UNIX Review, August 1986.
[Fitz86] Fitzgerald, Robert, et al., "The Integration of Virtual Memory Management and
Interprocess Communication in Accent," ACM Transactions on Computer Systems,
May 1986.
[Teva89] Tevanian, Avadis, Jr., et al., "Mach: The Model for Future UNIX," BYTE,
November 1989.
[Rash89] Rashid, Richard, "A Catalyst for Open Systems," Datamation, May 1989.
[Blac90] Black, David L., "Scheduling Support for Concurrency and Parallelism in the
Mach Operating System," IEEE Computer, May 1990.
[Byan91] Bryant, R. M., et al., "Operating System Support for Parallel Programming on
RP3," IBM Journal of Research and Development, Vol. 35, No. 5/6,
September/November 1991.
[Andr92] Tanenbaum, Andrew S., Modern Operating Systems, Prentice Hall, NJ, 1992.
[Hutc89] Hutchinson, N. C., Peterson, L. L., Abbott, M. B., and O'Malley, S., "RPC in the
x-kernel: Evaluating New Design Techniques," Proceedings of the Twelfth Symposium
on Operating Systems Principles, ACM Operating Systems Review, 23(5), 1989,
pp. 91-101.
Vahdat, A., Anderson, T., Dahlin, M., Culler, D., Belani, E., Eastham, P., and
Yoshikawa, C., "WebOS: Operating System Services for Wide Area Applications,"
Proceedings of the Seventh IEEE Symposium on High Performance Distributed
Computing, July 1998.
Vahdat, A., Eastham, P., and Anderson, T., "WebFS: A Global Cache Coherent File
System," Technical Draft, December 1996.
Belani, E., Vahdat, A., Anderson, T., and Dahlin, M., "The CRISIS Wide Area Security
Architecture," Proceedings of the 1998 USENIX Security Symposium, January 1998.
Kropf, P. G., "Overview of the Web Operating System Project," High Performance
Computing Symposium '99, The Society for Computer Simulation International, San
Diego, CA, April 1999, pp. 350-356.
Abdennadher, N., Babin, G., and Kropf, P. G., "A WOS-Based Solution for High
Performance Computing," IEEE CCGRID 2001, Brisbane, Australia, May 2001,
pp. 568-573.
Kon, F., Campbell, R., Mickunas, M. D., Nahrstedt, K., and Ballesteros, F. J., "2K: A
Distributed Operating System for Dynamic Heterogeneous Environments," 9th IEEE
International Symposium on High Performance Distributed Computing, Pittsburgh,
August 1-4, 2000.
Kon, F., Yamane, T., Hess, C., Campbell, R., and Mickunas, M. D., "Dynamic Resource
Management and Automatic Configuration of Distributed Component Systems,"
Proceedings of the 6th USENIX Conference on Object-Oriented Technologies and
Systems (COOTS 2001), San Antonio, Texas, January 2001.
Chapter 5
Architectural Support for
High-Speed Communications
5.1 Introduction
With the development of high speed optical fibers, network speeds are shifting towards
gigabits per second. Given this bandwidth, the slowest part of computer communication
is no longer physical transmission. In order to effectively utilize the bandwidth
available in high speed networks, computers must be capable of switching and routing
packets at extremely high speeds. This has moved the bottleneck from physical
transmission to protocol processing. There are several challenging research issues that
need to be addressed in order to solve the protocol processing bottleneck so that high
performance distributed systems can be designed. Some of these issues are outlined
below:
1. The host-to-network interface imposes excessive overhead in the form of processor
cycles, system bus capacity, and host interrupts. Data moves at least twice
over the system bus [15]. The host is interrupted by every received packet.
With bursty traffic, the host barely has time to do any computation
while receiving packets. Further, with the synchronous send/receive model, the sender
blocks until the corresponding receive is executed, so there is no overlap between
computation and communication. In the asynchronous send/receive model,
messages transmitted from the sender are stored in the receiver's buffer until they
are read by the host on the receiving side. Since the operating system is involved
in message receiving, the asynchronous send/receive model also brings
heavy overhead. Therefore, the interface should be designed to off-load
communication tasks from the host [15], to reduce the number of times data is
copied, and to increase the overlap of computation and communication as much
as possible.
2. Conventional computer networks use shared medium architectures (ring or bus).
With the increasing number of users and the number of applications that have
intensive communication requirements, shared medium architectures are unlikely
to support low-latency communications. Switch-based architectures, which
allow several message passing activities to exist simultaneously in the network,
should be considered as a potential alternative to replace the shared medium
architectures [19].
3. A necessary condition for low-latency communication in parallel and distributed
computing is that fine-grain multiplexing be supported efficiently when time-multiplexing
network resources. A typical method of achieving fine-grain multiplexing is to
split each message into small, fixed-size data cells, as in ATM, where messages are
transmitted and switched as a sequence of 53-byte cells, as discussed in Chapter 3
(a short segmentation sketch is given after this list).
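To make the cell splitting concrete, the following is a minimal sketch (purely illustrative; the real ATM cell header format and AAL segmentation rules are more involved than the toy header used here):

    CELL_SIZE = 53        # bytes per ATM cell
    HEADER_SIZE = 5       # simplified: real headers carry VPI/VCI, PTI, HEC, etc.
    PAYLOAD_SIZE = CELL_SIZE - HEADER_SIZE   # 48 bytes of payload per cell

    def segment(message: bytes, vci: int) -> list:
        """Split a message into fixed-size 53-byte cells (toy header: just the VCI)."""
        cells = []
        for offset in range(0, len(message), PAYLOAD_SIZE):
            payload = message[offset:offset + PAYLOAD_SIZE]
            payload = payload.ljust(PAYLOAD_SIZE, b"\x00")   # pad the last cell
            header = vci.to_bytes(HEADER_SIZE, "big")        # toy 5-byte header
            cells.append(header + payload)
        return cells

    cells = segment(b"A" * 100, vci=42)
    assert all(len(c) == CELL_SIZE for c in cells) and len(cells) == 3   # 100/48 -> 3 cells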
Currently, the techniques that have been proposed to address these problems focus on
improving one or more of the main components (networks, protocols, and
network interfaces) of the communication layer of the distributed system reference
model. Faster networks can be achieved by using high speed communication lines
(e.g., fiber optics) and high speed switching devices (e.g., ATM switches). High speed
communication protocols can be developed by a combination of one or more of three
techniques: developing new high speed protocols, improving existing protocol structure,
and implementing protocol functions in hardware. Faster network interfaces can
improve the performance of the communication system by off-loading protocol tasks
(e.g., data copying and protocol processing) from the host and running them
instead on high speed adapter boards. Figure 5.1 summarizes the techniques that
have been proposed in the literature to improve the communications subsystems.
Figure 5.1: Research Techniques in High-Speed Communications. (The figure groups the
proposed techniques into three branches: high-speed networks, covering LINs, LANs,
MANs, and WANs built on technologies such as HIPPI, ATM, FDDI, DQDB, SMDS,
Frame Relay, and Fiber Channel; high-speed protocols, obtained through new protocols
such as VMTP, XTP, and NETBLT, through improved protocol structure, or through
hardware implementations such as the Protocol Engine; and high-speed network
interfaces and switches, covering high-speed switching fabrics and high-speed adapter
boards such as the CAB, NAB, and HOPS.)
In this chapter we focus on architectural support for a high-speed communication
system that has a significant impact on the design of high performance distributed
systems. We discuss important design techniques for high speed switching devices
and host network interfaces, and hardware implementations of standard transport
protocols.
5.2 High Speed Switching
An important component of any computer network is the interface communication
processor, which is also referred to as a switch or fabric. The function of the switch
is to route and switch packets or messages floating in the network. In this section,
we will address the design architectures of a high speed ATM switch. The design
of ATM packet switches that are capable of switching relatively small packets at
rates of 100,000 to 1,000,000 packets per second per line is a challenging task. There
have been a number of ATM switch architectures proposed, many of which will be
discussed in the following sections. Also, the various performance characteristics and
implementation issues will be discussed for each architecture.
Architectures for ATM switches can be classified into three types: shared memory,
shared medium, and space-division. An ATM packet switch can be viewed as a black
box with N inputs and N outputs, which routes packets received on its N inputs to its N
outputs based upon the routing information stored in the header of the packet. For
the switches covered in the following sections, the following assumptions are made:
• All switch inputs and outputs have the same transmission capacity (V bits/s).
• All packets are the same size.
• Arrival times of packets at the switch inputs are time-synchronized, and thus
the switch can be considered a synchronous device.
5.2.1 Shared Memory Architectures
Conceptually, a shared memory ATM switch architecture consists of a set of N inputs,
N outputs, and a dual-ported memory, as shown in Figure 5.2. Packets received on
the inputs are time multiplexed into a single stream of data and written to packet
first-in, first-out (FIFO) queues formed in the dual-ported memory. Concurrently,
a stream of packets is formed by reading the packet queues from the dual-ported
memory. The packets forming this data stream are demultiplexed and written to the
set of N outputs.
Shared memory architectures are inherently free of internal blocking, which is characteristic
of many Banyan-based and space-division-based switches. Shared memory
architectures are not free from output blocking, though. It is possible during any
time slot that two packets input to the switch may be destined for the same output.
Figure 5.2: Shared Memory Switch Architecture (N inputs, N outputs, control, and shared
memory; transfer rate of each input and output = V bits/second)
Because of this probability, and because the data rate of the switch outputs is commonly
the same as that of the switch inputs, there must be buffers for each output or packets
may be lost. The output buffer for the shared memory switch is the shared memory
itself: the shared memory acts as N virtual output FIFO memories. The required
size of each of the output buffers is derived from the desired packet loss probability
for the switch. By modeling the expected average and peak loads seen by the switch,
the buffer size can be calculated to ensure that the desired packet loss probability is met.
The shared memory concept is simple but suffers from practical performance limitations.
The potential bottleneck in the implementation of this architecture is the
bandwidth of the dual-ported memory and its control circuitry. The control circuitry,
which directs incoming packets into the virtual FIFO queues, must be able to determine
where to direct N incoming packets. If the control circuitry cannot keep up with the
flow of incoming packets, packets will be lost. Additionally, the bandwidth of the
dual-ported memory must be large enough to allow for data transfers of N packets
times the data rate V of the ATM switch ports for both the input and output. Given
these criteria, the memory bandwidth must be 2NV bits/sec. The number of ports
and the port speeds of a switching module using a shared memory architecture are
bounded by the bandwidth of the shared memory and the control circuitry used in
its implementation.
Construction of switches larger than what can be built given the hardware limitations can
be done by interconnecting many shared memory switching modules in a multistage
configuration. A multistage configuration trades switch packet latency for a greater
number of switch ports.
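As a numerical illustration (the port count and line rate below are assumed for the example and are not taken from the text), a 16-port module with 155.52 Mb/s (OC-3) ports would require a shared-memory bandwidth of at least

    2NV = 2 x 16 x 155.52 Mb/s ≈ 4.98 Gb/s

before any allowance is made for the control circuitry.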
5.2.2 Shared Medium Architectures
A shared medium ATM switch architecture consists of a set of N inputs, N outputs,
and a common high speed medium, such as a parallel bus, as shown in Figure 5.3.
Incoming packets to the switch are time multiplexed onto this common high speed
medium. Each switch output has an address filter and a FIFO to store outgoing packets.
As the time multiplexed packets appear on the shared medium, each address filter
discards packets which are not destined for that output port and propagates those
which are. Packets which are passed by the address filter are stored in the output
FIFO and then transmitted on the network.
The shared medium ATM switch architecture is very similar to the shared memory switch
architecture in that incoming packets are time multiplexed onto a single
packet stream and then demultiplexed into separate streams. The difference in the
architectures lies in the partitioning of the storage memory for the output channels.
In the shared memory architecture, each parallel channel utilizes the dual-ported
memory for packet storage, while in the shared medium architecture each output port
has its own storage memory. The memory partitioning of the shared medium architecture
implies the use of a FIFO memory as opposed to a dual-ported memory in its
implementation.
For a shared medium switch, the number of ports and the port speeds of a switching
module are bounded by the bandwidth of the shared medium and FIFO memories
used in its implementation. The aggregate speed of the bus and FIFO memories must
be no less than NV bits/second or packets may be lost. Switches larger than what can
be built given the hardware limitations can be constructed by interconnecting
many switch modules in a multistage configuration.
Shared medium architectures, like the shared memory architectures, are inherently
free of internal blocking, which is characteristic of many Banyan-based and space-division-based
switches. Shared medium architectures are not, however, free from output
blocking. There must be buffers for each output or packets may be lost. The buffers
in this architecture are the FIFO memories. The required size of each of the output
buffer FIFOs is derived from the desired packet loss probability for the switch. By
modeling the expected average and peak loads seen by the switch, the FIFO buffer size
can be calculated to ensure that the desired packet loss probability is met.
Figure 5.3: Shared-Medium Switch Architecture (transfer rate of each input and output
= V bits/second)
It should be noted that the shared medium and shared memory architectures
are guaranteed to transmit packets in the same order in which they were received.
The shared medium and shared memory architectures both suffer from memory and
medium bandwidth limitations. This limitation inhibits the potential speed and number
of ports of switches designed around these architectures. They also have packet
loss rates which are dependent on the memory size of the switch.
5.2.3 Space Division Architectures
Space division ATM switch architectures are different from the shared memory and
shared bus architectures. In a space division switch, concurrent paths are established from the switch inputs to the switch outputs, each path with a data rate of V
bits/second. An abstract model for a space division switch is shown in Figure 5.4.
Space division architectures avoid the memory bottleneck of the shared memory and shared medium architectures since no memory component in the switching
fabric has to run at a rate higher than 2V. Another distinct feature of the space
division architecture is that the control of the switch need not be centralized, but
may be distributed throughout the switching fabric. This type of architecture, however,
exchanges memory bandwidth problems for problems unique to space division
architectures.
Figure 5.4: Abstract Model for Space Division Switches (inputs, concentrators, buffers,
routers, and outputs; transfer rate of each input and output = V bits/second)
Depending on the particular internal switching fabric used and the resources available to establish paths from the inputs to the outputs, it may not be possible for all
required paths to be set simultaneously. This characteristic, commonly referred to
as internal blocking, potentially limits the throughput of the switch and thus becomes
the central performance limitation of space division switch architectures.
An issue related to space division architectures is buffering. In fabrics exhibiting
internal blocking, it is not possible to buffer packets at the outputs, as is possible
in shared memory and shared bus type switches. Instead, buffers must be located
at places where potential conflicts along paths may occur, or upstream of them.
Ultimately, buffers may be placed at the inputs of the switch. The placement of
buffers has an important effect on the performance of a space division switch as well
as on its hardware implementation.
Crossbar Switch
A crossbar switching fabric consists of a square array of N^2 crosspoint switches, one for
each input-output pair, as shown in Figure 5.5. A crosspoint switch is a transmission
gate which can assume two states, the cross state and the bar state. Assuming that
all crosspoint switches are in the cross state, to route a packet from input line i to
output line j it is sufficient to set the (i,j)th switch to the bar state and leave the
switches (i,k), k=1,2,...,j-1 and the switches (k,j), k=i+1,...,N in the cross state. The
state of any other switch is irrelevant.
Figure 5.5: Crossbar Fabric (crosspoints shown in the bar and cross states; transfer rate
of each input and output = V bits/second)
In the crossbar fabric, unlike the shared medium and shared memory switches,
there is no need for output buffering since there can be no congestion at the output of
the switch. In this architecture, buffering at the output is replaced with buffering at
the input. One difference between the output buffering of shared medium and shared
memory switches and the input buffering of the crossbar switch is that the memory
bandwidth of the crossbar switch need only meet the transfer rate V of the switch
input, not NV as is found in the shared medium and shared memory switches. This
diminishes the performance limitation due to memory bandwidth found in the
shared medium and shared memory switches.
As long as there is no output conflict, all incoming packets can reach their respective
destinations free from blocking, due to the existence of N^2 crosspoints in
the fabric. If, on the other hand, there exists more than one packet in the same slot
destined to the same output, then only one of these packets can be routed to that
output, due to the contention for the same arc in the fabric. The remaining packets
must be buffered at the input or dropped somewhere within the switch.
The size of the input buffers for a crossbar fabric has a direct effect on the packet
loss probability of the switch, in the same way the size of the output buffers affects the
packet loss probability of the shared memory and shared medium switches. Similarly
to the shared memory and shared medium switches, by modeling the expected average
and peak loads seen by the switch, the input buffer size can be calculated to ensure that
the desired packet loss probability is met.
The limitation of a crossbar switch implementation is that it requires N^2 crosspoints
in the fabric, and therefore its realizable size is limited. Crossbar fabrics also
have the drawback of not having constant transit times for all input/output pair
combinations. Also, when self-routing is used in this architecture, the processing
performed at each crosspoint requires knowledge of the complete port address, another
drawback.
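The routing rule quoted at the start of this subsection can be expressed directly (an illustrative sketch; a real fabric sets its crosspoints in hardware):

    def crosspoint_states(i: int, j: int, n: int) -> dict:
        """Return the crosspoint settings needed to route input i to output j in an
        n x n crossbar: (i, j) goes to the bar state, the crosspoints (i, k), k < j,
        and (k, j), k > i, stay in the cross state; the rest are irrelevant."""
        states = {(i, j): "bar"}
        for k in range(1, j):                 # (i, k), k = 1..j-1
            states[(i, k)] = "cross"
        for k in range(i + 1, n + 1):         # (k, j), k = i+1..N
            states[(k, j)] = "cross"
        return states

    # Example: route input 2 to output 3 in a 4x4 crossbar.
    print(crosspoint_states(2, 3, 4))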
Knockout Switch Architecture
A Knockout switch ATM architecture, as illustrated in Figure 5.6, consists of a set of
N inputs and N outputs. Each switch input has associated with it its own packet
routing circuit, which routes each input packet to its destined output interface. Each
output interface contains N address filters, one for each input line, which drop packets
not destined for that output port. The outputs of each of the address filters for a
particular switch output are connected to an NxL concentrator, which selects up to L
packets of those passing through the N address filters in a given slot. If more than
L packets are present at the input of the concentrator in a given slot, only L will be
selected and the remaining packets will be dropped. The use of the NxL concentrator
simplifies the output interface circuitry by reducing the number of output buffer FIFO
memories and the control circuit complexity without exceeding the required packet
loss probability of the switch. For uniform traffic, a packet loss rate of 10^-6 is achieved
with L as small as 8, regardless of the switch load and size [14].
The Knockout switch architecture is practically free of internal blocking. The
potential for packet loss in the NxL concentrator implies the presence of internal
blocking, since the presence of L+1 packets at the input of the NxL concentrator will
result in packet loss; this loss could conceivably be diminished with the use of
buffers at the input of the NxL concentrator.
Figure 5.6: Knockout Switch basic structure [5]
The Knockout switch architecture is not free from output blocking and therefore
must provide buffering for each output or packets may be lost. The buffers in this
architecture are the FIFO memories. The required size of each of the output buffer
FIFOs is derived from the desired packet loss probability for the switch. By modeling
the expected average and peak loads seen by the switch, the FIFO buffer size can be
calculated to ensure that the desired packet loss probability is met. The bandwidth
requirement of the Knockout switch is only LV, as compared to the NV bandwidth
requirement of the shared medium switch. This yields a potential performance gain of
N/L for the Knockout switch over the shared medium switch where memory bandwidth
is concerned.
Although the Knockout switch is a space division switch like the crossbar switch,
each of the inputs has a unique path to the output buffers, much like the shared
medium switch, and thus circumvents the internal blocking problem. The lack of
internal blocking sets the Knockout switch apart from other space division switch
architectures like the crossbar switch architecture.
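A toy model of the knockout behaviour of the NxL concentrator in a single slot (illustrative only; the hardware concentrator is implemented quite differently):

    def concentrate(arrivals: list, L: int):
        """Pass at most L packets destined for this output in a given slot;
        any excess packets are 'knocked out' (dropped)."""
        passed = arrivals[:L]
        dropped = arrivals[L:]
        return passed, dropped

    # Example: 11 packets arrive for one output in a slot, L = 8 -> 3 are dropped.
    passed, dropped = concentrate([f"pkt{k}" for k in range(11)], L=8)
    print(len(passed), len(dropped))   # 8 3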
Integrated Switch Architecture
In an NxN Integrated Switch fabric, a binary tree is used to route data from each
switch input to a shift register contained in an output buffer, as shown in Figure 5.7.
Each shift register is equal in size to one packet. During every time slot, the contents
of all N registers corresponding to a given output line are emptied sequentially into
an output FIFO memory. This function is performed by a multiplexor running at N
times the input line rate [6].
Figure 5.7: Integrated Switch Fabric [8]
The Integrated switch architecture is not free from output blocking and therefore
must provide buffering for each output or packets may be lost. The required size of
each of the output buffer FIFOs is derived from the desired packet loss probability
for the switch. By modeling the expected average and peak loads seen by the switch,
the FIFO buffer size can be calculated to ensure that the desired packet loss probability
is met.
This architecture, while still a space division architecture, is similar to the shared
medium architecture if the self routing tree and packet storage shift registers are
thought of as the shared medium. The FIFO memory and the circuitry multiplexing
each of the packets stored in the shift registers associated with each output of the Integrated switch must operate at a rate NV bits/second much like the shared medium
architecture output FIFO memories. This performance limitation inhibits the potential speed and number of ports of switches designed around these architectures. It
also has a packet loss rate which is dependent on the memory size of the output
FIFO memory.
It should be noted that the Integrated switch fabric is guaranteed to transmit
packets in the same order in which they were received.
Banyan-Based Fabrics
Alternatives to the crossbar switch implementation have been based on multistage
interconnection networks, generally referred to as Banyan networks.
A multistage interconnection network for N inputs and N outputs, where N is a
power of 2, consists of log2N stages each comprising N/2 binary switching elements,
and interconnection lines between the stages placed in such a way as to allow a path
from each input to each output as shown in Figure 5.8.
Figure 5.8: Banyan Interconnection Network [5].
The example is constructed by starting with a binary tree connecting the first
input to all N outputs and then proceeding with the construction of similar trees
for the remaining inputs, always sharing the binary switches already existing in the
network to the maximum extent possible.
An NxN multistage interconnection network possesses the following properties:
• There exists a single path which connects an input line to an output line. The
establishment of such a path may be accomplished in a distributed fashion using a
self-routing procedure.
• All networks allow up to N paths to be established between inputs and outputs
simultaneously. The number of paths is a function of the specific pattern of requests
that is present at the inputs.
• The networks possess a regular structure and exhibit a modular architecture.
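The distributed self-routing procedure can be sketched in a few lines (illustrative; the stage-to-address-bit convention assumed below is one common choice and is not specified in the text):

    def banyan_route(dest: int, n: int) -> list:
        """Return the sequence of per-stage decisions ('upper' or 'lower') that
        self-routes a packet to output 'dest' in an n x n Banyan network: stage k
        examines one bit of the destination address, most significant bit first."""
        stages = n.bit_length() - 1          # log2(n) stages, n a power of 2
        path = []
        for k in range(stages - 1, -1, -1):  # MSB first
            bit = (dest >> k) & 1
            path.append("lower" if bit else "upper")
        return path

    # Example: routing to output 6 (binary 110) in an 8x8 network.
    print(banyan_route(6, 8))   # ['lower', 'lower', 'upper']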
The drawback to Banyan-type switches is internal blocking and the resulting
throughput limitations. Simulation results have shown that the maximum throughput
attainable for a Banyan switch is much lower than that of a crosspoint switch.
Throughput diminishes with increasing numbers of switch ports for the Banyan
switches. Switch architectures based on Banyan networks are distinguished by
their means of overcoming the shortcomings of internal blocking and improving the
switch's throughput and packet loss performance [6].
One way to enhance the Banyan architecture is to place buffers at the points of
routing conflicts. Switches of this type are referred to as Buffered-Banyan switches.
Another method of enhancement involves the use of input buffering and of blocking
packets at the inputs based upon control signals, to prevent internal blocking. Performance
can also be improved by sorting input packets in order to remove output conflicts and
presenting to the Banyan router packet permutations which are guaranteed not to
block. Packets not routed in a time slot using this method are buffered and retried. This
method is referred to as the Batcher-Banyan switching fabric [5].
Buffered-Banyan Fabrics
In a Buffered-Banyan switch, packet buffers are placed at each of the inputs of the
crosspoint switches. If a conflict occurs while attempting to route two packets
through the Banyan fabric, only one of the conflicting packets is forwarded. The
other packet remains in the buffer, and routing is retried during the next time slot.
The size of the buffer used in this implementation has a direct, positive effect on the
throughput performance of the switch, and its size is chosen to improve the packet
loss performance of the switch.
The positive effect of adding buffers to the front end of a Banyan switch is diminished if
internal conflicts persist within the interconnection network. An illustration of
the potential internal routing congestion is shown in Figure 5.9. This routing congestion
problem is especially severe where two heavily loaded paths through the interconnection
network need to share an internal link, as also illustrated in Figure 5.9.
Figure 5.9: Congestion in a Buffered-Banyan switching fabric
To diminish the effect of internal link congestion, a distribution network can be
placed on the front end of the routing network. A distribution network is a Banyan
network used to distribute incoming packets across all of the switch inputs. This is
done by alternately routing packets to each of the outputs of every crosspoint in the
switch, paying no regard to the destination address of the packet. The combination
of a distribution network and a routing network is equivalent to a Benes interconnection
network, as shown in Figure 5.10 [11]. The number of possible paths through a Benes
network is greater than one, thus offering the potential for reducing the number of
internal conflicts during packet routing. If routing requests are made which result in
an internal conflict within the network, the routing through the network can be
rearranged to eliminate the conflict. This property makes the Benes network
rearrangeably non-blocking [6].
Figure 5.10: An 8x8 Benes network
Batcher-Banyan Fabrics
An alternative way of improving the throughput of a self-routing Banyan network
is to process the inputs before introducing them into the network, as is done in the
Batcher-Banyan switch fabric shown in Figure 5.11 [?]. The Batcher-Banyan network
is based on the following property: any set of k packets, k ≤ N, which is free of
output conflicts, sorted according to output addresses, and concentrated at the top k
lines is realizable by the OMEGA network [6].
Incoming packets are sorted according to their requested output addresses by
a Batcher sorter, which is based on a bitonic sorting algorithm and has a multistage
structure similar to an interconnection network [12]. Packets with conflicting output
addresses are then removed. This is accomplished with a running adder, usually referred
to as the "trap network". Remaining packets are concentrated to the top lines
of the fabric. This may be accomplished by means of a reverse OMEGA network.
Concentrated packets are then routed via the OMEGA network. Packets which are
not selected by the trap network are recirculated, being fed back into the fabric in
later slots. A certain number of input ports, M, are reserved for this purpose, thus
reducing the number of input/output lines the switching fabric can serve. Since the
number of recirculated packets may exceed M, buffering of these packets may still be
required. M and the buffer size are selected so as not to exceed a given loss rate [5].
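Once the packets are sorted by output address, the trap step can be modelled very simply (illustrative; as noted above, the hardware uses a running adder across adjacent lines):

    def trap(sorted_packets: list):
        """Given packets sorted by destination output, keep one packet per output and
        mark the rest for recirculation (they conflict with the packet above them)."""
        selected, recirculated = [], []
        previous_dest = None
        for dest, payload in sorted_packets:
            if dest != previous_dest:
                selected.append((dest, payload))   # first packet for this output wins
                previous_dest = dest
            else:
                recirculated.append((dest, payload))
        return selected, recirculated

    # Example: two packets contend for output 3; one is recirculated to a later slot.
    sel, rec = trap([(1, "a"), (3, "b"), (3, "c"), (7, "d")])
    print(sel, rec)   # [(1, 'a'), (3, 'b'), (7, 'd')]  [(3, 'c')]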
Multiple Banyan Switching Fabrics
There are two ways of using multiple banyan switches, in series and in parallel. The
use of multiple Banyan switches helps overcome the problem of internal blocking
by providing more paths between inputs and outputs. As shown in [?, ?], multiple
Banyan networks can be used in parallel to reduce the input load on each network.
By reducing the input load of a Banyan network, the probability of internal conflicts
is reduced, thus increasing the potential throughput of the switch. The outputs of
the parallel Banyan networks are merged into output port buffers. The throughput
of the multiple parallel Banyan switch improves as the number of parallel networks
is increased.
Figure 5.11: Batcher-Banyan switch architecture
The tandem Banyan switching fabric, as introduced in [?], overcomes internal
blocking and achieves output buffering without having to provide N^2 disjoint paths.
It consists of placing multiple copies of the Banyan network in series, thus increasing
the number of realizable concurrent paths between inputs and outputs. The switching
elements are also modified to operate as follows. Upon a conflict between two packets
at some crosspoint, one of the two packets is routed as requested and the other packet
is marked and routed the other way. At the output of the first Banyan network, packets
which were routed correctly are placed in output port buffers, and those packets
marked as incorrectly routed are unmarked and placed on the inputs of the second
Banyan fabric for further processing. This process is repeated through K Banyan
networks in series. Note that the load on successive Banyan networks decreases, and
therefore so does the likelihood of internal conflicts. By choosing a sufficiently large
K, it is possible to increase the throughput and decrease the packet loss rate to the
desired levels.
5.3 Host Network Interface
The host-to-network interface imposes excessive overhead in the form of processor
cycles, system bus capacity, and host interrupts. The host is interrupted by every
received packet. With bursty traffic, the host barely has time to do any computation
while receiving packets. High-performance distributed systems cannot provide
high throughput and low latency for their applications unless the host-network interface
bottleneck has been resolved. The current advances in I/O buses that
can operate at 100 Mbytes/second, efficient implementations of standard protocols,
and the availability of special-purpose VLSI chips (e.g., HIPPI chips, ATM chips) are
not by themselves sufficient to solve the host-network interface problem [29]. Intensive
research efforts have been published to characterize the overheads
related to computer network communications [?]. However, it is generally agreed
that there is no single source of overhead, and one needs to streamline all communication
functions required to transfer information between a pair of computers. For
example, Figure 5.12 shows the communication functions required to send messages
over the socket interface [29].
These functions can be grouped into three classes:
1. Application Overhead: This represents the overhead incurred by using socket
system calls to set up the socket connection (for both connectionless and
connection-oriented service) between the sender and the receiver. This overhead
can be reduced by using lightweight protocols that do not involve heavy setup
overhead, or by using a permanent or semipermanent connection between the
communicating computers.
2. Packet Overhead: This represents the overhead encountered when sending or
receiving packets (e.g., TCP, UDP, IP, medium access protocol, physical layer,
and interrupt handling). This overhead can be reduced by using lightweight
protocols and/or running these functions on the host network interface. This
will allow the host to spend more time processing application tasks instead of
performing CPU-intensive protocol functions.
Figure 5.12: Communication functions for the socket network interface
3. Data Overhead: This represents the overhead associated with copying and
checksumming the data to be transferred or received. This overhead increases
with the data size. When the network is operating at high speed, this
overhead becomes the dominant overhead, especially when the size of the data
to be transferred is large. The main limiting resource is the
memory-bus bandwidth and thus to reduce this overhead, one needs to reduce
the number of bus cycles required for each data transfer. Figure 5.13 depicts the
data movement that occurs during the transmission of a message the inverse
path is followed when a message is received. The dashed line indicates the
checksum calculation. For the traditional host network interface, the bus is
accessed ve times for every transmitted word. This number of accesses will
even be increased further if the host writes the data rst to a device buer
before it is sent to the network buer. The number of transfers can be reduced
to three by writing the data directly to the interface buer instead of using
the system buer (Figure 5.13(b)). In this case the checksum is computed as
the data being copied. In addition to reducing the bus overhead, the use of
the interface buer will allow the transmission of packets at the network speed,
independent of the speed of host bus system. The use of DMA can even reduce
the number of transfers to two (see Figure 5.13(c)). The use of DMA provides
the capability to support burst transfers.
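As an illustration of combining the copy and checksum passes so that the host touches each word only once, the sketch below (our own example in C, using the Internet one's-complement checksum rather than any particular interface's hardware) accumulates the checksum while copying the data into the interface buffer. A hardware implementation would perform the same accumulation in the datapath as the words stream across the bus.

    #include <stddef.h>
    #include <stdint.h>

    /* Copy 'len' bytes from the source buffer into the interface buffer and
     * accumulate the 16-bit one's-complement (Internet) checksum on the fly,
     * so each byte crosses the host memory bus only once on this side. */
    static uint16_t copy_and_checksum(void *dst, const void *src, size_t len)
    {
        const uint8_t *s = src;
        uint8_t *d = dst;
        uint32_t sum = 0;

        while (len >= 2) {
            sum += (uint16_t)((s[0] << 8) | s[1]);   /* 16-bit big-endian word */
            d[0] = s[0];
            d[1] = s[1];
            s += 2; d += 2; len -= 2;
        }
        if (len) {                                   /* odd trailing byte, zero padded */
            sum += (uint16_t)(s[0] << 8);
            d[0] = s[0];
        }
        while (sum >> 16)                            /* fold carries into the low 16 bits */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;                       /* one's complement of the sum */
    }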
5.3.1 Host Network Interface Architectures
The host-network interface should consume fewer CPU and bus cycles so that it can
communicate at higher rates and thus leave more CPU cycles for application
processing. Also, the architecture of the host-network interface should be able to support
a variety of computer architectures (workstations, parallel and supercomputers, and
special-purpose parallel computers). Furthermore, the host interface should cost only
a small fraction of the host itself and should be able to run standard and
lightweight communication protocols efficiently. The existing host-network interface architectures can be broadly grouped into four categories [30]: operating system based DMA
interfaces, user-level memory-mapped interfaces, user-level register-mapped interfaces,
and hardwired interfaces.
1. OS-Level DMA-Based Interfaces
The DMA interface handles the transmission and the receiving of messages
under the control of the operating system. At the hardware level, the sending
and receiving of messages start by initiating a DMA transfer between the main
memory and the network interface. At the software level, the transmission
of a message is carried out by copying the message into the memory and then
making a SEND system call, which initiates the DMA transfer from the memory
to the network interface buffer. Similarly, the receiving of messages requires the
program at the receiving side to issue a RECEIVE system call. Since the
operating system is involved in handling the sending and receiving of messages, the
latency can be quite high, especially when UNIX socket system calls are used.
Efficient network interface designs should avoid getting the operating system
involved in sending and receiving messages as much as possible. However, such
interface designs should provide protection among different applications,
since the operating system is not involved in the transfer of messages over the
network.
2. User-Level Memory-Mapped Interfaces:
More recent processor-network interface designs make sending and receiving
messages user-level operations. The salient feature of this technique is that the
network interface buffer can be accessed with a latency similar to that of accessing
the main memory buffer. In this scheme, the user process is responsible for
composing the message and executing the SEND or RECEIVE command. The
host is notified about the arrival of a message either by polling the status of
the network interface or by an interrupt (a minimal polling sketch appears after
this list). The host-network interface buffer can be colocated within the main
memory or connected directly to the memory buses of the host.
3. User-Level Register-Mapped Interface:
The memory-mapped network interface design can be further improved by replacing
the buffer with a set of registers. The transmission of a message requires
several store operations to the memory-mapped network interface buffer. Similarly,
the receiving of a message requires several load operations. By using
on-chip registers, we can eliminate many of these store and load operations by
mapping the interface into the processor's register file. Hence, an arriving message
can be stored in a predetermined set of registers. Similarly, the data of
an outgoing message can be stored directly into another predefined set of general
registers. Since the registers can be accessed with a lower latency than the
memory buffer, mapping the processor-network interface into the register file
can achieve low-overhead and high-bandwidth communication.
4. Hardwired Interface
In this scheme, hardware mechanisms are used to handle the sending and receiving
of messages as well as the interpretation of incoming messages. This scheme
is usually adopted in systems that are built around shared memory and/or
dataflow models rather than the general message passing model. This scheme is
not suitable for a general message passing model because the user and compiler
have no control over the process of sending and receiving messages. This
process is fixed and done completely in hardware and thus cannot be changed to
optimize performance.
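The sketch below illustrates the user-level memory-mapped style of category 2 above. The register layout, status bits, and doorbell register are hypothetical, invented only to show the polling-based send and receive sequence performed entirely at user level, without system calls.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical memory-mapped network interface: the layout, status bits,
     * and doorbell register are illustrative, not a real device. */
    struct nic_regs {
        volatile uint32_t status;       /* bit 0: TX ready, bit 1: RX pending */
        volatile uint32_t tx_doorbell;  /* write the length to launch a send  */
        volatile uint32_t rx_len;       /* length of the received message     */
        volatile uint8_t  tx_buf[2048];
        volatile uint8_t  rx_buf[2048];
    };

    #define NIC_TX_READY   0x1u
    #define NIC_RX_PENDING 0x2u

    /* User-level send: copy the message into the mapped buffer and ring the
     * doorbell; the operating system is not involved. */
    static void nic_send(struct nic_regs *nic, const void *msg, uint32_t len)
    {
        while (!(nic->status & NIC_TX_READY))
            ;                                    /* spin until the transmitter is free */
        memcpy((void *)nic->tx_buf, msg, len);
        nic->tx_doorbell = len;                  /* launch the transfer */
    }

    /* User-level receive by polling (an interrupt could be used instead). */
    static uint32_t nic_recv(struct nic_regs *nic, void *msg, uint32_t max)
    {
        while (!(nic->status & NIC_RX_PENDING))
            ;                                    /* poll for a message arrival */
        uint32_t len = nic->rx_len < max ? nic->rx_len : max;
        memcpy(msg, (const void *)nic->rx_buf, len);
        nic->status &= ~NIC_RX_PENDING;          /* acknowledge the message */
        return len;
    }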
5.3.2 Host-Network Interface Examples
A Communication Acceleration Block (CAB)
A host-network interface architecture, the Communication Acceleration Block (CAB),
was designed by a group of researchers at Carnegie Mellon University. The goals of
the CAB design are to minimize data copies, reduce host interrupts, support DMA
and hardware checksumming, and control network access. This host architecture is
applicable to a variety of computer architectures (supercomputers, special-purpose
parallel computers, iWARP, and workstations).
[Figure 5.14: Block diagram of the generic (CAB) network interface: transmit and receive subsystems, each with registers, a host bus interface, SDMA and MDMA engines, checksum logic, network memory, and a MAC block connecting to the network.]
Figure 5.14 shows a block diagram of the CAB architecture. The CAB consists
mainly of two subsystems: transmit and receive. The network memory in each subsystem
is used for message buffering and can be implemented using Video RAM
(VRAM). System Direct Memory Access (SDMA) is used to handle data transfer between
the main memory and the network memory, whereas the media DMA (MDMA)
handles the transfer of data between the network media and the network memory.
Since the TCP and UDP protocols place the checksum in the packet header, the checksum
of a transmitted packet is calculated when the data is written into network memory
and is then placed in the header by the CAB in a location that is specified by the
host as part of the SDMA request. Similarly, the checksum of an incoming packet is
calculated when the data flows from the network into network memory.
To off-load the host from controlling the access to the network medium, the CAB
hardware performs the Medium Access Control (MAC) of the supported network
under the control of the host operating system. It is based on multiple "logical
channels," queues of packets with different destinations. The CAB attempts to send
a packet from each queue in a round-robin fashion. The exact MAC is controlled by
the host through retry frequency and time-out parameters for each logical channel.
The register files on both the transmit and receive subsystems are used to queue host
requests and return tags. The host interface implements the bus protocol for the
specific host. The transmit and receive subsystems can either have their own bus
interfaces or they can share a bus interface.
Nectar Network
Nectar is a high-speed fiber-optic network developed at Carnegie Mellon University as
a network backplane to support distributed and heterogeneous computing [34, 35, 36].
The Nectar system consists of a set of host computers connected in an arbitrary
mesh via crossbar switches (hubs). Each host uses a communication processor (CAB:
Communication Accelerator Board) as its interface to the Nectar network. A CAB is
connected to a hub using a pair of unidirectional fiber-optic links. The network can
be extended arbitrarily by using multiple hubs, where the hubs can be interconnected
in any desired topology using fiber-optic pairs identical to those used for CAB-hub
interconnections. The network supports circuit switching, packet switching, and
multicast communication through its 100 Mbps optical links.
There are three major blocks in the CAB: the processing unit, the network interface, and
the host interface, as shown in Figure 5.15. All three blocks have high-speed access to
a packet memory. The processing unit consists of a processor, program memory,
and supporting logic such as timers. The network interface consists of fiber-optic
data links, queues for buffering data streams, DMA channels for transmission and
reception, and associated control and status logic. The host interface logic includes
slave ports for the host to access the CAB, a DMA controller, and other logic. The
packet memory has sufficient bandwidth to support accesses by the three major blocks
at a rate that keeps up with the speed of the fiber. Accesses to the packet memory
by the different blocks are arbitrated at each cycle using an efficient round-robin
mechanism to avoid conflicts.
[Figure 5.15: Block diagram of the Nectar CAB.]
The CAB design provides flexible and programmable features for the Nectar network.
The Nectar network can be used as a conventional high-speed LAN by treating the
CAB as a network device. The CAB can be used as a protocol processor by off-loading
transport protocol processing from the host processor. Various transport protocols
have been implemented, including TCP and TP4. Another feature of the CAB is that the
application interface is provided by a programming library called Nectarine, so that part
of the application code can be executed on the CAB.
5.3.3 A Tightly Coupled Processor-Network Interface
The approach presented here was developed by a group of researchers at the Massachusetts
Institute of Technology and is based on a user-level register-mapped interface
[30]. In this approach, the most frequent operations, such as dispatching, forwarding,
replying, and testing for boundary conditions, are done by hardware and by
mapping the network interface into the processor's register file. This approach is suitable
for short messages and may not be efficient for handling large message sizes.
Figure 5.16 shows the programmer's view of the interface, which consists of 15
interface registers together with an input message queue and an output message
queue. The interface uses five output registers, o0 through o4, to send messages; five
input registers, i0 through i4, to receive messages; the CONTROL register, which contains
values to control the operation of the network interface; the STATUS bits, which indicate
the current status of the network interface; and the remaining registers, which are used for
optimizing message dispatch. Input and output queues are used to buffer messages
being received or transmitted, respectively. The message is typically assumed to be
short and consists of five words, m0 through m4, and a 4-bit type field used for optimization.
The logical address of the destination processor is specified by the high bits of the
first word. The network interface is controlled by SEND and NEXT commands.
The SEND command queues messages from the output registers into the output queue,
whereas the NEXT command stores messages from the input queue into the input
registers. The network interface, together with the network, enforces flow control
at the sending processor. If a message is long and cannot fit into five words, the
architecture can be extended to send and receive variable-length messages by using
the input and output registers as scrolling windows. To achieve this, two commands,
SCROLL-IN for incoming messages and SCROLL-OUT for outgoing messages, are used.
The interface design provides some support for handling important messages (e.g.,
those destined to the operating system) in a privileged manner. This is done
by allowing an incoming privileged message to interrupt the host or to be stored in
a privileged memory location until the host is free to process that message.
The performance of the basic architecture shown in Figure 5.16 can be further
improved by applying several refinements: 1) use a 4-bit message type identifier
instead of a 32-bit identifier; 2) avoid the overhead of copying the common
parts of the message fields that will be used in replying to or forwarding part of a
message. This is done through the use of two special modes of the SEND command,
REPLY and FORWARD. The SEND command composes an outgoing message using
certain input registers in place of certain output registers, thus removing the need
to copy; and 3) precompute in hardware the instruction address of the handler for the
incoming message in the MsgIp register (see Figure 5.16). To compute
MsgIp, the network interface replaces certain bits of the IpBase register with the
type bits of the arrived message. Another register, NextMsgIp, overlaps the processing
of one message with the dispatching of the next message. It computes the handler
address for the next message, just as MsgIp does for the current message.
[Figure 5.16: User-level register-mapped interface: the processor interface exposes output registers o0-o4, input registers i0-i4, and the Control, Status, IpBase, MsgIP, and NextMsgIP registers, connected to the message output and input queues between the processor and the network.]
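The following C fragment mimics the SEND/NEXT programming model described above. In the actual design, o0-o4, i0-i4, and STATUS live in the processor's register file and the commands are encoded in instruction bits, so the variables and stub functions below are only a software model of the programming sequence, not an implementation.

    #include <stdint.h>

    /* Software model of the register-mapped interface; in hardware these are
     * register-file entries and instruction bits, and the STATUS bits are set
     * by the interface itself rather than by this code. */
    static uint32_t o[5], i[5];            /* output / input message registers */
    static uint32_t status;                /* bit 0: output queue full,
                                              bit 1: input message available   */

    static void send_cmd(uint32_t type) { (void)type; /* SEND: enqueue o0-o4 with a 4-bit type */ }
    static void next_cmd(void)          { /* NEXT: load the next message into i0-i4 */ }

    /* Send a five-word message: the high bits of the first word carry the
     * logical destination address, and 'type' is the 4-bit message type. */
    static void send_short(uint32_t first_word, uint32_t type, const uint32_t body[4])
    {
        while (status & 0x1)
            ;                              /* flow control: wait for queue space */
        o[0] = first_word;
        for (int k = 0; k < 4; k++)
            o[k + 1] = body[k];
        send_cmd(type);
    }

    /* Receive the next five-word message into msg[0..4]. */
    static void recv_short(uint32_t msg[5])
    {
        while (!(status & 0x2))
            ;                              /* poll the STATUS bits (or take an interrupt) */
        next_cmd();
        for (int k = 0; k < 5; k++)
            msg[k] = i[k];
    }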
This interface design can be implemented in several different ways:
Off-Chip Cache-Based Implementation: This implementation maps the network
interface into a second-level off-chip cache (see Figure 5.17(a)). Thus
this interface becomes another data cache chip on the processor's external data
cache bus. This interface is easy to implement since it does not require modifications
of the processor chip, but it is slower than an on-chip interface.
On-Chip Cache-Based Implementation: This implementation is identical to the
previous one, except that the network interface sits on an internal data cache
bus rather than an external one (see Figure 5.17(b)). Although the network
interface is added into the processor chip, it does not modify the processor core,
and it only communicates with the processor via the internal cache bus. Network
interface accesses are somewhat faster because the network interface is on-chip.
[Figure 5.17: Three implementations of the network interface: (a) an off-chip cache-based implementation, (b) an on-chip cache-based implementation, and (c) a register-file-based implementation.]
Register-File-Based Implementation: The network interface registers take up
part of the processor's register file and can be accessed like any other scalar
register. The network interface commands are encoded into unused
bits of every triadic (three-register) instruction. It may take no additional cycles
to access up to three network interface registers and to send commands to the
network interface. Thus this interface is the most efficient of the considered
interfaces.
5.3.4 ATOMIC Host Interface
ATOMIC is a point-to-point interface that supports Gbps data rates, developed at
USC/Information Sciences Institute [?]. The goal of this design is to develop LANs
based on MOSAIC technology [?] that support fine-grain, message-passing, massively
parallel computation. Each MOSAIC chip is capable of routing variable-length
packets as a fast and smart switching element, while providing added value through
simultaneous computing and buffering.
A MOSAIC-C Chip
The architecture of a MOSAIC-C chip is illustrated in Figure 5.18. Each MOSAIC chip
has a processor and associated memory. This processing capability can be
utilized to filter messages, execute protocols, and arrange that data be delivered in
the form expected by an application or virtual device specification. It communicates
over eight external channels, four in the X-direction (east, west) and four in the
Y-direction (south, north). All eight channels may be active simultaneously. Unless a
MOSAIC-C node is the source or destination for a message, messages pass through
its Asynchronous Router logic on their way to other nodes without interrupting the
processor. When a node is either the source or the destination, packet data is transferred by
a DMA controller in the Packet Interface.
The MOSAIC chip acts as a switching element in ATOMIC and is interconnected
in a multi-star configuration (see Figure 5.19). In this design, a message sent from
node 1 to node 7 does not interfere with messages sent from node 4 to node 6 or from
node 5 to node 2; nor does a message sent from node 8 to node 9 interfere with
these messages.
[Figure 5.18: MOSAIC-C processor chip: a 14-MIPS processor, 2-Kbyte ROM, 64-Kbyte RAM, a packet interface, and an asynchronous router connected to 640 Mb/s channels.]
[Figure 5.19: Message transfer in MOSAIC channels among nodes 1 through 9 interconnected in a multi-star configuration.]
The MOSAIC host-interface board is a versatile, multi-purpose circuit board that
houses a number of MOSAIC nodes, with MOSAIC channels on one side and a
microprocessor-bus interface on the other side. It is shown in Figure 5.20.
[Figure 5.20: SBus host-interface board built from memoryless MOSAIC chips, with ribbon-cable connectors, interface logic, memory, and an SBus connection.]
The purposes of the host-interface board are as follows:
• To verify the operation of the MOSAIC processor, router, and packet interface,
fabricated together in the memoryless MOSAIC chip.
• To provide a software-development platform. The bus interface on the boards
presents the MOSAIC memory as a bank of ordinary RAM. Device-driver and
support libraries allow low-level access to the host-interface boards by user programs
on the host, which map the MOSAIC memory into their internal address space.
• To serve as the interface between hosts and MOSAIC node arrays, as shown in
Figure 5.21.
The ribbon-cable connectors in Figure 5.20 are wired to selected channels on
the memoryless MOSAIC chips. Programs running on the memoryless MOSAIC
nodes communicate with the MOSAIC node array by message passing through the
ribbon-cable connectors, and with the host computer by shared memory through
the bus interface. Host-interface boards can also be chained together using ribbon
cables, forming a linear array of MOSAIC nodes; this is the basis for using MOSAIC
components to build a LAN.
[Figure 5.21: Standard connection of MOSAIC host interfaces and arrays.]
ATOMIC
ATOMIC has attributes not commonly seen in current LANs:
• Hosts do not have absolute addresses. Packets are source routed, relative to
the senders' positions. At least one host process is an Address Consultant (AC),
which can provide a source route to all the hosts on that LAN by mapping IP
addresses to source routes (a minimal lookup sketch follows this list).
• ATOMIC consists of multiple interconnected clusters of hosts.
• There are many alternate routes between a source and a destination. This
flexibility, exploited by an AC, provides bandwidth guarantees or minimizes switch
congestion for high-bandwidth flows, and allows load balancing across a cluster.
• Since ATOMIC traffic flows do not interfere with each other unless they share
links, the aggregate performance of the entire network is limited only by its
configuration.
• Each MOSAIC processor allows the network itself to perform complex functions
such as encryption or protocol conversion.
• Topological flexibility and programmability make ATOMIC suitable for a wide
range of applications. ATOMIC supports the IP protocol and therefore all the
communication protocols above it, such as UDP, TCP, ICMP, TELNET, FTP, SMTP,
etc.
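As a purely illustrative sketch (ATOMIC's actual AC protocol and data structures are not described here), an Address Consultant's mapping from IP addresses to source routes could be modeled along the following lines:

    #include <stdint.h>

    /* Hypothetical Address Consultant table: each entry maps an IP address to a
     * source route, i.e. the per-hop channel selections relative to the sender.
     * The layout and limits are our own illustrative choices. */
    #define MAX_HOPS  16
    #define MAX_HOSTS 64

    struct source_route {
        uint8_t hops;                 /* number of hops to the destination   */
        uint8_t channel[MAX_HOPS];    /* outgoing channel chosen at each hop */
    };

    struct ac_entry {
        uint32_t ip_addr;             /* destination IP address              */
        struct source_route route;    /* route relative to this sender       */
    };

    static struct ac_entry ac_table[MAX_HOSTS];
    static int ac_entries;

    /* Resolve an IP address into a source route; returns 0 (NULL) if unknown,
     * in which case the sender would query the Address Consultant process. */
    static const struct source_route *ac_lookup(uint32_t ip_addr)
    {
        for (int k = 0; k < ac_entries; k++)
            if (ac_table[k].ip_addr == ip_addr)
                return &ac_table[k].route;
        return 0;
    }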
An example of an ATOMIC LAN based on MOSAIC chips, together with its connection
to an external LAN, is illustrated in Figure 5.22.
[Figure 5.22: Netstation LAN topology: netstations attached to an ATOMIC cluster with links to other clusters.]
Host-Interface Processor (HIP)
HIP is a communication processor developed at Syracuse University [?] that is capable of
operating in two modes, such that either or both of these modes can
be active at a given time. HIP is a master/slave multiprocessor system designed to run both
standard and non-standard protocols. In the High-Speed Mode (HSM), the HIP
provides applications with data rates close to those offered by the network medium. This
high-speed transfer rate is achieved by using a high-speed communication protocol
(e.g., HCP [?]). Figure 5.23 shows a block diagram of the main functional units of
the proposed HIP.
[Figure 5.23: Block diagram of HIP.]
The HIP design consists of five major subsystems: a Master Processing Unit
(MPU), a Transfer Engine Unit (TEU), a crossbar switch, and two Receive/Transmit
units (RTU-1, RTU-2). The architecture of HIP is highly parallel and uses hardware
multiplicity and pipelining techniques to achieve high-performance transfer rates. For
example, the two RTUs can be configured to transmit and/or receive data over high-speed
channels while the TEU is transferring data to/from the host. In what follows,
we describe the main tasks carried out by each subsystem.
5.3.5 Master Processing Unit (MPU)
The HIP is a master/slave multiprocessor system in which the MPU controls and manages
all the activities of the HIP subsystems. The Common Memory (CM) is a dual-port
shared memory and can be accessed by the host through the host's standard bus.
Furthermore, this memory is used to store the control programs that run on the MPU. The
MPU runs the software that provides an environment in which the two modes of operation
(HSM and NSM) can be supported, and it also executes several parallel activities
(receive/transmit from/to the host, receive and/or transmit over the D-net, and
receive/transmit over the normal network). The main tasks of the MPU are outlined
as follows.
• HIP manager: This involves configuring the subsystems to operate in a certain
configuration, and allocating and deallocating processes to HIP processors (RTUs).
Furthermore, for the NSM, the HIP manager assigns one RTU to receive and/or
transmit over the normal-speed channel in order to maintain compatibility
with the standard network and to reduce the communication latency.
• HLAN manager: This involves setting up a cluster of computers to cooperate in a
distributed manner to execute a compute-intensive application using the D-net
dedicated to the high-speed mode (HSM) of operation. The computer that owns this
application uses the D-net to distribute the decomposed subtasks over the involved
computers, synchronize their computations, and collect the results from these
computers.
• Synchronizer: This involves arranging the execution order of HIP processes on
the RTUs and the TEU such that their asynchronous parallel execution will not result
in erroneous results or deadlock scenarios.
• HLAN general management: This involves collecting information about the network
activities to guarantee certain performance and reliability requirements.
These tasks are related to network configuration, performance management,
fault detection and recovery, accounting, security, and load balancing.
Transfer Engine Unit (TEU)
The communication between the host and the HIP is based on a request/reply model.
The host is relieved from controlling all aspects of the communication process. The
initiation of a data transfer is done by the host through the Common Memory (CM),
and the completion of the transfer is signaled through an interrupt to the host. The
TEU can be implemented simply as a Direct Memory Access Controller (DMAC).
A protocol similar to that used in the VMP network adapter board [37] can be
adopted to transport messages between the host and the HIP. For example, to transmit
data, the host initiates a request by writing a Control Block (CB) into the CM of the
MPU. The CB contains pointers to the data to be sent, destination addresses, and
the type of transmission mode (HSM or NSM). The MPU then sets up the TEU
and the crossbar switch, which involves selecting one of the two Host-to-Network
Memory (HNM) modules available in each RTU. While the data is being written into
the selected HNM, the RTP of that unit can start the process of transmitting data
over the supported channels according to the type of transmission. Similar activities
are performed to receive data from the network and deliver it to the host.
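An illustrative layout of such a control block is sketched below; the field names and widths are assumptions based only on the description above (pointers to the data, a destination address, the transmission mode, and a completion indication).

    #include <stdint.h>

    /* Illustrative Control Block (CB) written by the host into the Common
     * Memory to request a transmission; names and widths are assumptions. */
    enum hip_mode { HIP_HSM, HIP_NSM };      /* high-speed or normal-speed mode */

    struct hip_control_block {
        enum hip_mode     mode;              /* requested transmission mode         */
        uint32_t          dest_addr;         /* destination host address            */
        uint32_t          data_addr;         /* location of the data in host memory */
        uint32_t          length;            /* number of bytes to transfer         */
        volatile uint32_t status;            /* completion flag written by the MPU;
                                                the host is also interrupted        */
    };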
Switch
This is a crossbar switch that provides the maximum number of simultaneous connections
among the TEU, the MPU, and the RTUs. The use of local buses in the MPU and RTUs
allows any component of these subsystems to be accessed directly through the switch.
Receive/Transmit Unit (RTU)
The main task of the RTU is to offload the host from getting involved in the process
of transmitting/receiving data over the two channels. At any given time, the RTU
can be involved in several asynchronous parallel activities: (1) receiving and/or transmitting
data over the normal-speed channel according to standard protocols; (2) receiving
and/or transmitting data over the D-net or the S-net according to a high-speed protocol.
Furthermore, the packet pipeline significantly reduces the packet processing latency
during the high-speed mode and consequently increases the overall system throughput.
Otherwise, the tasks of decoding, encoding, data encryption, and checksumming
would have to be done by the RTP, which would adversely affect the performance of the
RTU.
Host-to-ATM-Network Interface (AIB)
The host-to-ATM-network interface board implements the functions of the ATM layer
and the AAL layer. The communication protocol model for ATM networks consists of
the physical layer, the ATM layer, the AAL layer, and the upper layers. The physical
layer is based on the SONET transmission standard. The ATM layer and the AAL layer
together are equivalent to the data link layer in the ISO model [?], although the ATM
switches implement routing functions that belong to the network layer. The upper layers
above the AAL represent the protocol stack of the user and control applications, and they
could implement the functions of TCP/IP as well as other standard protocols.
The functions of the ATM layer are multiplexing/demultiplexing cells of different
VPI/VCIs, relaying cells at intermediate switches, and transporting cells between
two AAL peers. The ATM layer also implements flow control functions, but it
does not perform error control. The purpose of the AAL is to provide
the capabilities necessary to meet the user-layer data transfer requirements while
using the service of the ATM layer. This protocol provides the transport of variable-length
frames (up to 65535 bytes in length) with error detection. The AAL is further
divided into two sublayers: the segmentation and reassembly (SAR) sublayer and the
convergence sublayer (CS) [22]. The SAR sublayer performs the segmentation of data
packets from the higher layers into ATM cells at the transmitting side and
the inverse operation at the receiving side. The CS is service-dependent, and it could
perform functions such as message identification.
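A minimal sketch of the SAR segmentation step is shown below; it splits a frame into fixed-size cell payloads and leaves the per-cell header (VPI/VCI, PTI, HEC) to a separate stage. The 48-byte payload corresponds to a 53-byte cell with a 5-byte header; as noted later in this section, schemes that carry per-cell sequence numbers reduce the usable cell body (e.g., to 44 bytes).

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define ATM_CELL_PAYLOAD 48   /* 53-byte cell = 5-byte header + 48-byte payload */

    /* Placeholder output stage: a real SAR would prepend the 5-byte header and
     * hand the complete 53-byte cell to the framer. */
    static void emit_cell(const uint8_t payload[ATM_CELL_PAYLOAD])
    {
        (void)payload;
    }

    /* Segment a variable-length frame into 48-byte cell payloads, zero-padding
     * the final cell; returns the number of cells produced. */
    static size_t sar_segment(const uint8_t *frame, size_t len)
    {
        uint8_t cell[ATM_CELL_PAYLOAD];
        size_t cells = 0;

        while (len > 0) {
            size_t chunk = len < ATM_CELL_PAYLOAD ? len : ATM_CELL_PAYLOAD;
            memcpy(cell, frame, chunk);
            memset(cell + chunk, 0, ATM_CELL_PAYLOAD - chunk);
            emit_cell(cell);
            frame += chunk;
            len   -= chunk;
            cells++;
        }
        return cells;
    }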
Implementation of the ATM Layer and AAL Layer
Recently, there has been increased interest in ATM technology and in designing
ATM host interfaces. Davie at Bellcore described an ATM interface design for the
AURORA testbed environment in [23]; the interface connects DEC 5000 workstations
to SONET. Traw and Smith at the University of Pennsylvania designed a host interface
for ATM networks [24] that connects IBM RS/6000 workstations to SONET. These
two designs connect two classes of workstations to the AURORA testbed by utilizing
high-speed I/O buses and DMA mechanisms. But if the network speed increases,
the data access from the host's memory could become a bottleneck. Assuming the memory
access time is 100 ns and the memory word width is 32 bits, the memory access
rate is 320 Mbits/second. It cannot support networks with speeds greater than 320
Mbits/second.
To solve this problem, we use the concept of direct cache access and allow the
network interface to access the cache directly during the sending and receiving of
messages. This idea has been used in designing a host-to-network interface for a parallel
computer, called the MSC (Message Controller), that is used with the AP1000 (an MIMD
machine) [25]. The MSC works as a cache controller when there is no networking
activity. During message-passing scenarios, the MSC directs data from the cache
to the network interface, but it does not transfer data from the network interface to
the cache, to avoid unnecessary cache updating.
The method of direct cache access makes the host and the network interface more
tightly coupled, and thus avoids waiting for data to be updated in main memory. For
systems using a write-back policy (for instance, IBM RS/6000s), this approach
could be meaningful. Importantly, copying data from the cache is much faster than from
main memory. Assume that the cache access time is 10 ns, which is 10 times faster
than the memory access time, and that the width of the cache controller is 32 bits; then
the data transfer rate can accommodate a network speed of 3.2 Gbits/second. Even
considering the cache contention between the host's CPU and the network interface,
the cache miss ratio, and other cache access overheads, the data transfer rate can
still match the STS-12 rate (622 Mbps). With high-speed transmission lines and
high-speed networking protocols, it is possible to have a remote data access latency
comparable to that experienced in main memory data access. Then the network-based
distributed system becomes a cost-effective high-performance alternative for parallel
computing environments.
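The two rates quoted in this discussion follow directly from dividing the word width by the access time:

    \frac{32\ \text{bits}}{100\ \text{ns}} = 320\ \text{Mbit/s (main memory)}, \qquad
    \frac{32\ \text{bits}}{10\ \text{ns}} = 3.2\ \text{Gbit/s (cache)}.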
To perform the functions associated with the ATM layer and the AAL layer, the AIB
is designed to communicate with the upper-layer protocols efficiently through a shared
memory, and to move data between the host's cache/memory and the network in the
format of 53-byte ATM cells. Figure 5.24 shows the block diagram of the AIB and its
connection with the host.
[Figure 5.24: Block diagram of the interface between the AIB and the host.]
The AIB consists of two units, the Message Transmitting Unit (MTU) and
the Message Receiving Unit (MRU). Each message is segmented into fixed-size cells (53
bytes each) or reassembled from a series of ATM cells. As discussed in the previous
section, the VPI/VCI and PTI together carry the routing information. The intermediate
switches along a source-to-destination path relay cells to their destinations based on
the information carried in the cell headers. If a switch recognizes that a certain cell is
destined to its associated cluster, the switch directs the cell to the MRU of that node. The
MRU dispatches the received cells to the appropriate message-handling processes according
to the VCI carried in the header.
In existing computers, it is common for a CPU to be connected to CMMUs (Cache
Memory Management Units) through a processor bus interface. The CMMUs
are in turn interfaced to main memory through the M-Bus (memory bus). Within a CMMU
there are basically two components: one is the cache, and the other is the MMU (Memory
Management Unit), which is responsible for address translation and for moving
data between the processor and the main memory. In our design, we require the
M-Bus to be connected to the network interface.
In the AIB, both the MTU and the MRU have a DM/CA controller that is used
to move data between the network and the host. The DM/CA controller is connected to the
M-Bus. The DM/CA controller can be considered to have two parts according to
its functionality. The first part is associated with the main memory and performs
the DMA mechanism. The second part is associated with the cache and implements
the DCA (Direct Cache Access) mechanism.
On the transmitting side, the DCA controller communicates with the MMUs and
requests the data from the cache. When the DCA controller gets a cache hit, it
moves data from the CMMU to the network FIFOs. If a cache miss occurs, the
DMA controller transfers data from the main memory to the network FIFOs. On
the receiving side, the DCA controller writes data into the cache if the addressed line
is already in the cache; otherwise, the DMA mechanism is initiated to store data into
main memory. The DM/CA is not allowed to cause cache updates.
Because both the host's CPU and the network interface want to access the cache,
there may be contention over the cache control and the M-Bus, and this could
lead to performance degradation. But this contention is expected to be low because
microprocessor systems are designed with a multiple-level memory hierarchy of
register file, cache, main memory, and disk.
Message Transmitting Unit
Figure 5.25 shows the block diagram of the MTU, which is responsible for processing
message transmission.
[Figure 5.25: AIB structure (block diagram of the MTU and MRU).]
The upper layers communicate with the AIB software layers
(AAL and ATM) based on a request/response communication model. In this model,
the host writes its command (control block request) into a shared memory module
(SM). Upon receiving the commands, the Transmitting Processor (T-Processor)
performs the relevant message-sending functions associated with the AAL protocol. It
initializes the DM/CA by supplying the operation to be performed (a read operation),
the memory address, and the number of bytes to transfer. In this ATM interface, for
each message to be sent, the T-Processor commands the DM/CA controller to move
the next ATM cell body from the designated memory area. The T-Processor loads
the DM/CA instructions (commands) into the Command Buffer (see Figure 5.25), where
they are fetched by the DM/CA controller.
While the DM/CA controller moves data out, the T-Processor concurrently computes
the header for the cells and puts the header in the header FIFO. The first
4 bytes of the header are composed by the upper-layer protocols and the network
management components. The fifth byte, which is the checksum, is calculated by the
T-Processor from the first 4 bytes. These 5 bytes are written into the header
FIFO. Cell bodies are stored in the cell body FIFO. For each pair of cell header and
cell body, there is a Cell Composer that concatenates these two parts and delivers the
completed cells to an STS-3c Framer. The Framer puts the cells in a frame based
on SONET standards and transmits the frame over the network. The STS-3c Framer
provides the functions of the SONET Transmission Convergence (TC) sublayer. For the
SONET STS-3c structure, this Framer provides cell transport at 149.76 Mbps and an
information payload transport rate of 135.632 Mbps. The Framer has an 8-bit-wide
data input and a number of control signals that indicate when data is required and
when it is time to provide the start of a new cell [23]. The FIFO controllers read
the data out of the FIFOs and send it to the STS-3c Framer. If a cell sequence
number is necessary, the Cell Composer inserts the sequence number for each cell;
in this case, each cell body is 44 bytes.
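The fifth header byte described above is the standard ATM Header Error Control (HEC) byte. The sketch below shows the conventional CRC-8 computation (generator polynomial x^8 + x^2 + x + 1, with the result XORed with 0x55 as specified by ITU-T I.432) that the T-Processor, or dedicated hardware, would perform over the first four header bytes:

    #include <stdint.h>

    /* Compute the ATM HEC byte over the first four header bytes: CRC-8 with
     * generator x^8 + x^2 + x + 1 (0x07), then XOR with 0x55 per ITU-T I.432. */
    static uint8_t atm_hec(const uint8_t header[4])
    {
        uint8_t crc = 0;
        for (int i = 0; i < 4; i++) {
            crc ^= header[i];
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                                   : (uint8_t)(crc << 1);
        }
        return (uint8_t)(crc ^ 0x55);
    }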
5.3.6 Message Receiving Unit
The MRU is a microprocessor-based system that has a Receiving Processor (R-Processor),
a DM/CA controller, and other hardware circuitry. To process arriving
messages efficiently, we design the MRU to be a message-driven receiver. In other
words, the operation of the MRU is not controlled by the host, but by the received
messages. This concept is referred to as "active message dispatching": messages are
dispatched to the corresponding message-handling processes by extracting the
information carried in the cell header. The VCI field is mapped to the thread associated
with the requested service. Based on the thread, the data portion of a message is
transferred from the network FIFOs into the host's cache or memory. The data is sent
to the cache only if the addressed line is already in the cache. If the addressed line is
not in the cache, the line has not been referenced recently, so the data should be sent
to main memory. This scheme prevents the MRU from updating the cache and causing
unnecessary cache misses. The host is notified when the data transfer is completed.
The active message dispatching mechanism avoids waiting for a matching READ
issued from the host. The data is processed and stored while other computations
are performed. There are several advantages to this approach. First, it can overlap
computations and communications, because arriving messages can be transferred
into the host's memory area without interrupting the host CPU. Second, active
message dispatching does not involve the host's operating system, because message
processing is performed offline using the AIB. Consequently, message passing based
on active message dispatching results in low-latency communications.
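A minimal software model of the dispatch step is sketched below: a table indexed by the VCI holds the handler registered for each service, and the MRU (or its driver) invokes the handler directly, without a host system call. The table size, handler signature, and names are our own illustrative choices.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative active-message dispatch table: the VCI carried in the cell
     * header selects the handler registered for the requested service. */
    typedef void (*msg_handler_t)(const uint8_t *data, size_t len);

    #define VCI_TABLE_SIZE 1024

    static msg_handler_t vci_table[VCI_TABLE_SIZE];

    static void register_service(uint16_t vci, msg_handler_t handler)
    {
        if (vci < VCI_TABLE_SIZE)
            vci_table[vci] = handler;
    }

    /* Called for each reassembled message: the data is handed directly to the
     * registered handler; an unknown VCI is dropped (or reported to the host). */
    static void dispatch_message(uint16_t vci, const uint8_t *data, size_t len)
    {
        if (vci < VCI_TABLE_SIZE && vci_table[vci])
            vci_table[vci](data, len);
    }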
The architecture of the MRU is shown in Figure 5.25. Upon receiving ATM cells
from the STS-3c Framer, a Cell Splitter separates a cell header from its body. First,
the HEC field in the header is checked by a hardware component, the HEC
Checker. If an error is detected, the corresponding message is dropped. The HEC
Checker then reports this error to the R-Processor, and the R-Processor in turn
notifies the host. The upper-layer protocols implemented in the host system will
request a retransmission; the ATM layer protocol is not responsible for error recovery
or retransmission. If the message passes HEC checking, the cell sequence is reassembled
into an AAL-PDU. Then the R-Processor commands some hardware circuitry to further
perform the AAL protocol processing, such as checking the length of the message and
the CRC field in the AAL-PDU (as explained in Section 3). If the CRC calculation does
not yield a correct result, the message has been corrupted somehow. An Error Corrector
then tries to correct the error. If the attempt at error correction fails, the message is
dropped and the host is notified. Another case is that some cells are missing during
transmission; the length of the message will then not match the value in the length
field. Again, the message is dropped and a notification is sent to the host. The
communication between the host and the MRU is via the SM. Upon receiving information
from the MRU, the SM controller notifies the host by either an interrupt or a polling
scheme. If the message is received error free, the R-Processor further processes the
message according to the header information.
For flow control messages, it is necessary for the MRU to communicate with the
MTU, because the MTU is the unit that controls cell transmission. The MTU can
stop cell transmission when congestion occurs. If an arriving message carries a request
for the value of a variable in a certain distributed application, the MRU will let the
MTU fetch the value and transmit it back to the remote waiting process, because the
MTU is responsible for fetching data from the host's cache/memory and transmitting
the value to the remote node. The connectivity between the MTU and the MRU further
offloads the host from the communication task, as the AIB can handle communication
jobs more independently.
5.4 Hardware Implementations of Standard Transport Protocols
The emergence of new high-speed networks has shifted the bottleneck from the network's
bandwidth to the processing of the communication protocols. There has
been increased interest in the development of new communication subsystems
that are capable of utilizing high-speed networks. The current implementations of
the standard communication protocols are efficient when networks operate at several
Kbps. However, the performance of these implementations cannot match the bandwidth
of the high-speed networks that operate at megabit or gigabit speeds. One
research approach is to improve the implementation of the communication subsystems
that are based on the existing standard protocols [68, 69, 70]. This approach
focuses on off-loading the host from processing communication protocols, and uses
external microprocessor-based systems for this task. Parallelism in communication
protocols is also viewed as a means to boost the performance of the communication
subsystems. In this section, we analyze and compare the performance of different
TCP/IP implementations. The simulation tool OPNET [79] is used in modeling and
analyzing all the implementation techniques discussed in this section.
5.4.1 Base Model: TCP/IP Runs on the Host CPU
It has been largely agreed upon that protocol implementation plays a significant role
in the performance of communication subsystems [62, 66]. One needs to analyze
the various factors that contribute to the delay associated with data transmission
in order to identify efficient techniques for improving the implementations of standard
protocols. The analysis presented in this section uses a system that consists of two
nodes, each running at 15 MIPS. The nodes communicate with each other using a
point-to-point communication link with a bandwidth of 500 Mbps. Each node runs
TCP/IP and has a running user process. One of the nodes transmits a series of
segments to the other node. Segment generation is modeled as a Poisson process.
Each node executes approximately 300 instructions of fast-path TCP/IP [62]. It is
also assumed that these 300 instructions map to 400 RISC instructions.
In this implementation, TCP/IP runs on the host CPU, as shown in Figure 5.26.
[Figure 5.26: TCP/IP run on the host CPU: the user process resides in user space and TCP/IP in kernel space of the host memory, with a network controller and network buffer on the host network interface.]
A user process sends a message by making a system call to invoke the communication
process (TCP). The host CPU then performs a context switch between the user
process and the communication process [72]. Data to be sent is then copied from the
user memory space to the kernel memory space. The TCP/IP processes are then
executed to generate segments that encapsulate the outgoing data. A cut-through
memory management scheme is assumed; this minimizes the overhead of data
copying. In this scheme, instead of moving the whole segment between TCP and IP,
only a reference to the segment is passed [63]. Segments ready to be sent are then
copied to the network buffer. The memory used in this analysis is assumed to have an
access time of 60 ns per word (a 32-bit word). The checksumming overhead is 50
ns/byte. Note that in this model, the host CPU is involved in both data processing
and copying. Upon arrival, segments are transferred from the network buffer to the
kernel space of the host memory. Since the host CPU is used to move the data, two
memory cycles are needed for every transferred word. After the incoming segment
has been processed by TCP/IP, the data is transferred from the kernel space to the
user space. The complete sequence of events in transmitting and receiving segments
is illustrated in Figure 5.27.
For each sent (or received) segment, the CPU will make a context switch between
the user process and the communication process. Based on the analysis presented
in [64, 65], we can estimate the context switching overhead to be approximately 150
microseconds.
[Figure 5.27: Application-to-application latency. On the transmitter side a segment incurs context switching (Tc), user/kernel transfer (Tu), TCP/IP processing (Tp), checksumming (Ts), kernel/network transfer (Tn), and transmission (Tt); the receiver side incurs the corresponding Rc, Ru, Rp, Rs, and Rn. The aggregate factors are COPY = Tu + Tn + Ru + Rn, CONTEXT = Tc + Rc, CHECKSUM = Ts + Rs, TCP/IP = Tp + Rp, and TRANS = Tt.]
The end-to-end delay associated with segment transmission can be divided into
three classes of factors:
Per-byte delay factors:
1. The data copying, "COPY".
2. The checksumming, "CHECKSUM".
3. The transmission media, "TRANS".
Per-segment delay factors:
1. The protocol processing, "TCP/IP".
2. The context switching, "CONTEXT".
Per-connection delay factor: This factor is negligible for large stream-oriented
data transfers.
Figure 5.28 shows the contribution of these delay factors to the overall segment
transmission time; the cumulative end-to-end delay is illustrated for different segment
sizes. Figure 5.29 presents the throughput achieved for different segment sizes. For
a segment size of 4096 bytes, the estimated throughput is around 50 Mbps. With a
network bandwidth of 500 Mbps, the effective bandwidth is around 10% of the network
bandwidth. Therefore, this model is inefficient when the network operates at several
hundreds of Mbps. As shown in Figure 5.28, the data copying, checksumming,
and context switching delays are the main bottlenecks in this model.
[Figure 5.28: Cumulative delay vs. segment size for the base model, broken down into the TRANS, CONTEXT, COPY, CHECKSUM, and TCP/IP factors.]
[Figure 5.29: Throughput vs. segment size for the base model.]
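To make the delay accounting concrete, the following rough model (our own back-of-envelope sketch, not the OPNET simulation) adds up the per-byte and per-segment factors using the parameters stated above. The number of copies per side and the memory cycles per copied word are our own simplifying assumptions, so the totals only approximate the curves in Figures 5.28 and 5.29.

    #include <stdio.h>

    /* Rough model of the end-to-end delay factors in the base model
     * (TCP/IP on the host CPU), using the parameters from the text. */
    int main(void)
    {
        const double mem_word_ns   = 60.0;   /* 60 ns per 32-bit word            */
        const double checksum_ns   = 50.0;   /* 50 ns per byte, per side         */
        const double link_bps      = 500e6;  /* 500 Mbps point-to-point link     */
        const double mips          = 15e6;   /* 15 MIPS per node                 */
        const double tcpip_instr   = 400.0;  /* fast-path TCP/IP, RISC instr.    */
        const double ctx_switch_us = 150.0;  /* context switch per side          */

        for (int seg = 512; seg <= 4096; seg *= 2) {
            double words    = seg / 4.0;
            /* assume two copies per side (user<->kernel, kernel<->network) and
             * two memory cycles per word when the CPU moves the data */
            double copy_us  = 4.0 * 2.0 * mem_word_ns * words / 1e3;
            double csum_us  = 2.0 * checksum_ns * seg / 1e3;
            double trans_us = seg * 8.0 / link_bps * 1e6;
            double tcp_us   = 2.0 * tcpip_instr / mips * 1e6;
            double ctx_us   = 2.0 * ctx_switch_us;
            double total_us = copy_us + csum_us + trans_us + tcp_us + ctx_us;
            printf("%4d bytes: ~%.0f us end-to-end, ~%.0f Mbps\n",
                   seg, total_us, seg * 8.0 / total_us);
        }
        return 0;
    }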
5.4.2 TCP/IP Runs Outside the Host
The context switch overhead can be drastically reduced (if not eliminated) if the
protocols run on a dedicated processor outside the host CPU. Data copying can
also be decreased by using DMA instead of the host CPU. Off-loading the processing
of communication protocols from the host CPU increases the CPU time available
to the user processes.
Figure 5.30 shows a block diagram of a front-end processor that runs TCP/IP and is
attached via a high-speed interface to the host.
[Figure 5.30: TCP/IP-based communication subsystem: a protocol processor running TCP/IP, with its own protocol buffer, checksum controller, and DMA controller, sits between the host (CPU and host buffer) and the network interface (network buffer and network controller).]
Data is copied between the network
and the protocol buffers using DMA. DMA is also used to copy data between the
protocol and host buffers. Checksumming is done by additional VLSI circuitry on
the communication processor. The checksum is computed while data is being copied
between the protocol and network buffers. This configuration leads to a significant
improvement over the base model, as illustrated in Figures 5.31 and 5.32. For a
segment size of 4096 bytes, the throughput is around 220 Mbps, in contrast to only
50 Mbps using the base model. Also, Figure 5.31 shows that the data copying factor
increases with the segment size and becomes the main bottleneck.
[Figure 5.31: Cumulative delay vs. segment size for the TCP/IP-based communication subsystem (TRANS, COPY, and TCP/IP factors).]
[Figure 5.32: Throughput vs. segment size for the TCP/IP-based communication subsystem.]
5.4.3 Interleaved Memory Communication Subsystem
Here, an efficient and simple concept is used to reduce the data copying delay. The
approach is based on the idea of memory interleaving to implement the subsystem's
buffers. In this approach, the memory is partitioned into a number of independent
modules. There are several configurations of memory interleaving; the one used in
this model is called low-order interleaving. In this configuration, the low-order bits
of the memory address are used to select the memory module, while the higher-order
bits are used to select (read/write) the data within a particular module. Assume the
memory is partitioned into $k$ modules; address lines $a_0$ through $a_{m-1}$ ($k = 2^m$) are used to
select a particular module, while the remaining address lines ($a_m$ through $a_{n-1}$) are used to
select the data word within the module. The $k$ modules, each of size $2^{n-m}$ words,
give a total memory size of $2^n$ words. Figure 5.33 shows an 8-way interleaved memory
system.
[Figure 5.33: An 8-way interleaved memory system: a memory controller and address decoder select among modules 0 through 7 attached to a common data bus.]
Let $t_a$ be the memory access time; then the total time $T_{sum}$ needed to access $l$
words in this memory system is

    T_{sum} = t_a + (l-1)\,\frac{t_a}{k}.

For large $l$,

    T_{sum} \simeq (l-1)\,\frac{t_a}{k}.

Then the average access time of this memory system, $T_{access}$, is

    T_{access} = \frac{T_{sum}}{l} \simeq \frac{t_a}{k},

and the bandwidth is

    BW = \frac{k}{t_a}\ \text{words/s}.

Figure 5.34 illustrates the sequence of memory accesses in the interleaved memory
system.
[Figure 5.34: Memory access sequence in an 8-way interleaved memory: accesses to modules M0 through M7 are overlapped, so after the initial access time $t_a$ a new word completes every $t_a/8$.]
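As a quick check using the 60 ns per-word access time assumed in the base model and $k = 8$ modules, the interleaved buffer provides roughly

    BW = \frac{k}{t_a} = \frac{8}{60\ \text{ns}} \approx 1.3 \times 10^{8}\ \text{words/s} \approx 4.3\ \text{Gbit/s (32-bit words)}, \qquad
    T_{access} \approx \frac{60\ \text{ns}}{8} = 7.5\ \text{ns},

which is consistent with the factor-of-eight reduction in buffer transfer time used in the performance analysis below.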
Performance Analysis
In this model, both the network and protocol buffers are configured as 8-way
interleaved memories, as shown in Figure 5.35. The main improvement lies in the
reduction of the data copying overhead: the segment transfer delay between the network
and protocol buffers is decreased. The time needed for this operation is approximately
1/8 of the time needed in the previous model. The elapsed time to move data between
the protocol and host buffers remains the same as in the previous model. Figures 5.36
and 5.37 show the estimated cumulative end-to-end delay and throughput for this model.
The achievable throughput (see Figure 5.37) is approximately 340 Mbps using a segment
size of 4096 bytes. On the other hand, for a segment size of 512 bytes, the throughput
achieved is around 115 Mbps.
[Figure 5.35: Interleaved memory communication subsystem: the protocol processor's protocol buffer and the network buffer are implemented as 8-way interleaved memories.]
[Figure 5.36: Cumulative delay vs. segment size for the interleaved memory subsystem (TRANS, COPY, and TCP/IP factors).]
[Figure 5.37: Throughput vs. segment size for the interleaved memory subsystem.]
5.4.4 TCP/IP Runs on a Multiprocessor System
Applying parallelism in the design of communication subsystems is an important approach
to achieving the high performance needed in today's distributed computing environments.
Zitterbart [61] discussed the different levels and types of parallelism that
are typically applied to communication subsystem design. We adopt a hybrid parallelism
approach, which is based on the layer and packet levels. In layer parallelism,
different layers of the hierarchical protocol stack are executed in parallel, whereas in
packet parallelism, a pool of processing units is used to process incoming (and outgoing)
packets concurrently.
5.4.5 Parallel Architectural Design
The parallel implementation approach presented in this section is similar to that
discussed in [66]. In the design shown in Figure 5.38, we use one processor (IP proc)
to handle the IP processing, and four transport processors (proc 1, proc 2,
proc 3, and proc 4) to handle the TCP processing. On the arrival
of a segment, IP proc executes the IP protocol. Then one of the transport processors is
selected, according to a round-robin scheduling policy, to run TCP for the arrived
segment. Therefore, multiple segments can be processed concurrently using different
transport processors. Since the IP processing is approximately 20% of the total
TCP/IP processing time [62], four processors are utilized to run TCP and one
processor to run IP (see Figure 5.38).
[Figure 5.38: Parallel communication subsystem: an IP processor and four transport processors (proc 1 through proc 4) with 8-way interleaved buffers, a shared memory and shared memory processor (S_mem Proc), DMA controllers, and the host and network interfaces.]
[Figure 5.39: The sequence of segment processing: while IP proc processes successive segments, segments are handed to proc 1 through proc 4 in round-robin order and processed concurrently over a processing cycle T.]
The other modules of this design are:
• Shared Memory: This memory block is shared by the transport processors
to keep the context records of the established connections.
• Shared Memory Processor (S_mem Proc): This processor has two main
tasks: shared memory management and acknowledgment of the received segments.
• DMA Controllers: DMA controllers are used to move data between the
different buffers: segment transfers between the network buffer and the IP proc
buffer, segment transfers between the IP proc buffer and the transport processor
buffers, and data transfers between the transport processor buffers and the host
buffer.
Context Records
A stream of transmitted data segments has an inherent sequential ordering structure.
For this reason, some of the connection state variables should be stored in the shared
memory module. In TCP, we identify the elements of the Transmission Control Block
(TCB) that should be shared among the transport processors. These elements are kept
in a context record that is maintained in the shared memory for each established
connection. Each context record consists of the following fields (see Figure 5.40):
• Source and destination port numbers: Used as an ID for the established
connection.
• Receive_next: The sequence number of the next expected segment. This
variable is used to generate the acknowledgment for the received segments.
• Send_unacked: The highest sequence number of segments that have been sent but
not yet acknowledged. The acknowledgment field of the arrived segments is
used to update this variable.
• Send_next: The sequence number of the next segment to be transmitted.
• Local window information: Corresponds to the reserved space allocated
to this connection.
• Remote window information.
Each of the subsequent entries in the context record shown in Figure 5.40
corresponds to a received segment. It consists of three fields: a pointer to the
starting address of the segment, the segment size, and the initial sequence number of
the segment. For each segment arrival, the transport processor accesses the shared
memory to manipulate the connection state variables according to the information
carried in the arrived segment. It also adds an entry that contains the three fields
mentioned above. The S_mem Proc periodically accesses the context records of the
established connections to look for contiguous blocks in the received segments. It
updates "Receive_next" accordingly and appends the required acknowledgment to a
segment in the reverse direction. If there is no such segment, it generates a separate
acknowledgment.
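A C rendering of such a context record might look like the following; the scalar fields follow the description above and Figure 5.40, while the fixed-size array of per-segment entries is our own simplification.

    #include <stdint.h>

    #define MAX_PENDING_SEGMENTS 64   /* illustrative bound on per-connection entries */

    /* One per-segment entry: where the segment is buffered, its size, and its
     * initial sequence number. */
    struct segment_entry {
        uint8_t *start_addr;          /* pointer to the buffered segment data */
        uint32_t size;                /* segment size in bytes                */
        uint32_t seq_num;             /* initial sequence number              */
    };

    /* Context record kept in shared memory for each established connection
     * (the subset of the TCB shared by the transport processors). */
    struct context_record {
        uint16_t src_port;            /* connection identifier                */
        uint16_t dst_port;
        uint32_t receive_next;        /* next expected sequence number        */
        uint32_t send_unacked;        /* sent but not yet acknowledged        */
        uint32_t send_next;           /* next sequence number to transmit     */
        uint32_t local_window;        /* space reserved for this connection   */
        uint32_t remote_window;       /* window advertised by the peer        */
        uint32_t num_entries;         /* per-segment entries currently in use */
        struct segment_entry entries[MAX_PENDING_SEGMENTS];
    };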
Due to the existence of this shared resource, a conflict can arise between the transport
processors in accessing the shared memory. If more than one processor tries to access
the shared memory (for a write operation), only one is granted access and the others
must wait.
[Figure 5.40: Context record layout: source and destination port numbers, Send_Next, Send_Unacked, Receive_Next, remote window, and local window information, followed by per-segment entries of (start address pointer, size, sequence number).]
Performance Analysis
In order to analyze the throughput offered by the parallel TCP/IP implementation, an estimate of the contention overhead caused by accessing the shared memory must be computed. The approach used here is similar to that presented in [6]. Assume that the processing cycle is T (as shown in Figure ??), which is approximately 240 instructions (the number of instructions used to process TCP). About 15 instructions are used to access the shared memory for the processing of every arrived segment. With 4 processors, the shared memory accesses amount to 60 (4 x 15) instructions, or 25% (60/240) of T. Therefore, only 25% of the shared memory accesses are expected to undergo contention, and the contention penalty is then estimated to be about four additional instructions.
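The estimate above is plain arithmetic and can be reproduced in a few lines. The C fragment below simply restates the numbers given in the text (a 240-instruction processing cycle, 15 shared-memory instructions per segment, 4 processors); it is illustrative only.

    #include <stdio.h>

    int main(void)
    {
        const double cycle  = 240.0;   /* instructions to process one TCP segment */
        const double shared = 15.0;    /* shared-memory instructions per segment  */
        const int    procs  = 4;

        double shared_total = procs * shared;        /* 60 instructions             */
        double contention_p = shared_total / cycle;  /* 0.25 of the cycle T         */
        double penalty      = contention_p * shared; /* expected extra instructions */

        printf("fraction of T spent in shared memory: %.2f\n", contention_p);
        printf("estimated contention penalty: %.0f instructions per segment\n",
               penalty);
        return 0;
    }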
Figure 5.41 shows the throughput achieved using this implementation approach for different segment sizes. For a segment size of 4096 bytes, the achieved throughput is approximately 375 Mbps. On the other hand, for a segment size of 512 bytes, the throughput is around 195 Mbps. Since this approach increases performance by parallelizing the TCP processing, it is more effective when the TCP processing time plays a more significant role in the overall end-to-end delay, that is, when segment sizes are small.
Figure 5.41: Throughput vs. Segment Size. Throughput (Mbps) of the parallel implementation plotted against segment sizes from 512 to 4096 bytes.
5.4.6 Discussion
In this section a performance analysis of four different TCP/IP implementation models has been presented. A simulation model has been built to estimate the performance measures. We have also identified the contribution of the different delay factors to the overall end-to-end data transmission delay. We have introduced an efficient approach to implementing the communication subsystem's buffer based on the memory interleaving concept. We have also introduced a parallel TCP/IP implementation; in this approach, multiple processors are employed to process segments concurrently.
Figure 5.42 shows the estimated throughput for the four models discussed in this section. In the memory-interleaved communication subsystem (the third model), the estimated throughput is around 340 Mbps using a segment size of 4096 bytes, compared to a throughput of 220 Mbps in the second implementation model, an increase of 55%. On the other hand, for a segment size of 512 bytes, the achieved throughput is around 115 Mbps, in contrast to 98 Mbps using the second model, which corresponds to an increase of only 17%. This indicates that the interleaved memory communication subsystem approach provides a significant performance improvement for larger segment sizes.
Figure 5.42: Throughput vs. Segment Size for the four implementation models (MODEL-1 through MODEL-4), plotted for segment sizes from 512 to 4096 bytes.
Using the parallel implementation approach (the fourth model), the throughput achieved is approximately 375 Mbps for a segment size of 4096 bytes, an increase of only 10% over the interleaved memory implementation approach. On the other hand, for a segment size of 512 bytes, the throughput is around 195 Mbps, which represents an increase of about 70%.
Therefore, as the segment size increases, the throughput difference between the interleaved memory approach and the parallel implementation approach diminishes (see Figure 5.42). On the other hand, as the segment size increases, the throughput difference between the interleaved memory approach and the second approach grows. In general, using the interleaved memory concept to implement the communication subsystem's buffer can significantly improve the data copying delay, which is the main bottleneck in the overall end-to-end data transmission delay for stream-oriented, bulk data transfers where large segment sizes are used.
Bibliography
[1] E. Biagioni, E. Cooper, R. Sansom, "Designing a Practical ATM LAN," IEEE Network, vol. 7, no. 2, pp. 32-39.
[2] Many, "Network Compatible ATM for Local Network Applications - Phase 1, V1.01," 19 Oct 1992.
[3] D. J. Greaves, D. McAuley, "Private ATM Networks," IFIP Transactions C (Communication Systems), vol. C-9, pp. 171-181.
[4] G. J. Armitage, K. M. Adams, "Prototyping an ATM Adaptation Layer in a Multimedia Terminal," International Journal of Digital and Analog Communication Systems, vol. 6, no. 1, pp. 3-14.
[5] Fouad A. Tobagi, Timothy Kwok, "Fast Packet Switch Architectures and the Tandem Banyan Switching Fabric," High-Capacity Local and Metropolitan Area Networks, pp. 311-344.
[6] Fouad A. Tobagi, "Fast Packet Switch Architectures for Broadband Integrated Services Digital Networks," Proceedings of the IEEE, vol. 78, no. 1, pp. 133-167, January 1990.
[7] Jean-Yves Le Boudec, "The Asynchronous Transfer Mode: A Tutorial," ?????.
[8] H. Ahmadi, et al., "A High Performance Switch Fabric for Integrated Circuit and Packet Switching," in Proceedings of INFOCOM'88, New Orleans, LA, March 1988, pp. 9-18.
[9] C. P. Kruskal and M. Snir, "The Performance of Multistage Interconnection Networks for Multiprocessors," IEEE Trans. Computers, vol. C-32, no. 12, pp. 1091-1098, Dec. 1983.
[10] M. Kumar and J. R. Jump, "Performance of Unbuffered Shuffle-Exchange Networks," IEEE Trans. Computers, vol. C-35, no. 6, pp. 573-577, June 1986.
[11] V. E. Benes, "Optimal Rearrangeable Multistage Connecting Networks," Bell Systems Technical Journal, vol. 43, no. 7, pp. 1641-1656, July 1964.
[12] K. E. Batcher, "Sorting Networks and Their Applications," in AFIPS Proceedings of the 1968 Spring Joint Computer Conference, vol. 32, pp. 307-314.
[13] H. Suzuki, H. Nagano, T. Suzuki, T. Takeuchi, S. Iwasaki, "Output-buffer Switch Architecture for Asynchronous Transfer Mode," in Proceedings of the International Conference on Communications, Boston, MA, June 1989, pp. 4.1.1-4.1.5.
[14] Y. Yeh, M. G. Hluchyj, A. S. Acampora, "The Knockout Switch: A Simple, Modular Architecture for High-Performance Packet Switching," IEEE Journal on Selected Areas in Communications, vol. SAC-5, no. 8, October 1987.
[15] H. Kanakia and D. R. Cheriton, "The VMP Network Adapter Board (NAB): High Performance Network Communication for Multiprocessors," Proc. of the SIGCOMM '88 Symp. on Communications Architectures and Protocols, pp. 175-187, ACM, August 1988.
[16] E. A. Arnould, et al., "The Design of Nectar: A Network Backplane for Heterogeneous Multicomputers," Proc. Third Conf. on Architectural Support for Programming Languages and Operating Systems, Boston, MA, April 1989.
[17] H. T. Kung, et al., "Network-Based Multicomputers: An Emerging Parallel Architecture," Supercomputing Conf., November 1991.
[18] G. Finn, "An Integration of Network Communication With Workstation Architecture," ACM Computer Communication Review, Vol. 21, No. 5, October 1991.
[19] H. T. Kung, "Gigabit Local Area Networks: A System Perspective," IEEE Communication Magazine, pp. 79-89, April 1992.
[20] W. J. Dally and C. L. Seitz, "Deadlock-free Message Routing in Multiprocessor Interconnection Networks," Computer Science Department, California Institute of Technology, Technical Report 5231:TR:86, 1986.
[21] H. Sullivan, et al., "A Large Scale Homogeneous Machine," Proc. 4th Annual Symposium on Computer Architecture, 1977, pp. 105-124.
[22] N. K. Cheung, "The Infrastructure for Gigabit Computer Networks," IEEE Communication Magazine, April 1992, pp. 60-68.
[23] B. S. Davie, "A Host-Network Interface Architecture for ATM," Proc. ACM SIGCOMM '91, Zurich, September 1991.
[24] C. B. S. Traw and J. M. Smith, "A High Performance Host Interface for ATM Networks," Proc. ACM SIGCOMM '91, Zurich, September 1991.
[25] T. Shimizu, et al., "Low-latency Message Communication Support for the AP1000," 1992 ACM, pp. 288-297.
[26] J. N. Giacopelli, et al., "Sunshine: A High-Performance Self-Routing Broadband Packet Switch Architecture," IEEE Journal on Selected Areas in Communications, Vol. 9, No. 8, October 1991, pp. 1289-1298.
[27] M. Akata, et al., "A 250 MHz 32 x 32 CMOS Crosspoint LSI with ATM Switching Function," NEC Research Journal, 1991.
[28] A. Pattavina, "Multichannel Bandwidth Allocation in a Broadband Packet Switch," IEEE Journal on Selected Areas in Communications, Dec. 1988, pp. 1489-1499.
[29] Peter A. Steenkiste, et al., "A Host Interface Architecture for High-Speed Networks," High Performance Networking, IV (C-14), 1993 IFIP, pp. 31-46.
[30] Dana S. Henry and Christopher F. Joerg, "A Tightly-Coupled Processor-Network Interface," ASPLOS V, 1992 ACM, ?
[31] T. F. La Porta and M. Schwartz, "Architectures, Features, and Implementation of High-Speed Transport Protocols," IEEE Network Magazine, pp. 14-22, May 1991.
[32] Z. Haas, "A Communication Architecture for High-Speed Networking," IEEE INFOCOM, San Francisco, pp. 433-441, June 1990.
[33] A. Tantawy, H. Hanafy, M. E. Zarky and Gowthami Rajendran, "Toward A High Speed MAN Architecture," IEEE International Conference on Communication (ICC'89), pp. 619-624, 1989.
[34] O. Menzilcioglu and S. Schilck, "Nectar CAB: A High-Speed Network Processor," Proceedings of the International Conference on Distributed Systems, pp. 508-515, July 1991.
[35] E. C. Cooper, P. A. Steenkiste, R. D. Sansom, and B. D. Zill, "Protocol Implementation on the Nectar Communication Processor," Proceedings of the SIGCOMM Symposium on Communications Architecture and Protocols, pp. 135-144, August 1990.
[36] H. T. Kung, et al., "Parallelizing a New Class of Large Applications over High-speed Networks," Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 167-177, April 1991.
[37] H. Kanakia and D. R. Cheriton, "The VMP Network Adapter Board: High-performance Network Communication for Multiprocessors," Proceedings of the SIGCOMM Symposium on Communications Architectures and Protocols, pp. 175-187, August 1988.
[38] D. R. Cheriton and C. L. Williamson, "VMTP as the Transport Layer for High-Performance Distributed Systems," IEEE Communication Magazine, pp. 37-44, June 1989.
[39] G. Chesson, "The Protocol Engine Design," Proceedings of the Summer 1987 USENIX Conference, pp. 209-215, November 1987.
[40] XTP 3.4, "Xpress Transfer Protocol Definition - 1986 Revision 3.4," Protocol Engines Inc., July 1989.
[41] W. T. Strayer, B. J. Dempsey and A. C. Weaver, XTP: The Xpress Transfer Protocol, Addison Wesley, 1992.
[42] M. S. Atkins, S. T. Chanson and J. B. Robinson, "LNTP - An Efficient Transport Protocol For Local Area Networks," Proceedings of Globecom'88, pp. 2241-2246, 1988.
[43] W. S. Lai, "Protocols for High-Speed Networking," Proceedings of the IEEE INFOCOM, San Francisco, pp. 1268-1269, June 1990.
[44] D. D. Clark, M. L. Lambert and L. Zhang, "NETBLT: A High Throughput Protocol," Proceedings of SIGCOMM'87, Computer Communications Review, Vol. 17, No. 5, pp. 353-359, 1987.
[45] N. Jain, M. Schwartz and T. R. Bashkow, "Transport Protocol Processing at GBPS Rates," Proceedings of the SIGCOMM Symposium on Communications Architecture and Protocols, pp. 188-198, August 1990.
[46] D. D. Clark and D. L. Tennenhouse, "Architectural Considerations for a New Generation of Protocols," Proceedings of the ACM SIGCOMM Symposium on Communications Architecture and Protocols, pp. 200-208, September 1990.
[47] M. Zitterbart, "High-Speed Transport Components," IEEE Network Magazine, pp. 54-63, January 1991.
[48] A. N. Netravali, W. D. Roome, and K. Sabnani, "Design and Implementation of a High-Speed Transport Protocol," IEEE Trans. on Communications, pp. 2010-2024, November 1990.
[49] A. S. Tanenbaum, Computer Networks, 2nd Edition, Prentice-Hall, 1988.
[50] M. Schwartz, Telecommunication Networks: Protocols, Modelling and Analysis, Addison Wesley, 1988.
[51] D. D. Clark et al., "An Analysis of TCP Processing Overhead," IEEE Communications Magazine, pp. 23-29, June 1989.
[52] S. Heatley and D. Stokesberry, "Analysis of Transport Measurements Over Local Area Network," IEEE Communications Magazine, pp. 16-22, June 1989.
[53] L. Zhang, "Why TCP Timers Don't Work Well," Proceedings of the ACM SIGCOMM Symposium on Communications Architecture and Protocols, pp. 397-405, 1986.
[54] C. Partridge, "How Slow Is One Gigabit Per Second?" Computer Communication Review, Vol. 20, No. 1, pp. 44-52, January 1990.
[55] W. A. Doeringer et al., "A Survey of Light-Weight Transport Protocols for High-Speed Networks," IEEE Trans. Communications, Vol. 38, No. 11, pp. 2025-2039, November 1990.
[56] H. T. Kung, "Gigabit Local Area Networks: A Systems Perspective," IEEE Communications Magazine, pp. 79-89, April 1992.
[57] I. Richer, "Gigabit Network Applications," Proceedings of the IEEE INFOCOM, San Francisco, p. 329, June 1990.
[58] C. E. Catlett, "In Search of Gigabit Applications," IEEE Communications Magazine, pp. 42-51, April 1992.
[59] J. S. Turner, "Why We Need Gigabit Networks," Proceedings of the IEEE INFOCOM, pp. 98-99, 1989.
[60] E. Biagioni, E. Cooper and R. Sansom, "Designing a Practical ATM LAN," IEEE Network, pp. 32-39, March 1993.
[61] M. Zitterbart, "Parallelism in Communication Subsystems," IBM Research Report RC 18327, Sep. 1992.
[62] D. Clark, V. Jacobson, J. Romkey, and H. Salwen, "An Analysis of TCP Processing Overhead," IEEE Communications Magazine, Vol. 27, pp. 23-29, June 1989.
[63] C. M. Woodside and J. R. Montealegre, "The Effect of Buffering Strategies on Protocol Execution Performance," IEEE Trans. on Communications, Vol. 37, pp. 545-553, June 1989.
[64] E. Mafla and B. Bhargava, "Communication Facilities for Distributed Transaction-Processing Systems," IEEE Computer, pp. 61-66, August 1991.
[65] M. S. Atkins, S. T. Chanson, and J. B. Robinson, "LNTP - An Efficient Transport Protocol for Local Area Networks," Proceedings of Globecom'88, Vol. 2, pp. 705-710, 1988.
[66] N. Jain, M. Schwartz, and T. Bashkow, "Transport Protocol Processing at GBPS Rates," Proceedings of ACM SIGCOMM'90, pp. 188-199, August 1990.
[67] H. Kanakia and D. Cheriton, "The VMP Network Adaptor Board (NAB): High-Performance Network Communication for Multiprocessors," Proceedings of ACM SIGCOMM'88, pp. 175-187, August 1988.
[68] E. Rutsche and M. Kaiserswerth, "TCP/IP on the Parallel Protocol Engine," Proceedings of the IFIP Fourth International Conference on High Performance Networking, pp. 119-134, Dec. 1992.
[69] K. Maly, S. Khanna, R. Mukkamala, C. Overstreet, R. Yerraballi, E. Foudriat, and B. Madan, "Parallel TCP/IP for Multiprocessor Workstations," Proceedings of the IFIP Fourth International Conference on High Performance Networking, pp. 103-118, Dec. 1992.
[70] O. G. Koufopavlou, A. Tantawy, and M. Zitterbart, "Analysis of TCP/IP for High Performance Parallel Implementations," Proceedings of the 17th Conference on Local Computer Networks, pp. 576-585, Sep. 1992.
[71] K. Hwang and F. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill, 1984.
[72] D. E. Comer and D. L. Stevens, Internetworking with TCP/IP, Vol. I: Design, Implementation, and Internals, Prentice-Hall, 1991.
[73] R. M. Sanders and A. Weaver, "The XTP Transfer Protocol (XTP) - A Tutorial," ACM Computer Communication Review, Vol. 20, pp. 67-80, Oct. 1990.
[74] D. D. Clark, M. Lambert, and L. Zhang, "NETBLT: A High Throughput Transport Protocol," Proceedings of ACM SIGCOMM'87, pp. 353-359, August 1987.
[75] W. A. Doeringer, D. Dykeman, M. Kaiserswerth, B. Meister, H. Rudin, and R. Williamson, "A Survey of Light-Weight Transport Protocols for High-Speed Networks," IEEE Trans. on Communications, Vol. 38, pp. 2025-2039, Nov. 1990.
[76] T. F. La Porta and M. Schwartz, "Architectures, Features, and Implementation of High-Speed Transport Protocols," IEEE Network Magazine, pp. 14-22, May 1991.
[77] H. Meleis and D. Serpanos, "Designing Communication Subsystems for High-Speed Networks," IEEE Network Magazine, pp. 40-46, July 1992.
[78] T. La Porta and M. Schwartz, "A High-Speed Protocol Parallel Implementation: Design and Analysis," Proceedings of the IFIP Fourth International Conference on High Performance Networking, pp. 135-150, Dec. 1992.
[79] A. Cohen et al., OPNET Modeling Manual, MIL 3, Inc., 1993.
Chapter 7
Remote Procedure Calls
7.1 Introduction
Remote Procedure Calls (RPC) are one paradigm for expressing control and data transfers in a distributed system. As the name implies, a remote procedure call invokes a procedure on a machine that is remote from where the call originates. In its basic form, an RPC identifies the procedure to be executed, the machine it is to be executed on, and the arguments required. The application of this RPC model results in a client/server arrangement where the client is the application that issued the call and the server is the process that handles the call.
The RPC mechanism differs from the general IPC model in that the processes are typically on remote machines. RPC offers a simple means for programmers to write software that utilizes system-wide resources (including processing power) without having to deal with the tedious details of network communication. It offers the programmer an enhanced and very powerful version of the most basic programming paradigm, the procedure call. Each RPC system has five main components: compile-time support, the binding protocol, the transport protocol, the control protocol, and the data representation[3]. In order to implement a reliable and efficient RPC mechanism, there are several issues that the designer must address in each component. These issues are the semantics of a call in the presence of a computer or communication failure, the semantics of address-containing arguments in the absence of a shared memory space, the integration of remote procedure calls into existing programming systems, binding, server management, transport protocols for the transfer of data and control between the caller and the callee, data integrity and security, and error handling[1]. This chapter addresses the RPC mechanism and the requirements to produce an efficient, easy to use, and semantically transparent mechanism for distributed applications.
7.2 RPC Basic Steps
RPC may be viewed as a special case of the general message-passing model of interprocess communication[?]. Message-based IPC involves a process sending a message to another process on another machine, but it does not necessarily need to be synchronized with either the sender or the receiver process. Synchronization is an important aspect of RPC, since the mechanism models the local procedure call: RPC passes parameters in one direction to the process that will execute the procedure, the calling process is blocked until execution is complete, and the results are returned to the calling process. Each RPC system strives to produce a mechanism that is syntactically and semantically transparent to the different languages being used to develop distributed applications. A remote call is syntactically transparent when its syntax is exactly the same as the local one, and semantically transparent when the semantics of the call are the same as the local one. Syntactic transparency is achievable, but total semantic transparency is a challenging task[?].
When a remote procedure call is executed, the steps that the call takes are illustrated in Figure 7.1[?]. The user process makes a normal local procedure call, which invokes a client-stub procedure that corresponds to the actual procedure call. The client stub packs the calling parameters into a message and passes it to the transport protocol. The transport protocol transmits the message to the server over the communication network. The server's transport protocol receives the message and passes it to the server stub that corresponds to the procedure being invoked. The server stub unpacks the calling parameters and then calls the procedure that needs to be executed. Once execution is complete, the response message is packed with the results returned by the procedure and sent back to the client via the server's transport protocol. The message is received by the client's transport protocol, which passes it to the client stub to unpack the results and return control to the local call with the results in the parameters of the procedure call[?].
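The client-stub side of these steps can be outlined in a few lines of C. In the sketch below, transport_send(), transport_recv(), marshal_int32(), unmarshal_int32() and the procedure identifier are hypothetical names introduced only to illustrate the control flow; they do not refer to any particular RPC package.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helpers standing in for the transport protocol and the
     * marshaling routines. */
    int     transport_send(const void *msg, size_t len);
    int     transport_recv(void *buf, size_t len);        /* blocks until reply */
    size_t  marshal_int32(uint8_t *buf, int32_t v);
    int32_t unmarshal_int32(const uint8_t *buf);

    /* Client stub for a remote procedure  int add(int a, int b).  The caller
     * sees an ordinary local call and blocks here until the result arrives. */
    int add(int a, int b)
    {
        uint8_t req[64], rep[64];
        size_t off = 0;

        off += marshal_int32(req + off, 1);   /* procedure identifier (assumed) */
        off += marshal_int32(req + off, a);   /* pack the calling parameters    */
        off += marshal_int32(req + off, b);

        transport_send(req, off);             /* hand the message to the transport */
        transport_recv(rep, sizeof(rep));     /* wait for the response message     */

        return unmarshal_int32(rep);          /* unpack result, return locally     */
    }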
In the process described above, the crucial point to appreciate is that despite the occurrence of a large number of complicated activities (most of which have been shown in simplified fashion), these are entirely transparent to the client procedure, which invokes the remotely executed procedure as if it were a local one. Contrast this with a client that must perform a remote function without the use of RPCs. Such a program would minimally have to worry about the syntax and implementation details of creating and communicating over sockets, ensuring that such communication is reliable, performing error handling, handling network byte order during transmission, and breaking up voluminous data into smaller pieces for transmission because of the limitations of data transmission using sockets. Such a comparison shows the importance of RPC to a programmer wishing to develop distributed applications in a simpler, more convenient, and less error-prone fashion.
Figure 7.1: Main steps of an RPC. On the caller machine, the user makes a local call, the user-stub packs the arguments, and the RPCRuntime transmits the call packet over the network; on the callee machine, the RPCRuntime receives the packet, the server-stub unpacks the arguments and calls the server, which does the work. The results are packed into a result packet, transmitted back, unpacked by the user-stub, and returned to the waiting caller through the importer/exporter interfaces.
7.3 RPC Design Issues
7.3.1 Client and Server Stubs
One of the integral parts of the RPC mechanism is the client and server stubs. The purpose of the client and server stubs is to manipulate the data contained in the request and response messages so that it is suitable for transmission over the network and for use by the receiving process. The process of taking the arguments and results of a procedure call, assembling them for transmission across the network, and disassembling them on arrival is known as marshaling[4]. In a remote call, the client packs the calling arguments into a format suitable for transmission and unpacks the result parameters in the response message after the complete execution of the procedure. The server unpacks the parameter list of the request message and packs the result parameters into the response message.
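As a concrete illustration, the following C sketch marshals and unmarshals the arguments of a call that takes a 32-bit integer and a character string. The on-the-wire layout (big-endian integers and a length-prefixed string) is an assumption made for the example, loosely in the spirit of XDR-style encodings, not the format of any particular RPC system.

    #include <arpa/inet.h>   /* htonl(), ntohl() */
    #include <stdint.h>
    #include <string.h>

    /* Client-stub side: pack the arguments into a contiguous request buffer. */
    size_t marshal_args(uint8_t *buf, int32_t account, const char *name)
    {
        uint32_t acct_net = htonl((uint32_t)account);       /* network byte order */
        uint32_t len      = (uint32_t)strlen(name);
        uint32_t len_net  = htonl(len);                     /* length prefix      */
        size_t off = 0;

        memcpy(buf + off, &acct_net, 4); off += 4;
        memcpy(buf + off, &len_net,  4); off += 4;
        memcpy(buf + off, name, len);    off += len;
        return off;                                         /* bytes to transmit  */
    }

    /* Server-stub side: unpack the same layout before invoking the procedure. */
    void unmarshal_args(const uint8_t *buf, int32_t *account,
                        char *name, size_t name_max)
    {
        uint32_t acct_net, len_net;
        memcpy(&acct_net, buf, 4);
        memcpy(&len_net, buf + 4, 4);
        *account = (int32_t)ntohl(acct_net);

        size_t len = ntohl(len_net);
        if (len >= name_max)
            len = name_max - 1;                             /* truncate defensively */
        memcpy(name, buf + 8, len);
        name[len] = '\0';
    }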
7.3.2 Data Representation
The architecture of the machine on which the remote procedure is executed need not necessarily be the same as that of the client's machine. This may mean that the two machines support different character representations, such as ASCII or EBCDIC. Even if both machines have the same data representations, they may differ in the byte lengths used to represent different data structures. For example, an integer may be represented as four bytes on the client and eight bytes on the server. To further complicate matters, some machines use different byte orderings for data structures.
To overcome the above-mentioned incompatibilities in data representation on different machines, most RPC systems define a special data representation of their own which is used when transmitting data structures between machines during a remote call or when returning from one. When making a remote call, the client must convert the procedure arguments into the special data representation, and the server decodes the incoming data into the data representation that is locally supported by the host. Since both the client and server understand the same special data representation, the above problem is solved. An example of a special data representation is Sun's XDR (External Data Representation).
The data representation may employ either implicit or explicit typing. In implicit typing, only the value of a data element is transmitted across the network, not its type. This approach is used by most of the major RPC systems, such as Sun RPC, Xerox Courier, and Apollo RPC. Explicit typing is the ISO technique for data representation (ASN.1, Abstract Syntax Notation One). With this representation, the type of each data field is transmitted along with its value. The encoded type also contains the length of the value being sent. The disadvantage of such an approach is
the overhead spent in decoding.
There are three different approaches to representing data in RPC systems: the network type, the client type, and the server type.
Network Type: In this approach, the entity that is sending the data converts the data from the local format to a network format, and the receiving entity converts the network format into its local format. This approach is attractive when we try to realize heterogeneous RPC systems. If the number of different architectures on the network is large and varied, this is the logical choice when designing the system. The disadvantage is that a format conversion has to be performed twice for each transmission.
Client Type: In this approach, the client transmits the data across the network in its own local format. However, it also sends a short encoded field (typically a byte) which indicates its own data format. The server must then convert the received data by calling an appropriate conversion routine based on the encoded information about the client's data representation. The advantage of this approach is that the conversion of data representation now occurs only once (at the server end), instead of twice as in the network-type data representation. The disadvantage is that the server must now be provided with the capability of converting a variety of different data representations into its own. Whenever a client machine with a new data representation is added to the network, all the servers must be modified and provided with an additional routine to convert the new machine's data representation into their own.
Server Type: In this approach, the client is required to know the data representation format of the server and to convert the data to the server's representation before transmitting it. In this way only one conversion is done (at the client end). The client can determine the server's data representation if this information can be procured from the binder. For this to be possible, the server must inform the binder of its data representation when it registers itself with the binder.
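The trade-off among these strategies can be illustrated with a short sketch of the client-type approach: the sender tags each message with one byte identifying its own integer byte order, and the receiver converts only when the formats differ. The tag values and the single 32-bit field are assumptions made for the example.

    #include <stdint.h>
    #include <string.h>

    #define FMT_BIG_ENDIAN    1
    #define FMT_LITTLE_ENDIAN 2

    static uint8_t local_format(void)
    {
        uint16_t probe = 0x0102;
        return (*(uint8_t *)&probe == 0x01) ? FMT_BIG_ENDIAN : FMT_LITTLE_ENDIAN;
    }

    /* Sender: transmit the value in the local format, preceded by the tag. */
    size_t pack_client_type(uint8_t *buf, uint32_t value)
    {
        buf[0] = local_format();                 /* one-byte format indicator    */
        memcpy(buf + 1, &value, sizeof(value));  /* value left in local format   */
        return 1 + sizeof(value);
    }

    /* Receiver: the conversion is done once, and only if the formats differ. */
    uint32_t unpack_client_type(const uint8_t *buf)
    {
        uint32_t v;
        memcpy(&v, buf + 1, sizeof(v));
        if (buf[0] != local_format())
            v = ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
                ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
        return v;
    }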
7.3.3 RPC Interface
Stubs may be generated either manually or automatically. If generated manually, it becomes the server programmer's responsibility to write separate code to act as a stub. The programmer is then in a position to handle relatively complex parameter passing fairly easily. Automatically generated stubs require the existence of a parameter description language, which we refer to as an Interface Description Language (IDL).
Figure 7.2: Creation of stubs: (a) manually, the programmer writes the stub source code, which is compiled into stub object code and linked with the client/server code; (b) automatically, the programmer writes the client/server interface in the parameter description language, a processor generates the stub source code, and the stub object code is then compiled and linked with the client/server code.
In effect, the IDL is used to define the client-server procedure call interface. Interface definitions are then processed to produce the source code for the stubs. The stub code can then be compiled and linked (as in manual generation) to the client or server code. These steps are shown diagrammatically in Figure 7.2.
The interface language provides a number of scalar types, such as integers, reals, Booleans, and characters, together with facilities to define structures. If the RPC system is part of a language that supports interface procedure definition, such as Cedar RPC, an interface language is not needed[4]. When RPC is to be added to languages with no interface definition capabilities, like C or Pascal, an interface language is needed. Examples include the Amoeba Interface Language (AIL, used in the Amoeba system) and rpcgen (used in the Sun RPC system).
Definitions in the interface language can be compiled into a number of different languages, enabling clients and servers written in different languages to communicate using RPC via the interface compiler. When the interface definition is compiled with the client or server application code, a stub procedure is generated for each procedure in the interface definition. The interface language should support as many languages as possible. Also, the interface compiler is the basis used for the integration of remote procedure calls into existing programming languages. When an RPC mechanism is built into a particular language, both client and server applications are required to use the same language, which limits the flexibility and expandability of different and new services in the system.
7.3.4 Binding
An important design issue of an RPC facility is how a client determines the location and identity of the server. Without the location of the server, the client's remote procedure call cannot take place. We refer to this as binding. There are two aspects of binding. First, the client must locate a server that will execute the procedure even when the server location changes[2]. Second, it must be ensured that the client and server RPC interfaces were compiled from the same interface definition[2]. There are two approaches to implementing the binding mechanism.
1. Compile the address of the server into the client code at compile time. This approach is not practical, since the location of resources could change at any time in the system due to server failures or the server simply being moved. If any of these conditions occur, the clients would have to be recompiled with the new server address. This approach allows no flexibility in RPC systems.
2. Use a binder to locate the server address for each client at run time. The binder usually resides on a server and contains a table of the names and locations of all currently supported services, together with a checksum identifying the version of the RPC interface used at the time of the export. An instance identifier is also needed to differentiate identical servers that export the same RPC interfaces. When a server starts executing, its interface is exported to the binder with the information specified above to uniquely identify its procedures. A server must also be able to withdraw its own instances by informing the binder. When a client starts executing, it can be bound to an instance of a server by making a request to the binder for particular procedures. This request contains a checksum of the parameters it expects, so the binder can ensure that the client and server were built from the same RPC interface definition. This request only needs to be made once, not for every call to the procedure. The only other time the client should make a binding request is if it detects a crash of one of the servers it uses. In addition to any functional requirements, the design of the binder must be robust against failures and should not become a performance bottleneck. One solution to this issue was used by the Cedar RPC system: it involved distributing the binder among several servers and replicating information across them[1]. Consequently, the binding mechanism of the system did not rely entirely on one server's operation, which allows the binder to continue operating when a server containing a binder crashes. The one drawback of spreading the binder across multiple servers is the added complexity needed to control these binders. (A minimal sketch of the kind of table a binder maintains follows this list.)
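As promised above, the following C sketch shows the kind of table a binder might maintain and the export and lookup operations on it. The field names, sizes, and the flat in-memory table are assumptions made for illustration; a real binder would also handle withdrawal, instance selection, and replication.

    #include <string.h>

    struct binding_entry {
        char         service[64];    /* name of the exported interface        */
        char         host[64];       /* server location                       */
        unsigned int port;
        unsigned int checksum;       /* identifies the interface version      */
        unsigned int instance;       /* distinguishes identical servers       */
        int          in_use;
    };

    #define MAX_BINDINGS 128
    static struct binding_entry table[MAX_BINDINGS];

    /* Called when a server exports its interface to the binder. */
    int binder_export(const char *service, const char *host, unsigned int port,
                      unsigned int checksum, unsigned int instance)
    {
        for (int i = 0; i < MAX_BINDINGS; i++) {
            if (!table[i].in_use) {
                strncpy(table[i].service, service, sizeof(table[i].service) - 1);
                strncpy(table[i].host, host, sizeof(table[i].host) - 1);
                table[i].port = port;
                table[i].checksum = checksum;
                table[i].instance = instance;
                table[i].in_use = 1;
                return 0;
            }
        }
        return -1;                   /* table full */
    }

    /* Called when a client asks to be bound.  The checksum comparison ensures
     * that client and server were built from the same interface definition. */
    const struct binding_entry *binder_lookup(const char *service,
                                              unsigned int checksum)
    {
        for (int i = 0; i < MAX_BINDINGS; i++)
            if (table[i].in_use &&
                strcmp(table[i].service, service) == 0 &&
                table[i].checksum == checksum)
                return &table[i];
        return 0;                    /* no matching export */
    }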
7.3.5 Semantics of Remote Calls
One main goal of RPC systems is to provide a transparent mechanism for accessing remote services. Semantic transparency is an important design consideration in RPC systems. Syntactic transparency can be achieved by using the interface definition and stub generators, but semantic transparency has to be addressed in two areas[2].
Call Semantics
Call semantics determine how the remote procedure is to be performed in the presence of a machine or communication failure. There are four reasons that can lead to the failure of a remote procedure once it is initiated: the call request message can be lost, the server can crash, the response message can be lost, and the client can crash. Under normal circumstances, we would expect a remote call to result in the execution of the procedure exactly once; those are the semantics of a local call. However, whenever network communication is involved it is not rare to encounter abnormal circumstances. As a result, the ideal semantics of local procedure calls are not always realized. RPC systems typically offer one of three different types of call semantics.
1. At Least Once: This implies that the remote procedure is executed at least once, but possibly more than once. These semantics prevent the client from waiting indefinitely, since it runs a time-out mechanism on the completion of the call. To guarantee that the RPC is executed, the client repeatedly sends the call request to the server whenever the time-out is reached, until a reply message is received or it is sure that the server has crashed (a sketch of such a retry loop appears at the end of this subsection). This type of call semantics is used in some systems mainly because it is simple to implement. However, these semantics have the drawback that the procedure may be executed more than once. This property makes this type of call semantics inappropriate for, say, a banking system; if a person is withdrawing $100 and the withdrawal is executed several times, the problem is obvious. A possible solution when using at-least-once semantics is to design an idempotent interface, in which no error occurs if the same procedure is executed several times.
2. At Most Once: This implies that the procedure may be executed once or not at all, except when the server crashes. If the remote call returns normally, then we conclude that the remote procedure was executed exactly once. If an abnormal return is made, it is not known whether the procedure executed or not. The possibility of multiple executions of the procedure, however, is ruled out. When using these semantics, the server needs to keep track of client requests in order to detect duplicates, and must be able to return old data in the response message if the response fails. Therefore, these semantics require a
more complex protocol to be designed. These call semantics are used in the Cedar RPC system[1, 4].
3. Exactly Once: This implies that the procedure is always executed exactly once, even when the server crashes[2]. This is idealistic, and it is unreasonable to expect it because of possible server crashes.
Most RPC systems implement and support at-most-once call semantics; some provide merely at-least-once semantics. The choice of semantics supported affects the kind of procedures that the application programmer is allowed to write.
Idempotent procedures: These are procedures that may be executed multiple times instead of just once, without any harm being done, or with the same net effect. Such procedures may be written regardless of what call semantics are supported.
Non-idempotent procedures: These procedures must be executed only once to obtain the desired net effect. If executed more than once, they produce a different result from that expected (e.g., a procedure to append data to a file). These procedures may be written if at-most-once semantics are provided by the RPC system. Therefore, to allow the writing of such procedures it is very desirable that the RPC system support at-most-once semantics.
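The at-least-once retry loop mentioned earlier can be sketched as follows. send_request() and recv_reply() are hypothetical transport helpers introduced only for illustration; the time-out value and retry count are likewise assumptions.

    #include <stddef.h>

    /* Hypothetical transport helpers; recv_reply() returns 0 on success and
     * -1 if the time-out expires before a reply arrives. */
    int send_request(const void *msg, size_t len);
    int recv_reply(void *buf, size_t len, int timeout_ms);

    int call_at_least_once(const void *req, size_t req_len,
                           void *reply, size_t reply_len, int max_retries)
    {
        for (int attempt = 0; attempt < max_retries; attempt++) {
            if (send_request(req, req_len) < 0)
                return -1;
            if (recv_reply(reply, reply_len, 1000) == 0)
                return 0;                 /* reply received: call completed     */
            /* Time-out: retransmit.  The procedure may be executed again on
             * the server, so this loop is only safe for idempotent procedures. */
        }
        return -1;                        /* assume the server has crashed      */
    }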
7.3.6 Parameter Passing
In normal local procedure calls, it is valid and reasonable to pass arguments to procedures either by value or by reference (i.e., by pointers to the values). This works because all local procedures share the same address space on the host they are executed on. However, when we consider remote procedure calls, we realize that the address spaces of the local and remote procedures are not shared. Hence passing arguments by reference to procedures executing remotely is, as such, invalid and pointless. Most RPC systems that use non-shared remote address spaces therefore insist that arguments to remote procedures be passed by value only. Only very few RPC systems, designed for closed systems where the address space is shared by all processes in the system, allow passing arguments by reference.
Passing Long Data Elements
Passing large data structures such as large arrays by value may not be possible given the limitations on packet size for transport-level communication. One solution to this problem is suggested by Wilbur and Bacarisse, wherein dedicated server processes are used to supply parts of large arguments as and when needed. For this purpose the client must be able to specify to the remote procedure the identity of such a server. This may be done by passing a server handle, with which the remote procedure may use the server (within the client itself) to obtain its large parameters. This obviously involves nested remote procedure calls, with the server being able to make a nested remote call to the client (see Figure 7.3).

Figure 7.3: Nested call-back for remote procedure. The client's call passes a handle; the server issues a call-back to obtain part of the large parameter of the first call and then replies with the results. This can be used as a solution to the problem of passing large parameters to remote procedures.
Passing Parameters by Reference
Although most current RPC systems disallow passing parameters by reference due to the non-shared address space between client and server, this need not necessarily be so. In fact, if remote procedure calls are to gain popularity with programmers, it is desirable that they imitate local calls as much as possible. Hence it is very desirable that an RPC system be able to support calls in which parameters are pointers or references. Sometimes it is possible to approximate the net effect of passing a pointer parameter to a remote procedure.
Tanenbaum suggests one way of doing this if the client stub is aware of the size of the data structure that the pointer is pointing to. If this is the case, then the client can pack the data structure into the message to be sent to the server. The server stub then invokes the remote procedure not with the original pointer but with a new pointer which points to the data structure passed in the message. This data structure is then passed back to the client stub in order that it may know of any changes made to the
data structure by the server.
The performance of this operation can be improved if the client stub has additional information as to whether the reference is an input or an output parameter for the remote procedure. If it is an output parameter, the buffer pointed to by the pointer need not be sent at all by the client stub; only the server packs the data structure on the return from the server. On the other hand, if the client stub is aware that the reference is an input parameter only, then it can send that information to the server stub. The server stub would then refrain from sending the data structure back to the client, as it would not have undergone any modification. This effectively improves the communication performance by a factor of 2.
The above discussion is only valid if the client stub indeed has the required information pertaining to reference parameters. This can be achieved by specifying the format of the parameters when formally specifying the client stub, perhaps using the parameter description language for stubs described earlier.
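A minimal sketch of this copy/restore approximation is shown below for a remote procedure that scales a fixed-size vector in place. The procedure name, the vector length, and the message layout are assumptions made for the example.

    #include <string.h>

    #define VEC_LEN 8

    /* The procedure as the programmer sees it: it modifies the array that its
     * pointer argument refers to.  (Its body is not shown here.) */
    void scale_vector(double *v, double factor);

    /* Server-stub side: the client's pointer is meaningless in this address
     * space, so the stub rebuilds the referenced data from the request
     * message, calls the procedure with a new local pointer, and copies the
     * (possibly modified) data into the reply message. */
    void server_stub_scale(const double *in_msg, double factor, double *out_msg)
    {
        double local[VEC_LEN];
        memcpy(local, in_msg, sizeof(local));    /* copy in from the request   */
        scale_vector(local, factor);             /* call with a fresh pointer  */
        memcpy(out_msg, local, sizeof(local));   /* copy out into the reply    */
    }

    /* Client-stub side: overwrite the caller's buffer when the reply arrives,
     * so the net effect resembles call by reference.  If the stub knew the
     * parameter was input-only, this copy (and the reply data) could be
     * omitted, halving the data transferred. */
    void client_stub_restore(double *caller_buf, const double *reply_msg)
    {
        memcpy(caller_buf, reply_msg, VEC_LEN * sizeof(double));
    }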
7.3.7 Server Management
This issue addresses how the servers in the RPC system will be managed, since it directly affects the semantics of a procedure call. There are three different strategies that can be used to manage RPC servers: static server, server manager, and stateless server.
Static Server: This is the simplest possible approach; it is based on the idea of having the client arbitrarily select a server to use for its calls[2]. The difficulty with this approach is that each server must now maintain state information concurrently, as there is no guarantee that when a client makes a second call, the same server will be allocated to it as for the first call. These servers may not be dedicated to a single client, i.e., a server may be required to serve several clients in an interleaved fashion. This introduces the additional difficulty that each server must concurrently maintain the state for each client separately. If this were not so, sharing a remote resource between several clients would become impossible, since the different clients may attempt to use the resource in conflicting manners.
Server Manager: This approach uses a server manager to select which server a particular client will interface with[2]. When the client makes a remote call, the binder, instead of returning the address of a server, returns the address of a server manager. When the client calls the server manager, the manager returns the address of a suitable server, which is thereupon dedicated to the client. The main advantage of using a server manager is that load balancing is not an issue, as each server serves only one client. Also, due to the dedicated servers, each server now needs to maintain state information for only one client.
Stateless Server: This is a less common approach to server management in which each call to the server results in a new server instance[2]. In this case, no state information is retained between calls, even from the same client. Therefore, state information has to be passed to the server in the request message, which adds to the overhead of the call.
In many implementations of RPC systems the static server is used, since it is usually expensive to produce a server per client in a distributed system. The server must be designed so that interleaved or concurrent requests can be serviced without affecting each other. There are two types of delays that a server should be concerned with: the local and the remote delay[2]. The local delay occurs when a process cannot execute a call because a resource is currently in use. This degrades system performance if the server cannot service incoming message requests during this delay. This type of delay occurs in servers that are designed with a single server process, which allows only one service call at a time. Therefore, the designer of an RPC system must produce an efficient server implementation for servicing multiple requests concurrently. One possible solution is to design the server with one process for receiving incoming requests from the network and several worker processes designated to actually execute the procedure[1, ?]. The distributor and worker processes are implemented as lightweight processes that all share the same address space. This implementation of the server is illustrated in Figure 7.4.
The distributor polls for incoming messages in a loop and puts them into a queue of message buffers. The message buffers are implemented using shared memory; the distributor as well as the worker processes can access them. When a worker process is free, it extracts an entry from the message queue and starts executing the required procedure. When done, the worker process replies directly to the calling client process; it obtains the client's address from the message buffer entry. If there is another pending entry in the queue, the worker process commences on that, or else it temporarily suspends execution.
The queue of message buffers must be controlled by a monitor. The functions of the monitor typically include providing mutual exclusion on the shared variables, memory, and resources, including the message queue. It ensures that there is no conflict between the distributor putting entries into the message queue and the worker processes extracting entries from it. It must also awaken worker processes that may be suspended while waiting for access to a shared resource.
The client's distributor associates incoming replies with their corresponding clients because, when a client makes a call, it inserts a unique message Id into the request message. The server merely copies the unique message Id into the reply message. The listening process at the client routes the reply to the appropriate client based on the unique message Id. This scheme therefore requires that clients at a given machine
be able to generate unique identifiers amongst themselves.

Figure 7.4: Implementing server with lightweight processes. A distributor process receives incoming messages and places them in message buffers held in shared memory; several worker processes extract entries and execute the requested procedures, with the shared variables protected by a monitor.
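The distributor/worker organization can be sketched with POSIX threads, where a mutex and a condition variable stand in for the monitor. The queue size, message format, and the execute_procedure() helper are assumptions made for illustration; the lightweight processes of an actual RPC runtime need not be POSIX threads.

    #include <pthread.h>

    #define QUEUE_LEN 32
    #define MSG_SIZE  512

    struct message { char data[MSG_SIZE]; };

    /* Assumed helper: runs the requested procedure and replies directly to
     * the client named inside the message. */
    void execute_procedure(struct message *m);

    static struct message queue[QUEUE_LEN];          /* message buffers in    */
    static int head, tail, count;                    /* shared memory         */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

    /* Called by the distributor when a request arrives from the network. */
    void enqueue_request(const struct message *m)
    {
        pthread_mutex_lock(&lock);                   /* monitor: mutual exclusion */
        if (count < QUEUE_LEN) {
            queue[tail] = *m;
            tail = (tail + 1) % QUEUE_LEN;
            count++;
            pthread_cond_signal(&nonempty);          /* wake a suspended worker   */
        }
        pthread_mutex_unlock(&lock);
    }

    /* Each worker loops: take a pending entry, execute the procedure, and
     * reply to the calling client; with no pending entry it suspends. */
    void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            struct message m;
            pthread_mutex_lock(&lock);
            while (count == 0)
                pthread_cond_wait(&nonempty, &lock); /* suspend until work arrives */
            m = queue[head];
            head = (head + 1) % QUEUE_LEN;
            count--;
            pthread_mutex_unlock(&lock);

            execute_procedure(&m);
        }
        return 0;
    }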
7.3.8 Transport Protocol
An important design consideration for an RPC system is selecting a suitable transport protocol for the transfer of data and control between the client and the server. One important criterion is to choose the protocol that minimizes the time between the initiation of a procedure call and the receipt of its results[1]. RPC mechanisms are usually built using datagram communications, since RPC messages are mostly short and establishing a connection is undesirable overhead, especially when local area networks are reliable[4]. Generally speaking, one can identify three RPC transport protocols: the Request protocol, the Request/Reply/Acknowledgement protocol, and the Request/Reply protocol[4]. The Request protocol is used when there is no value to be returned to the client and the client requires no confirmation that the procedure has been executed. This protocol supports the maybe semantics and is the least used RPC protocol because of its limited flexibility. The Request/Reply/Acknowledgement protocol requires the client to acknowledge the reply message from the server. This protocol is rather inefficient, since the acknowledgement does not really add anything to the protocol. Once a client receives the reply
with the results of the call, the call is complete. This property of RPC makes the Request/Reply protocol the most widely used protocol.
Another important feature of the transport protocol is its ability to detect failures and server crashes. One simple mechanism to achieve this is for the protocol to send a probe message to the server during communications that require an acknowledgement[1]. If the client periodically sends a probe to the server and the server responds to these probes, the server is operational. If the server does not respond to the probes, the assumption can be made that the server has crashed and an exception should be raised. Another extension that the protocol should have is the ability to send messages larger than one datagram; that is, the protocol should support a multiple-packet transfer mechanism.
Two of the more popular transport protocols in use today, and offered in the BSD 4.2 version of UNIX, are the User Datagram Protocol (UDP) and the Transmission Control Protocol (TCP)[4]. TCP communication is a connection-based mechanism which requires the communicating processes to establish a connection between them before transmission occurs. This causes additional communication overhead when transmitting messages. The mechanism is mainly used in RPC systems when extremely high reliability is important, multiple execution of procedures would cause problems, and multiple-packet transfers are frequent[5]. On the other hand, the UDP communication mechanism is a connectionless datagram protocol; it does not need to establish a connection between the communicating processes. UDP is popular in UNIX-based RPC systems in which multiple execution of procedures will not cause a problem, the servers support multiple clients concurrently, and transfers are mostly single-packet[5].
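The datagram style of interaction is easy to see in code. The fragment below sends a single-datagram request over UDP and waits for the reply using the BSD socket interface; the server address, port, and request contents are assumptions made for the example, and retransmission and time-outs are omitted.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);            /* connectionless UDP socket */
        if (s < 0) { perror("socket"); return 1; }

        struct sockaddr_in srv;
        memset(&srv, 0, sizeof(srv));
        srv.sin_family = AF_INET;
        srv.sin_port   = htons(5555);                      /* assumed server port    */
        inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);    /* assumed server address */

        const char request[] = "CALL proc=1 args=...";     /* illustrative payload   */
        sendto(s, request, sizeof(request), 0,
               (struct sockaddr *)&srv, sizeof(srv));      /* single-datagram request */

        char reply[1024];
        ssize_t n = recvfrom(s, reply, sizeof(reply), 0, NULL, NULL);
        if (n > 0)
            printf("received %zd-byte reply\n", n);        /* call complete on reply */

        close(s);
        return 0;
    }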
The RPC system designers are normally faced with three options when choosing a transport protocol:
1. Use an existing transport protocol (such as TCP or UDP) and an existing implementation of this protocol.
2. Use an existing transport protocol, but develop an implementation of this protocol which is specialized for the RPC system.
3. Develop an entirely new transport protocol which will serve the special requirements of an RPC system.
The first approach is the simplest, but it provides the poorest performance. The second approach can be considered when better performance is needed; when high performance is the prime issue, the third option is really the only viable one. The argument put forward by Birrell and Nelson in support of the last option is the following. Existing protocols such as TCP are suitable for data transfers involving large chunks of data at a time in a byte stream.
Based on experiments they conducted using existing protocols, they conclude that for the short, quick request-response type of communication involved in RPC systems, it is desirable that a new, faster protocol be developed specially for the RPC system.
The future trend in the design of RPC systems will predictably be toward lightweight transport protocols. Significant performance improvement can be obtained if the remote call and reply messages can be sent by bypassing the existing complex protocols at the transport and data link layers and instead implementing simple, light, high-speed protocols for message communication, as discussed in Chapter 4.
7.3.9 Exception Handling
So far we have been able to model remote procedure calls fairly closely to local procedure calls semantically. However, this holds only as long as no abnormal events occur at the client end, at the server end, or in the network in between. The fundamental difference between remote calls and local calls becomes apparent when errors occur. There are many entities, both hardware and software, that interact when a remote call is made. A fault in any of these can lead to a failure of the remote call. Some obvious possible errors are discussed below.
Lost remote call request from client stub
The client stub sends out a message to a server for a remote call and the request is lost, possibly due to a failure in the network transmission. This problem can be solved if the client stub employs a timer and retransmits the request message after a time-out.
Lost reply from server stub
The reply message from the server stub to the client stub after successful execution of the remote procedure may be lost before it reaches the client stub. Relying on the client stub's timer to solve this problem is not a reasonable solution if we consider that the remote call may be a non-idempotent procedure. We cannot allow the procedure to be executed several times, once for each retransmission from the client, before the client successfully gets a reply from the server. A solution to this problem is for the client stub to tag each transmission with a sequence number. If the server has already executed the procedure for a request and later sees a retransmission carrying the same (or an older) sequence number, it refrains from re-executing the procedure. This approach results in at-most-once semantics for the RPC system.
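The sequence-number check can be sketched as a small server-side table, one entry per client, recording the last executed sequence number and a cached reply that can be retransmitted when a duplicate arrives. The table layout, sizes, and identifiers are assumptions made for illustration.

    #include <string.h>

    #define MAX_CLIENTS 64

    struct client_state {
        unsigned int client_id;          /* identifies the calling client      */
        unsigned int last_seq;           /* highest sequence number executed   */
        char         cached_reply[256];  /* reply kept for retransmission
                                            (filled in by the reply path,
                                            which is omitted here)             */
        int          in_use;
    };

    static struct client_state clients[MAX_CLIENTS];

    static struct client_state *find_or_add(unsigned int client_id)
    {
        int free_slot = -1;
        for (int i = 0; i < MAX_CLIENTS; i++) {
            if (clients[i].in_use && clients[i].client_id == client_id)
                return &clients[i];
            if (!clients[i].in_use && free_slot < 0)
                free_slot = i;
        }
        if (free_slot < 0)
            return 0;
        clients[free_slot].in_use = 1;
        clients[free_slot].client_id = client_id;
        clients[free_slot].last_seq = 0;
        return &clients[free_slot];
    }

    /* Returns 1 if the request should be executed, 0 if it duplicates a
     * request that was already executed; duplicates are answered from the
     * cached reply instead of re-executing the procedure. */
    int should_execute(unsigned int client_id, unsigned int seq,
                       char *reply_out, unsigned int reply_len)
    {
        struct client_state *c = find_or_add(client_id);
        if (c == 0)
            return 1;                                /* table full: just execute */
        if (seq <= c->last_seq) {                    /* retransmission           */
            if (reply_len > sizeof(c->cached_reply))
                reply_len = sizeof(c->cached_reply);
            memcpy(reply_out, c->cached_reply, reply_len);
            return 0;
        }
        c->last_seq = seq;                           /* new request: record it   */
        return 1;
    }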
7.3.10 Server crash
The server may crash either before it ever received the request message, or after having executed the procedure but before replying. The unfortunate part is that the client has no way of knowing which of the two cases applies; in either case the client will time out. This may be approached in two ways. One option is to allow the server to reboot, or to locate an alternate server through the binder, and to retransmit the request message. This results in at-least-once semantics and is unsuitable for RPC systems that want to support non-idempotent remote procedure calls. The second option is for the client stub not to retransmit after a time-out and merely report the failure back to the client. This ensures at-most-once semantics, and non-idempotent procedure calls pose no problem.
Client Crash
It is possible that a client may crash after it has sent a request message for a remote call but before it gets a reply to it. Such a server computation has been labelled an orphan, as it is an active computation with no parent to receive the reply. For one thing, orphans waste processing resources. More importantly, they may tie up valuable resources in the server. This is probably the most annoying of all the possible failures in an RPC system, as it is rather difficult to handle in an elegant and effective manner. Several authors have differing views on how to approach this problem.
One of the more reasonable approaches, suggested by Nelson, is that when the client reboots after a crash, it should broadcast a message to all other machines on the network indicating the start of a new "time frame of computation", termed an epoch. On receiving an epoch broadcast message, each machine attempts to locate the owners of any remote computations being executed on it. If no owner is found, the orphan is killed.
The problem with killing an orphan, however, is that it creates problems with respect to allocated resources. For example, an orphan may have opened files based on file handles provided by the client. If these are closed when the orphan is killed, the question arises of what happens if that file was also being used by some other server invoked by the same client. Another suggestion, made by Shrivastava and Panzieri, is that servers should be made atomic: the server either executes from start to finish with a successful reply to the client, or it does nothing at all. This all-or-nothing approach is conceptually good but leads to difficulties in its implementation. For example, the server would be required to maintain information which would enable it to undo its processing if it discovers that it has become an orphan.
Figure 7.5: Non-blocking RPC. The client issues a call with parameters and is free to continue execution after calling a remote procedure that does not return a reply; the server executes the procedure and no reply is required. This can potentially be used as a powerful tool to develop distributed applications with a high degree of concurrency.
7.3.11 Non-blocking RPC
The RPC model discussed so far has assumed a synchronous, blocking mechanism. Some RPC systems, however, do support a non-blocking model, most notably the Athena project at MIT. Non-blocking RPC is conceptually possible for remote calls that do not return any reply to the client. This is shown in Figure 7.5.
The client is free to continue execution after a call to a remote procedure that does not return a reply. This feature can be used as a powerful tool to develop distributed applications with a high degree of concurrency. Once a call to a remote procedure is made, the client is able to continue further processing concurrently with the execution of the remote procedure. One way of achieving this is for the client stub that makes the remote call to be aware of whether the remote procedure will be returning a reply or not. It is not sufficient for the client stub to perform a non-blocking remote call for every procedure that does not return a reply. It becomes the programmer's responsibility to specify whether a call to a remote procedure should be blocking or non-blocking at the time the remote procedure is defined, using the procedure interface language described earlier.
Non-blocking RPC can potentially become a very powerful tool in the development of distributed applications with a high degree of concurrency. This is particularly true if it can be combined with the concept of nested remote procedure calls.
7.3.12 Performance of RPC
The performance loss in using RPC is usually a factor of 10 or even more. However, as shown by performance indicators from different experimental RPC systems, this depends greatly on the design decisions incorporated into the RPC system. Typically, trade-offs need to be made between enhanced reliability and speed of operation when designing the system. The performance of RPC systems can be analyzed in terms of the time spent during the call. Wilbur and Bacarisse present a breakdown of the time components that go into making a remote call, which are as follows:
1. parameter packing
2. transmission queuing
3. network transmission
4. server queuing and scheduling
5. parameter unpacking
6. execution
7. results packing
8. transmission queuing
9. network transmission
10. client scheduling
11. results unpacking
These components can be combined into four categories: parameter transformations, network transmission, execution, and operating system delays. Parameter transformations are an essential part of the RPC mechanism when dealing with heterogeneous RPC systems; however, they can be optimized, as discussed earlier under data representations. For large parameters, network transmission time may
well become a bottleneck in RPC systems. This is especially true for most current
RPC systems, as most of them are networked using slow 10 Mbit/s Ethernet. If the physical transmission medium were changed to optical transmission
media, we could expect a significant improvement in RPC performance. However, for
remote calls with very small or even no parameters, the bottleneck is indisputably the
operating system overhead, which includes page swapping, network queuing, routing,
scheduling and other delays.
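As a rough, illustrative calculation of why transmission dominates only for large parameters: shipping 100 Kbytes of arguments over a 10 Mbit/s Ethernet takes at least 100 x 1024 x 8 / 10,000,000 of a second, or roughly 80 ms of raw transmission time, whereas a call carrying only a few bytes of arguments transmits in well under a millisecond, leaving parameter packing, scheduling and other operating system delays as the dominant cost.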
7.4 Heterogeneous RPC Systems
Distributed systems involve increased programming complexity along two dimensions: they require dealing with the system-dependent details of maintenance and communication, and they require dealing with these details across a wide range of languages, systems
and networks. This section focuses on the main issues that must be addressed during
the design of an RPC system that runs on heterogeneous computer systems. Its goal
is to radically decrease the marginal cost of connecting a new type of system to an
existing computing environment and, at the same time, to increase the set of common
services available to the users. An RPC facility for heterogeneous computer systems
attempts to address three problems. The first is inconvenience: individuals must either be users of multiple subsystems or accept the consequence of isolation from various aspects of the local computing environment, and
that is not acceptable. The second is expense: the hardware and software
infrastructure of the computing environment is not effectively amortized, making it much
more costly than necessary to conduct specific research on the system best suited for it.
The third is diminished research effectiveness: scientists and engineers should
be doing more useful things than hacking around heterogeneous computing
systems.
In this section, we describe the RPC system (HRPC) associated with the Heterogeneous Computer Systems (HCS) project developed at the University of Washington [?]. The main goal of this project is to develop software support for heterogeneous
computing environments, not to develop software systems that make them act and
behave as a homogeneous computing environment. Consequently, this approach
leads to reducing the cost of adding new computing systems or resources as well as
increasing the set of common resources. HRPC provides clean interfaces between
the five main components of any RPC system:
Stubs
Binding
Data Representation
Transport Protocol
Control Protocol
In such a system, a client or server stub can consider each of
the remaining four components as a black box. The design and use of an RPC-based
distributed application can be divided into three phases: compile time, bind time and
call time. In what follows, we discuss how the tasks involved in each phase have been
implemented in the HRPC system.
7.4.1 HRPC Call Time Organization
The call-time components of an RPC facility are the transport protocol, the data representation, and the control protocol; the functions of these components and their interfaces with one another are called the call-time organization. In a traditional RPC facility,
all decisions regarding the implementation of the various components, such as the transport protocol, control protocol and data representation, are made at the
time the RPC facility is designed, which makes the task rather simplified at run time. At that point the facility
has to perform only binding, that is, to acquire the location of the server. The binding
typically is performed by the client and is passed to the stub as an explicit parameter
of each call. For example, the DEC SRC RPC system delays the choice of transport
protocol until bind time; because the choice of protocol is invisible to both client
and server, it can be made based on availability and performance.
The HRPC Call Time Interface
As mentioned in [?], the choice of transport protocol, data representation, and control
protocol is delayed until bind time, allowing a wide variety of RPC programs.
There are well-defined call-time interfaces between the components of the HRPC call-time
organization. Traditional RPC systems keep this interface small because a great deal of knowledge concerning the RPC call is provided to the client and
server stubs at compile time; in HRPC such information is not available until bind time.
The Control Component
The control component has routines associated with each direction (send or
receive) and role (client or server). The distinction is required because control information
is generally different in request and reply messages. The routines for sending are as follows:
Call: used by the client to send the initial call message to a service.
InitCall: performs any initialization necessary prior to beginning the call.
CallPacket: performs any functions peculiar to the specific protocol when it is
necessary to actually send a segment of a complete message.
FinishCall: terminates the call.
The routines for receiving are as follows:
Request: performs the functions associated with a service receiving a call.
Reply: used when the server replies to the client's request.
Answer: concerned with receiving that reply on the client end.
CloseRpc: used by user-level routines to notify the RPC facility that its services
are no longer needed.
Data Representation Components
The number and purpose of the data representation procedures are driven by the
interface description language (IDL): there is one routine for each primitive type
and/or type constructor defined by the IDL. HRPC uses a modified Courier as its IDL.
Each routine translates an item of a specific data type in the host machine representation
to or from some standard over-the-wire format. Each routine is capable of both encoding and
decoding data; which is performed is determined by the state of the call.
Transport Components
The functions of the transport routines are to open and close a logical link
according to the peculiarities of the particular transport protocol, and to send or receive
a packet. Some important routines are as follows:
Init: initializes transmission/reception related to a single call message.
Finish: terminates transmission/reception of a message.
BufAlloc: allocates a memory buffer for a message.
Dealloc: deallocates a memory buffer for a message at the transport level. This allows
transport-specific processing to occur in an orderly manner.
Having defined the above components, let us discuss how they interface with one another,
as shown in Figure 7.6. The transport component provides network buffers and sends and
receives data in these buffers. The control component is responsible for calling the transport
component routines that initialize sending or receiving and for obtaining the appropriate buffers. The control component uses the data representation component to insert
protocol-specific control information into the message being constructed. The data representation component does not fill or empty data buffers on its own, because it does not know
the distinction between user data and RPC control information, and it never directly
accesses transport functions. A control-level function is called to dispose of a buffer when
it is full. The data representation routines have no information concerning the placement of control information within a message, so they must call control routines for that purpose.
Also, the data representation routines are ignorant of whether they are operating in a client
or server context, a distinction that may be important to the control routines.
Figure 7.6: HRPC Call Time Organization. On each side, the client or server stub sits above the control, data representation, and transport components; the RPC message that travels between them carries transport, control, and data sections.
The data representation component communicates with the control component via
the routines GetPacket and PutPacket, which are implemented by AnswerPacket or RequestPacket.
These routines isolate the buffer-filling routines from the role context (client or server)
in which they are operating, although in the process they do learn the direction, sending or receiving.
7.4.2 HRPC Stubs
Client stubs implement the calling side of an HRPC and server stubs implement the
called side, as described in detail in [?].
HRPC Stub Structure
The stub routines have built into them detailed knowledge of the binding structures and of
the underlying HRPC components. The stubs issue the sequence of calls to the underlying
component routines that implement the HRPC semantics. Client and server stubs are
produced by the stub generator upon processing a specific interface specification
expressed in the Courier IDL. The input and output parameters used in the call to
the user routine are defined within the server stub itself.
HRPC Stub Generator
The purpose of the stub routines is to insulate user-level code from the details
and complexity of the RPC runtime system. The interface to an RPC service is
expressed in an interface description language. A stub generator accepts this interface
description as input and produces appropriate stub routines in some programming
language; these are then compiled and linked with the actual client and server code.
The HRPC system uses a stub generator for an extended version of the Courier
IDL. The stub routine code is responsible for marshalling parameters and actually calling the user-written routines; the server stub contains a dispatcher module for fielding
incoming request messages and deciding, based on message content, which procedure
within the server interface is being called. The generator allows users to define their own marshalling routines for complicated data types such as those containing pointer references. The stubs generated for an interface may behave differently depending upon the
configuration of the call-time components used to make the call, but this does not prevent a
client (or server) from talking to servers (or clients) built with a different configuration. The use of a "higher level" data representation protocol, in conjunction with the use of an IDL, allows the data content of a message
to be mutually comprehensible by stubs written in different languages. The HRPC
stubs do not provide direct support for the marshalling of data types involving
pointers.
7.4.3 Binding
Binding is the process by which a client becomes associated with a server before an RPC
can take place. Before a program can make a remote call, it must possess a handle for
the remote procedure; the exact nature of the handle varies from implementation
to implementation. The process by which the client gets hold of this handle is known
as binding. There are a few points about binding worth mentioning:
The binding can be done at any time prior to making the call and is often left
until run time.
The binding may change during the client program's execution.
It is possible for a client to be bound to several similar servers simultaneously.
Binding in Homogeneous Systems
A homogeneous environment involves the following steps for binding:
Naming: From the viewpoint of the binding sub-system, naming is the process of mapping
a client-specified server name to the actual network address of the host on which
the server resides; it should also provide some kind of identification aid for detecting
any interface mismatch.
Activation: The next step after naming is activation, which in some RPC designs
consists of creating a server process. In other systems the server process is generated automatically
or assumed to be already active.
Port Determination: A port is an addressable entity. The network address of the
host is not sufficient for addressing a server, because multiple servers might be
active on a single host. Each server is therefore allocated its own communication port,
which together with the network address identifies it uniquely. The
client stub's outgoing messages and the server stub's reply messages can then use this
location information.
Binding in Heterogeneous Systems
Basically, a heterogeneous system consists of many different kinds of machines connected via a network. The complications that result from the heterogeneity are
as follows:
In a heterogeneous system the choice of RPC components is not fixed at implementation time; the components are selected dynamically. Hence HRPC binding
must perform additional processing for this selection, first selecting the components and then configuring the binding data structure
accordingly.
HRPC binding must proceed in a manner that can emulate the binding protocols of each of the systems being accommodated.
Let us review some of the common mechanisms for naming, activation and port
determination used to accommodate the variety of binding protocols in heterogeneous
computer systems.
Naming: Naming information is typically replicated for high availability and
low access latency. Common implementation techniques include a variety of
name services and replicated file schemes.
Activation: In some systems, the user is responsible for pre-activating the server process. In other systems, server processes are auto-activated by the binding mechanism at import time.
Port Determination: Common implementation techniques here are more diverse than for either naming or activation, due to varying assumptions about the
volatility of the information. Some common techniques are as follows:
- Well-Known Ports: In this technique, clients determine which port numbers to use based on a widely published document or file. A programmer
who wishes to build a widely exported server must apply for a port number
from a centrally administered agency. For less widely exported services, a
fixed port is often taken from a range of uncontrolled ports, with this information "hard coded" into both the client and the server.
- Binding Agent: In this technique, a process resides at a well-known
port on each host and keeps track of which port numbers are assigned
to the servers running on that host. On such systems, the process of server
export includes telling the binding agent which port was allocated to the
server. The import side of this scheme can take several forms; for example, the Sun
RPC binding subsystem contacts a binding agent on the server host and
queries that agent for the port number.
- A Server-Spawning "Daemon" Process: In this technique, a process
resides at a well-known port on each host and handles server import
requests by creating a new port, forking off the appropriate process, passing
it the active port, and telling the client's import subsystem the new port
number.
Structure of HRPC Binding
As mentioned in [?], the binding adopted by the HRPC facility has to be richer in
information than that of a traditional RPC facility, because the particular choice of
each of the three call-time components, namely the transport component, control component and
data representation component, is delayed until bind time. The HRPC binding
also has to include information about the location of the server and about how to interact with it. Each of
the three call-time components is pointed to by a separate block of procedural pointers
in the HRPC binding, and the component routines are called via these pointers. This
indirection allows the actual implementation of the component routines to be selected
at bind time. Information specific to a particular component is stored in a private
area associated with that component. Application programs never access the
component structure directly; they deal with the binding as an atomic type via HRPC system calls, acquiring and
discarding it as a whole.
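The "block of procedural pointers" idea can be pictured as a C structure of function pointers, one block per call-time component. The following is only an illustrative sketch; the names and signatures are hypothetical and are not taken from the HRPC implementation.

/* Hypothetical sketch of an HRPC-style binding: each call-time component
 * is represented by a block of procedure pointers, filled in at bind time. */
typedef struct {
    int  (*init_call)(void *priv);                      /* InitCall-style routine */
    int  (*call_packet)(void *priv, char *buf, int len);
    int  (*finish_call)(void *priv);
    void *private_data;                                 /* component-specific state */
} control_block;

typedef struct {
    int  (*encode_int)(void *priv, int *val);
    int  (*encode_string)(void *priv, char **str);
    void *private_data;
} datarep_block;

typedef struct {
    int  (*open_link)(void *priv, const char *host, int port);
    int  (*send_packet)(void *priv, char *buf, int len);
    int  (*recv_packet)(void *priv, char *buf, int maxlen);
    void *private_data;
} transport_block;

typedef struct {
    control_block   control;     /* selected control protocol    */
    datarep_block   datarep;     /* selected data representation */
    transport_block transport;   /* selected transport protocol  */
    char            server_host[64];
    int             server_port;
} hrpc_binding;                   /* treated as an atomic type by applications */

A stub then invokes, say, binding->control.call_packet(...) without knowing which concrete implementation was selected at bind time.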
HRPC Binding Process
This section discusses how the HRPC binding process accommodates the identified areas of heterogeneity, namely naming, activation, port determination and component configuration, in building the
binding data structure needed to allow calls to proceed. First, consider the HCS Name
Service (HNS), shown in Figure 7.7.
Figure 7.7: Basic HNS structure. An HNS-capable client imports through the HNS, which accesses the server's own name service directly in response to the import; an insular server exports only to its native name service.
Each of the systems in the HCS network has some sort of naming mechanism,
and the HNS accesses the information in these name services directly, rather than providing a brand new name service in which to re-register the information from all the
existing name services. The HNS thus provides a global name space through which server
information can be accessed; it maps its names into the names stored in the underlying
name services. This design allows newly added insular services to be available immediately to the HCS network, because insular systems can evolve without
consulting or having to register their presence directly with the HNS, eliminating
some of the consistency and scalability problems inherent in a re-registration-based naming
mechanism. HCS-capable services are also available to insular clients, although more effort
is required, since a single HCS-capable export operation can potentially involve
placing information into each system being accommodated.
There are four possible client-server situations:
Case 1: Both client and server are insular. HRPC does not provide support for
this situation because it would eventually involve modifying the native RPC systems.
The client and server either can communicate directly through their common native
mechanism or they cannot communicate at all. There are, of course, ways to build a gateway for such communication
using HRPC, but doing so runs into the problem just mentioned.
Case 2: The client is insular and the server is HCS-capable. At import time, whatever information is needed by the client's import subsystem will be made available in
the client's environment by the HRPC runtime system. This mainly involves policy
issues for the HNS and component configuration.
Case 3: The client is HCS-capable and the server is insular. This is the more complex
situation. At import time, it is up to the HCS-capable client to extract whatever
information the server's export subsystem placed into its environment at export
time. This involves accommodating all four of the areas of binding heterogeneity
identified above: naming, activation, port determination and component
configuration. This situation is also the more interesting one, because it allows HCS
programs to take advantage of the pre-existing infrastructure in the network,
which provides a substantial number of services.
Case 4: Both client and server are HCS-capable. This situation is a special case of
cases 2 and 3.
The binding process follows these steps:
To import an insular server, an HCS-capable client is required to specify a
two-part string name containing the type (for example, file service) and the instance (the host name)
of the desired service.
The HRPC binding subsystem first queries the HNS to determine the naming
information associated with the server. This information consists of a sequence
of binding descriptor records. A binding descriptor consists of a designator
indicating which control component, data representation component and transport component the service uses, a network address, a program number, a port
number and a flag indicating whether the binding protocol for this particular
server involves contacting a binding agent.
The binding subsystem then fills in the parts of the binding data structure, choosing a combination of control protocol,
data representation and transport protocol that is understood by both the server
and the client. The procedural pointers now point to the routines that handle that
particular set of components.
7.5 SUN RPC
This section describes the main RPC issues in terms of the Sun RPC implementation,
and also describes the development process of a Sun RPC application. Sun RPC is the
primary mechanism for network communication within a Sun network. All network
services of the SunOS are based on Sun RPC and the External Data Representation,
XDR, which provides portability. The Network File System, NFS, is also based on
Sun RPC/XDR.
Each RPC procedure is uniquely defined by a program number and procedure
number. The program number represents a group of remote procedures, and each remote
procedure within the group has a different procedure number. Each program also has a version number:
when a program is updated or modified, a new program number does not need to be
assigned; instead, the version number is incremented for the existing program number.
The portmap is Sun's binding daemon, which maintains the mapping of network
services to Internet addresses (ports). There is a portmap daemon on each machine
in the network. The portmap daemons reside at a known (dedicated) port, where they
field requests from client machines for server addresses. There are three basic steps
to initiate an RPC:
1. During initialization, the server process calls its host machine's portmap to
register its program and version number. The portmap assigns a port number
to the service program.
2. The client process then obtains the remote program's port by sending a request
message to the server's portmap. The request message contains the server's
host machine name, the unique program and version number of the service, and
the interface protocol (e.g., TCP).
3. The client process sends a request message to the remote program's port. The
server process uses the arguments of the request message to perform the service
and returns a reply message to the client process. This step represents one
request-reply (RR) RPC cycle.
If the server does not complete the service successfully, it returns a failure notification indicating the type of failure.
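Step 2 corresponds to the pmap_getport() call in the classic Sun RPC library. The following sketch, using the hypothetical numbers PROG_NUM and VERS_NUM and minimal error handling, shows roughly how a client can look up the server's port:

/* Sketch of step 2: asking the server's portmapper for a service's port.
 * PROG_NUM and VERS_NUM are hypothetical; pmap_getport() returns 0 on failure. */
#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <netinet/in.h>
#include <rpc/rpc.h>
#include <rpc/pmap_clnt.h>

#define PROG_NUM ((u_long)0x20000000)
#define VERS_NUM ((u_long)1)

u_short lookup_port(char *host)
{
    struct hostent *hp = gethostbyname(host);
    struct sockaddr_in addr;

    if (hp == NULL)
        return 0;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    memcpy(&addr.sin_addr, hp->h_addr, hp->h_length);
    /* The portmapper itself listens on the well-known port 111. */
    return pmap_getport(&addr, PROG_NUM, VERS_NUM, IPPROTO_TCP);
}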
The Sun RPC protocol is independent of transport protocols; it involves message
specification and interpretation. The Sun RPC layer does not implement reliability, so the application must be aware of the type of transport
protocol underneath Sun RPC. If the application is running on top of an unreliable
transport such as UDP, the application must implement its own retransmission and
timeout policy, since the RPC protocol does not provide it. The RPC protocol can,
however, set the timeout periods.
Sun RPC does not prescribe specific call semantics, since it is transport independent; the
call semantics can be inferred from the underlying transport protocol. Suppose,
for example, that the transport is unreliable UDP. If an application program
retransmits an RPC request message after a timeout and eventually
receives a reply, it can infer that the remote procedure was executed at least
once. On the other hand, if the transport is reliable TCP, a reply message
implies that the remote procedure was executed exactly once.
Sun RPC software is supported on both UDP and TCP transports. The transport
selection depends on the application's needs. UDP may be selected if the
application can live with at-least-once call semantics, the message sizes are smaller
than the UDP packet size (8 Kbytes), or the service is required to handle hundreds
of clients; since UDP does not keep client state information, it can handle many
clients. TCP may be selected if the application needs high reliability, at-most-once
call semantics are required, or the message size exceeds 8 Kbytes.
Sun RPC software uses the eXternal Data Representation (XDR), which is a standard
for a machine-independent message data format. This allows communication between
a variety of host machines, operating systems, and compilers. Sun RPC software
can handle various data structures, regardless of byte orders or layout conventions,
by always converting them to XDR before sending the data over the network. Sun
calls the marshalling process of converting from a particular machine representation
to XDR format serializing, and the reverse process deserializing.
XDR is part of the presentation layer. XDR uses a language to describe data
formats; it is not a programming language. The XDR language is similar to the C
language and is used to describe the format of the request and reply messages of the
RPCs. The XDR standard assumes that bytes are portable. This can be a problem
when trying to represent bit fields. Variable-length arrays can also be represented. Sun
provides a library of XDR routines to transmit data to remote machines.
Even with the XDR library, it is difficult to write application routines to serialize and deserialize (marshal) procedure arguments and results. Since the details of
programming applications to use RPCs can be time consuming, Sun has provided a
protocol compiler called rpcgen to simplify RPC application programming.
There are three basic steps to develop an RPC program:
1. Protocol specification
2. Creation of server and client application code
3. Compilation and linking of library routines and stubs
The first step in the process of developing an RPC program is defining the client-server interface. This interface definition will be an input to the protocol compiler
rpcgen, so it is written in the RPC Language, which is similar to C. The
interface definition file contains a definition of the parameter and return argument
types, the program number and version, and the procedure names and numbers.
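For instance, a protocol specification for the simple addition service used later in this chapter might look roughly like the following hypothetical sum.x file written in RPC Language; the names deliberately mirror the sum.h header shown in the programming example.

/* sum.x -- hypothetical RPCL interface definition for rpcgen.
 * rpcgen would generate client and server stubs and an XDR filter
 * for the inp_args structure from this file. */
struct inp_args {
    int number1;
    int number2;
};

program SUM_PROG {
    version SUM_VER {
        int SUM_PROC(inp_args) = 1;   /* procedure number 1 */
    } = 1;                            /* version number 1   */
} = 0x20000000;                       /* locally administered program number */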
The rpcgen protocol compiler takes as input a remote program interface definition written
in RPC Language (RPCL). The output of rpcgen is Sun RPC software:
client and server stubs, XDR filter routines and a header include file.
The client and server stubs interface with the Sun RPC library, removing the network
details from the application program. The server stub supports inetd, so the
server can be started by inetd or from the command line. rpcgen can be invoked with
an option to specify the transport to use.
An XDR filter routine is created for each user-defined type in the RPCL
interface definition. The XDR filter routines handle marshalling and demarshalling of the user-defined types into and from XDR format for network transmission. The header include file contains the parameter and return argument types
and server dispatch tables. Although rpcgen can do most of
the work for you, in some cases it can be overly simplistic; rpcgen may not provide a
service needed by a more complicated application program. In that case, rpcgen
can still be used to provide a starting point for the low-level RPC code, or a
programmer can create an entire RPC application without using rpcgen at
all.
The second step in the process of developing an RPC program is creating the
server and client application code. For the server, the service procedures specified
in the protocol are created. Developing the client application code involves making
remote procedure calls for the remote services. These procedure calls will actually be
local procedure calls to client stub procedures, which coordinate the communication
with, and activation of, the remote server. Prior to making these procedure calls, the client
application code must call the server machine's portmap daemon to obtain the port
address of the server program and to create the transport connection as either UDP
or TCP. Note that the application code can be written in a language other than C.
The last step in the process of developing an RPC program is compiling the
client and server programs and linking the application code with the stubs and the
XDR filter routines. The client application code, the client stubs, and the XDR filter
routines are compiled and linked together, creating an executable client program. The
server application code, server stubs and XDR filter routines are compiled and linked
to obtain an executable server program.
The server executable can then be started on a remote machine, after which the client
executable can be run. The development of the RPC program is then complete.
7.5.1 SUN RPC Programming Example
This subsection describes the Sun RPC library and how to develop distributed applications using it.
Remote Procedure Call Library
The interprocess communication mechanism provided by the RPC library for communicating
between the local and remote processes is message passing. Accordingly, the arguments
are sent to the remote procedure in a message and the results are passed back to the
calling program in a message. The RPC library handles this message-passing scheme,
and the application need not worry about how the messages get to and from the remote
procedure. The RPC library delivers messages using a transport, which provides
communication between applications running on different computers. One problem with
passing arguments and results in a message is that differences between the local and
remote computers can lead to different interpretations of the same message. The XDR
routines in the RPC library provide a mechanism to describe the arguments and
results in a machine-independent manner, allowing you to work around any differences
in the computer architectures.
The RPC library uses the client/server model. In this model, servers offer services to
the network, which clients can access; another way to look at the model is that
servers are resource providers and clients are resource consumers. Examples include
file servers and print servers. There are two types of servers, stateless and stateful.
A stateless server does not maintain any information, called state, about any of
its clients, whereas a stateful server maintains client information from one remote
procedure call to the next. For example, a stateful file server maintains information
such as the file name, open mode, and current position after the first remote procedure call;
the client just passes the file descriptor and the number of bytes to read to the
server. In contrast, a stateless file server does not maintain any information regarding
clients, and the clients have to pass all the information in each read request, including
the file name, the position within the file at which to begin reading, and the number of bytes to
be read. Though stateful servers are more efficient and easier to use than
stateless servers, stateless servers have an advantage in the presence
of failures: when a server crashes, recovery is much easier if it is stateless than if
it is stateful. Depending on the application, one is better than the other.
The RPC library has many advantages. It simplifies the development of distributed applications by hiding all the details of network programming. It provides
a flexible model which supports a variety of application designs. It also hides operating system dependencies, thus making the applications built on the RPC library very
portable.
Figure 7.8: Example 32-bit integer representations. In the big-endian byte order used by the System/370 and the SPARC, the most significant byte (containing the sign bit) has the lowest address; in the little-endian byte order used by the VAX and the 8086, the most significant byte has the highest address.
eXternal Data Representation (XDR)
Different computers have different architectures and run different operating systems,
so they often represent the same data types differently. For example, Figure 7.8
shows the representations of 32-bit integers on four computers. All four use two's
complement notation with the sign bit adjacent to the most significant magnitude
bit. In the representation used by the System/370 and the SPARC, the most significant
bit is in the byte with the lowest address (big-endian representation), while in the
80x86 and VAX representations the most significant bit is in the byte with the highest
address (little-endian). Because of this, integers are not portable between little-endian
and big-endian computers. Byte ordering is not the only incompatibility that leads
to non-portable data: some architectures represent integers in one's complement,
and floating-point representations may vary between different computers.
Because of this diversity, there is a need for a standard, machine-independent representation of data
that enables data to be portable. XDR provides such
a standard. This section describes XDR in detail and explains its use in RPC.
XDR Data Representation
XDR is a standard for the description and encoding of data. It is useful for transferring
data between computers with different architectures running different operating
systems. The XDR standard assumes that bytes are portable.
The XDR standard defines a canonical representation of data, namely a single byte order
(big-endian), a single floating-point representation (IEEE), and so on. A program
running on any computer creates portable data by converting it from the local
representation to the machine-independent XDR representation; any other
computer can read this data by first converting it from the XDR representation to its
own local representation. The canonical representation has a small disadvantage: when two
little-endian computers are communicating, the sending computer converts all the
integers to big-endian (the standard) before sending, and the receiving computer
converts them from big-endian back to little-endian. These conversions are unnecessary, as
both machines have the same byte order, but this conversion overhead is very small
compared to the total communication overhead.
An alternative to the canonical standard is to have multiple standards plus a
protocol that specifies which standard has been used to encode a particular piece of
data. In this approach, the sender precedes the data with a 'tag' which describes the
format of the data and the receiver makes conversions accordingly; hence this approach
is called 'receiver makes it right'.
The data types defined by XDR are a multiple of four bytes in length. Because
XDR's block size is four bytes, reading a sequence of XDR-encoded objects into a
buffer results in all objects being aligned on four-byte boundaries, provided that the
buffer itself is so aligned. This automatic alignment ensures that they can be accessed
efficiently on almost all computers. Any bytes added to an XDR object to make its
length a multiple of four bytes are called fill bytes; fill bytes contain binary zeros.
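As a small illustration of this rule, an XDR string is encoded as a 4-byte length followed by the bytes of the string: the 3-character string "abc" occupies 4 + 3 = 7 bytes of content and is padded with one fill byte of binary zero, for a total of 8 bytes on the wire.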
XDR Library
The XDR library is a collection of functions that convert data between local and XDR
representations. The functions can be broadly divided into two sets: those that
create and manipulate XDR streams, and those that convert data to and from XDR
streams. An XDR stream is a byte stream in which data exists in XDR representation.
An XDR filter is a procedure that encodes and decodes data between the local representation
and the XDR representation and reads from or writes to XDR streams. The following paragraphs describe XDR streams and XDR filters in more detail.
XDR streams The XDR stream abstraction makes XDR filters media independent: a filter can read or write data to an XDR stream that
resides on any type of medium, such as memory or disk. Filters interact with XDR streams,
whereas XDR streams interact with the actual medium. There are three kinds of streams
in the XDR library, namely standard I/O, memory and record streams. The valid operations
that can be performed on a stream are:
XDR_ENCODE encodes an object into the stream.
XDR_DECODE decodes an object from the stream.
XDR_FREE releases the space allocated by an XDR_DECODE request.
Standard I/O Streams A standard I/O stream connects the XDR stream to a
file using the C standard I/O mechanism. When data is converted from the local format
to the XDR format, it is actually written to a file, and when data is being decoded, it
is read from the file. The synopsis of the XDR library routine used to create a standard
I/O stream is:
void
xdrstdio_create(xdr_handle, file, op)
XDR *xdr_handle;
FILE *file;
enum xdr_op op;
xdr_handle is a pointer to an XDR handle, which is the data type that supports the XDR
stream abstraction. file references an open file. op defines the type of operation being
performed on the XDR stream. Standard I/O streams are unidirectional, either
encoding or decoding streams.
Memory Streams A memory stream connects the XDR stream to a block
of memory. Memory streams can access an XDR-format data structure located in
memory, which is most useful when encoding and decoding arguments for a remote
procedure call. The arguments are passed from the calling program to the
RPC library, where they are encoded into an XDR memory stream and then passed on
to the networking software for transmission to the remote procedure. The networking
software receives the results and writes them into an XDR memory stream; the
RPC library then invokes XDR routines to decode the data from the stream into the
storage allocated for it. The synopsis of the XDR library routine used to create
a memory stream is:
void
xdrmem_create(xdr_handle, addr, size, op)
XDR *xdr_handle;
char *addr;
u_int size;
enum xdr_op op;
xdr_handle and op are the same as for standard I/O streams. The XDR stream data
is written to or read from a block of memory at location addr whose length is size
bytes. Memory streams are also unidirectional.
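As a brief illustration (a sketch only, with error handling omitted), an integer can be encoded into a memory buffer and decoded back as follows:

/* Sketch: encoding an int into a memory buffer with an XDR memory
 * stream, then decoding it again.  Error handling is omitted. */
#include <rpc/rpc.h>
#include <stdio.h>

int main(void)
{
    char buf[64];
    XDR  enc, dec;
    int  value = 1994, copy = 0;

    xdrmem_create(&enc, buf, sizeof(buf), XDR_ENCODE);
    xdr_int(&enc, &value);               /* buf now holds the XDR (big-endian) form */

    xdrmem_create(&dec, buf, sizeof(buf), XDR_DECODE);
    xdr_int(&dec, &copy);                /* copy == 1994 in local representation */

    printf("decoded value: %d\n", copy);
    return 0;
}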
Figure 7.9: Record Marking. A record consists of one or more fragments; each fragment is preceded by a 4-byte header in which 1 bit marks whether this is the last fragment (0 = fragment, 1 = last fragment) and the remaining 31 bits give the length of the fragment data.
Record Streams Records in a record stream are delimited as shown in Figure 7.9.
A record is composed of one or more fragments. Each fragment consists of a four-byte
header followed by 0 to 2^31 - 1 bytes of data. The header consists of two values:
a bit that, when set, indicates the last fragment of a record, and 31 bits that specify the
length of the fragment data. The synopsis of the XDR library routine used to create a
record stream is:
void
xdrrec_create(xdr_handle, sendsize, recvsize, iohandle, readit, writeit)
XDR *xdr_handle;
u_int sendsize, recvsize;
char *iohandle;
int (*readit)(), (*writeit)();
This routine initializes the XDR handle pointed to by xdr_handle. The XDR
stream data is written to a buffer of size sendsize and is read from a buffer of size
recvsize. The iohandle identifies the medium that supplies records to and accepts
records from the XDR stream buffers; this argument is passed on to the readit() and
writeit() routines. The readit() routine is called by an XDR filter when the XDR stream
buffer is empty, and the writeit() routine is called when the buffer is full. The synopsis of
readit() and writeit(), which is the same for both, is:
int
func(iohandle, buf, nbytes)
char *iohandle, *buf;
int nbytes;
The iohandle argument is the same as the one specified in the xdrrec_create() call and
can be, for example, a pointer to the standard I/O FILE type. buf is the address of the buffer to
read data into for the readit() routine and the address of the buffer to write data
from for the writeit() function. Unlike the standard I/O and memory streams, record
streams can handle both encoding and decoding in one stream; the selection is made by
setting the x_op field in the XDR handle before calling a filter.
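A record stream can, for example, be layered on top of a stdio file by supplying readit() and writeit() routines that simply forward to fread() and fwrite(). The following is a minimal sketch under that assumption; passing 0 for the buffer sizes asks the library to choose suitable defaults.

/* Sketch: connecting an XDR record stream to a stdio FILE.
 * readit()/writeit() forward to fread()/fwrite(); returning -1
 * reports an error to the stream.  Error handling is minimal. */
#include <stdio.h>
#include <rpc/rpc.h>

static int readit(char *iohandle, char *buf, int nbytes)
{
    int n = fread(buf, 1, nbytes, (FILE *)iohandle);
    return (n <= 0) ? -1 : n;
}

static int writeit(char *iohandle, char *buf, int nbytes)
{
    int n = fwrite(buf, 1, nbytes, (FILE *)iohandle);
    return (n != nbytes) ? -1 : n;
}

void make_record_stream(XDR *xdr_handle, FILE *fp)
{
    /* 0 buffer sizes let the library choose suitable defaults. */
    xdrrec_create(xdr_handle, 0, 0, (char *)fp, readit, writeit);
    xdr_handle->x_op = XDR_ENCODE;   /* direction chosen before calling filters */
}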
XDR Filters XDR filters serve three functions:
1. encoding a data type,
2. decoding a data type, and
3. freeing memory that a filter may have allocated.
The filters can be divided into three categories, as described below.
Primitive Filters The XDR library's primitive filters are listed in Figure ??. They
correspond to the primitive types of the C programming language. The first two arguments
of all the filters are the same no matter what kind of data is being encoded or decoded;
primitive filters have only these two arguments. The first is a pointer to the XDR stream
handle and the second is the address of the data of interest, referred
to as an object handle. The object handle is simply a pointer to any possible data
type.
Composite Filters In addition to the primitive filters, the XDR library provides composite
filters for commonly used data types. The first two arguments of the composite filters are
the same as those of the primitive filters: a pointer to the XDR handle and an object
handle. The additional arguments depend on the particular filter. An example of a
composite filter is xdr_string(), whose synopsis is given below.
bool_t
xdr_string(xdr_handle, strp, maxsize)
XDR *xdr_handle;
char **strp;
u_int maxsize;
This filter translates between strings and their corresponding local representations.
Other examples include filters for arrays, unions, pointers, and so on.
Custom Filters The filters provided by the XDR library can be used to construct filters
for programmer-defined data types. These filters are referred to as custom filters. An
example of a custom filter is shown below.
struct date{
int day;
int month;
int year;
};
bool_t
xdr_date(xdr_handlep, adate)
XDR *xdr_handlep;
struct date *adate;
{
if (xdr_int(xdr_handlep, &adate->day) == FALSE)
return(FALSE);
if (xdr_int(xdr_handlep, &adate->month) == FALSE)
return(FALSE);
return (xdr_int(xdr_handlep, &adate->year));
}
RPC Protocol
As mentioned before, the RPC library uses a message-passing scheme to handle communication between the server and the client. The RPC protocol defines two types of
messages: call messages and reply messages. Call messages are sent by clients
to servers, requesting them to execute a remote procedure; after executing the
remote procedure, the server sends a reply message back to the client. All the fields in these
messages are XDR standard types.
Figure 7.10 shows the format of the call message. The first field, XID, is the transaction
identification field; the client basically puts a sequence number into this field. It is
mainly used to match reply messages to outstanding call messages, which is helpful
when reply messages arrive out of order. The next field, message type, distinguishes
call messages from reply messages; it is 0 for call messages. The next field is the RPC
version number, used to check whether the server supports this particular version of RPC. Following that come the remote program, version and procedure numbers, which uniquely identify
the remote procedure to be called. Next are two fields, the client credentials
and the client verifier, that identify a client user to a distributed application: the credential identifies and the verifier authenticates, just as the name on an international passport
identifies the bearer while the photograph authenticates the bearer's identity.
Figure 7.10: Call message format. The fields, in order, are: XID (unsigned), Message Type (integer = 0), RPC Version (unsigned = 2), Program Number (unsigned), Version Number (unsigned), Procedure Number (unsigned), Client Credentials (struct), Client Verifier (struct), and Arguments (procedure-defined).
Figure 7.11 shows the format of the reply message. Two kinds of reply messages are
possible: replies to successful calls and replies to unsuccessful calls. Success is defined
from the point of view of the RPC library, not of the remote procedure. A successful
reply message has a transaction ID (XID), and its message type is set to 1, identifying it as
a reply. The reply status field and accept status field together distinguish a successful
reply from an unsuccessful one; both fields are 0 in a successful reply. There is also a
server verifier. The final field of the reply message carries the results returned by the remote
procedure. Unsuccessful reply messages have the same format as successful replies up
to the reply status field, which is set to 1; the format of the remaining fields
depends on the condition that made the call unsuccessful.
Portmap Network Service Protocol
A client that needs to make a remote procedure call to a service must be able to
obtain the transport address of the service. The process of translating a service name
to its transport address is called binding to the service. This subsection describes how
binding is performed by the RPC mechanism.
A transport service provides process-to-process message transfers across the network. Each message contains a network number, a host number and a port number. A
port is a logical communication channel in a host; by waiting on a port, a process
receives messages from the network. A sending process does not send messages
directly to the receiving process but to the port on which the receiving process is waiting.
Figure 7.11: Successful reply message format. The fields, in order, are: XID (unsigned), Message Type (unsigned = 1), Reply Status (integer = 0), Server Verifier (struct), Accept Status (integer = 0), and Results (procedure-defined).
A portmap service is a network service that provides a way for a client to
look up the port number of any remote server program registered
with the service. The portmap program, or portmapper, is a binding service: it maintains a portmap, a list of port-to-program/version number correspondences on its
host. As Figure 7.12 shows, both the client and the server call the portmapper
procedures. A server program calls its host's portmapper to create a portmap entry;
clients call portmappers to obtain information about portmap entries. To find a
remote program's port, a client program sends an RPC call message to the server's
portmapper; if the remote program is registered on the server, the portmapper returns the relevant port number in an RPC reply message. The client program can
then send RPC call messages to the remote program's port. The portmapper is the
only network service that must have a well-known (dedicated) port, port number
111. Other distributed applications can be assigned port numbers as long as they
register their ports with their host's portmapper.
Clients and servers query and update a server's portmap by calling the portmapper
procedures listed in Table 7.1. After obtaining a port, a server program calls the Unset
and Set procedures of its host's portmapper: Unset clears a portmap entry if there
is one, and Set enters the server program's remote program number and
port number into the portmap. To find a remote program's port number, a client calls
the Getport procedure of the server's portmapper. The Dump procedure returns a server's
complete portmap. The portmapper's Callit procedure makes an indirect remote
procedure call, as shown in Figure 7.12.
Figure 7.12: Typical portmapping sequence. The portmapper listens on port 111 on the server machine. Legend: 1. Server registers with portmapper; 2. Client requests server's port from portmapper; 3. Client gets server's port from portmapper; 4. Client calls server.
The client passes the target procedure's program number, version number, procedure number and arguments in an RPC call
message to Callit. Callit looks up the target procedure's port number in the portmap
and sends an RPC call message to the target procedure; when the target procedure
returns its results to Callit, Callit returns them to the client program.
Table 7.1: Portmapper procedures

Number  Name     Description
0       Null     Do nothing
1       Set      Add portmap entry
2       Unset    Remove portmap entry
3       Getport  Return port for remote program
4       Dump     Return all portmap entries
5       Callit   Call remote procedure
7.5.2 RPC Programming
Remote procedure call programming uses the RPC mechanism to invoke procedures
on remote computers. Before a remote procedure on a remote computer can be invoked,
we need some way to identify that computer. Computers have unique
Figure 7.13: Remote Procedure Identification Hierarchy. A service contains one or more programs, each program has one or more versions, and each version contains one or more procedures.
identifiers (host names) to distinguish them from other computers on the network. A
client uses the host name of the server to identify the computer running the required
procedure. In Sun RPC, a remote procedure can accept one argument and return
one result. If more than one argument needs to be passed, the arguments are put into a structure
and the structure is passed to the remote procedure; similarly, if more than one result
needs to be returned, the results are grouped into a structure. Since there can be differences
in the representation of data on the client and server, the data is first encoded into XDR
format before it is sent to the remote computer, and the receiving computer decodes
the XDR-formatted data into its local format. Usually the XDR filters in the XDR library
do the encoding and decoding.
Remote Procedure Identification The remote procedures are organized as shown
in Figure 7.13. A network service consists of one or more programs, a program has
one or more versions, and a version has one or more procedures. A remote procedure is uniquely identified by a program number, a version number and a procedure
number. The range 0x20000000-0x3FFFFFFF is reserved for locally administered program numbers;
users can use numbers in this range to identify the programs they are developing.
Procedure numbers usually start at 1 and are allocated sequentially.
The RPC library provides routines that let you maintain an RPC program number database in /etc/rpc. Each entry in this database has the name of a distributed
service, a list of alias names for the service, and the program number of the service.
Figure 7.14: RPC library organization. The upper-level interface consists of callrpc() on the client side and registerrpc() and svc_run() on the server side. The lower-level client-side routines include clnt_create(), clntudp_create(), clnttcp_create(), clntraw_create(), clnt_destroy(), clnt_call(), clnt_control() and clnt_freeres(); the lower-level server-side routines include svcudp_create(), svctcp_create(), svcraw_create(), svc_destroy(), svc_getargs(), svc_freeargs(), svc_register(), svc_unregister(), svc_sendreply() and svc_getreqset(). Both sides are built on the transport library.
This database allows the user to specify the name of a distributed service instead of
a program number and to use getrpcbyname() to obtain the program number, so the
program number of a distributed service can be changed without recompiling
the service.
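As a brief sketch (assuming a hypothetical entry named "sumsvc" has been added to /etc/rpc), a program could look up the service's program number like this:

/* Sketch: looking up a program number by service name in /etc/rpc.
 * "sumsvc" is a hypothetical entry; getrpcbyname() returns NULL if
 * the name is not found. */
#include <stdio.h>
#include <netdb.h>

int main(void)
{
    struct rpcent *re = getrpcbyname("sumsvc");

    if (re == NULL) {
        fprintf(stderr, "sumsvc: unknown RPC service\n");
        return 1;
    }
    printf("program number for %s is %d\n", re->r_name, re->r_number);
    return 0;
}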
7.5.3 The RPC Library
The RPC library is divided into a client side and a server side, as shown in Figure 7.14.
The set of routines on each side is further divided into upper-level and lower-level routines.
The high-level routines are easy to use but inflexible, whereas the low-level routines
provide flexibility. These routines are discussed in detail below.
High-Level RPC Programming
High-level RPC programming refers to the simplest interface in the RPC library for
implementing remote procedure calls; it is, however, very inflexible compared to the low-level RPC routines. As shown in Figure 7.14, the high-level routines are
callrpc(), registerrpc() and svc_run(). A detailed description of these routines follows.
The registerrpc() routine is used to register a remote procedure with the RPC
library. Its synopsis is as follows:
int
registerrpc(prognum, versnum, procnum, procname, inproc, outproc)
u_long prognum, versnum, procnum;
char *(*procname());
xdrproc_t inproc, outproc;
The first three arguments are the program number, version number and procedure
number, which identify the remote procedure being registered. procname is the
address of the procedure being registered. inproc and outproc are the addresses of
the XDR filters for decoding the incoming arguments and encoding the outgoing results.
registerrpc() must be called explicitly to register each procedure with the RPC
library.
The synopsis of svc_run() is:
void
svc_run()
This routine is the RPC library's remote procedure dispatcher. It is called by
the server after the remote procedures have been registered. svc_run() waits until it
receives a request and then dispatches the request to the appropriate procedure, taking
care of decoding the arguments and encoding the results using the XDR filters.
The synopsis of callrpc() is as follows:
int
callrpc(host, prognum, vernum, procnum, inproc, in, outproc, out)
char *host;
u_long prognum, vernum, procnum;
xdrproc_t inproc, outproc;
char *in, *out;
host identifies the remote computer, and prognum, vernum and procnum identify
the remote procedure. in and out are the addresses of the input arguments and the
return values, respectively. inproc is the address of the XDR filter that encodes the
arguments, and outproc is the address of the XDR filter that decodes the return values
of the remote procedure.
An example of an RPC program written using only the high-level RPC routines is given
below. The first file is the header file sum.h, which has the common declarations for
the client and server routines.
/******************************************************************
* This is the header file for writing client and server routines
* using high-level rpc routines.
******************************************************************/
#define SUM_PROG ((u_long)0x20000000)
#define SUM_VER ((u_long)1)
#define SUM_PROC ((u_long)1)
struct inp_args {
int number1;
int number2;
};
extern bool_t xdr_args();
The next file is the client-side file client.c, which makes the RPC call.
/**********************************************************************
* This is the main client routine which makes a remote procedure call.
**********************************************************************/
#include <stdio.h>
#include <rpc/rpc.h>
#include "sum.h"
main(argc, argv)
int argc;
char *argv[];
{
struct inp_args input;
int result;
int status;
fprintf(stdout, "Input the two integers to be added: ");
fscanf(stdin,"%d %d", &(input.number1), &(input.number2));
status = callrpc(argv[1], SUM_PROG, SUM_VER, SUM_PROC, xdr_args,
&input, xdr_int, &result);
if (status == 0)
fprintf(stdout,"The sum of the numbers is %d\n", result);
else
fprintf(stdout,"Error in callrpc\n");
}
The following is the listing of server.c, which has the server-side routines.
/**********************************************************************
* This file has the server routines. Main registers and calls a
* dispatch routine which dispatches the incoming RPC requests.
**********************************************************************/
#include <rpc/rpc.h>
#include "sum.h"
main()
{
int *sum();
if (registerrpc(SUM_PROG, SUM_VER, SUM_PROC, sum, xdr_args,
xdr_int) == -1){
printf(" Error in registering rpc\n");
return(1);
}
svc_run();
}
int *sum(input)
struct inp_args *input;
{
static int result;
result = input->number1 + input->number2;
return(&result);
}
The last file is xdr.c, which has the XDR routine for encoding and decoding the arguments.
/******************************************************************
* This file has the XDR routine to encode and decode the arguments.
******************************************************************/
#include <rpc/rpc.h>
#include "sum.h"
bool_t xdr_args(xdr_handle, obj_handle)
XDR *xdr_handle;
struct inp_args *obj_handle;
{
if (xdr_int(xdr_handle, &obj_handle->number1) == FALSE)
return(FALSE);
return(xdr_int(xdr_handle, &obj_handle->number2));
}
The advantage of using the high-level RPC programming routines is that it is easy to
implement a network service, since these routines hide all the network details and
provide a transport-independent interface. The disadvantage is that they are highly
inflexible: they do not allow the user to specify the type of transport to use
(UDP is the default), nor to specify the timeout period for the callrpc()
routine (a default of 25 seconds is used).
Low-Level RPC Programming
Low-level RPC programming refers to the lowest layer of the RPC programming
interface. This layer gives the most control over the RPC mechanism and is the most
flexible layer. As mentioned before, RPC uses transport protocols for communication
between processes on different machines. To maximize its independence from
transports, the RPC library interacts with transports indirectly via sockets. A socket is
a transient object used for interprocess communication. The RPC library supports two
types of sockets: datagram sockets and stream sockets. A datagram socket provides
an interface to a datagram transport service, and a stream socket provides an interface
to a virtual circuit transport service. Datagram transports are fast but unreliable,
whereas stream sockets are slower but more reliable. Two
abstractions of the RPC library isolate the user from the transport layer: the
transport handle of type SVCXPRT and the client handle of type CLIENT. Usually
these are passed to the RPC routines and are not accessed by the user directly.
When an RPC-based application is being developed using the low-level routines, the
following needs to be done on the server side:
1. Get a transport handle.
2. Register the service with the portmapper.
3. Call the library routine to dispatch RPCs.
4. Implement the dispatch routine.
and on the client side, the following needs to be done.
1. Get a client handle.
2. Call the remote procedure.
3. Destroy the client handle when done.
Each of the above steps is explained in detail below.
Server-side routines   Three routines can be used to get a transport handle, viz.
svcudp_create(), svctcp_create(), and svcraw_create(). As is clear from the names,
svcudp_create() is used to create a transport handle for the User Datagram Protocol
(UDP) transport, svctcp_create() gets a transport handle for the Transmission Control
Protocol (TCP) transport, and svcraw_create() gets a handle for the raw transport.
Synopses of these routines are as follows.
SVCXPRT *
svcudp_create(sock)
int sock;
SVCXPRT *
svctcp_create(sock, sendsz, recvsz)
int sock;
u_long sendsz, recvsz;
SVCXPRT *
svcraw_create()
sock is an open socket descriptor. Once the transport handle is available, the
service needs to be registered, which is done using svc_register(). The synopsis of the
routine is:
bool_t
svc_register(xprt, prognum, versnum, dispatch, protocol)
SVCXPRT *xprt;
u_long prognum, versnum;
void (*dispatch)();
u_long protocol;
Draft: v.1, April 4, 1994
7.5. SUN RPC
327
This routine associates the program number prognum and version number versnum with
the service dispatch procedure dispatch(). If protocol is nonzero, a mapping of the triple
(prognum, versnum, protocol) to xprt->xp_port is established with the local portmapper.
The synopsis of the dispatch routine is:
void
dispatch(request, xprt)
struct svc_req *request;
SVCXPRT *xprt;
The argument request is the service request structure, which contains the program number,
version number, and procedure number associated with the incoming RPC request.
This dispatch routine is invoked by the RPC Library when a request associated with
this routine arrives. The RPC Library routine svc_getargs() is used to decode the
arguments to a procedure, and svc_sendreply() is used to send the results of the RPC to
the client.
Client-side routines   Four routines, clnt_create(), clntudp_create(), clnttcp_create(),
and clntraw_create(), are used to create a client handle. clntudp_create(), clnttcp_create(),
and clntraw_create() get handles for the UDP, TCP, and raw transports, respectively. The
synopsis of clnt_create() is given below; the synopses of the other routines are similar.
CLIENT *
clnt_create(host, prognum, versnum, protocol)
char *host;
u_long prognum, versnum;
char *protocol;
host identifies the remote host where the service is located. prognum and versnum are
used to identify the remote program. protocol refers to the kind of transport used and
is either "udp" or "tcp". Once a client handle is obtained, the procedure clnt_call()
can be used to initiate a remote procedure call. The synopsis of this routine is:
enum clnt_stat
clnt_call(clnt_handlep, procnum, inproc, in, outproc, out, timeout)
CLIENT *clnt_handlep;
u_long procnum;
xdrproc_t inproc, outproc;
char *in, *out;
struct timeval timeout;
This routine calls the remote procedure procnum associated with the client handle
clnt_handlep. in is the address of the input arguments and out is the address of the
memory location where the output arguments are to be placed. inproc encodes the
procedure's arguments and outproc decodes the returned values. timeout is the time
allowed for the results to come back.
The following is an example of an application written using the low-level RPC routines.
The header file sum.h and the XDR routine file xdr.c are the same as in the program
written using the high-level routines. Only the main files server.c and client.c are given
below. The steps given above are commented in the program.
#include <stdio.h>
#include <rpc/rpc.h>
#include "sum.h"
main()
{
SVCXPRT *xport_handle;
void dispatch();
/* 1. Get the transport handle. */
xport_handle = svcudp_create(RPC_ANYSOCK);
if (xport_handle == NULL){
fprintf(stderr,"Error. Unable to create transport handle\n");
return(1);
}
/* 2. Register the service. */
(void)pmap_unset(SUM_PROG, SUM_VERS);
if(svc_register(xport_handle, SUM_PROG, SUM_VERS, dispatch,IPPROTO_UDP)
== FALSE){
fprintf(stderr,"Error. Unable to register the service.\n");
svc_destroy(xport_handle);
return(1);
}
/* 3. Call the dispatch routine. */
svc_run();
fprintf(stderr,"Error. svc_run() shouldn't return\n");
svc_unregister(SUM_PROG, SUM_VERS);
svc_destroy(xport_handle);
return(1);
}
/* 4. Implement the dispatch routine. */
void dispatch(rq_struct, xport_handle)
struct svc_req *rq_struct;
SVCXPRT *xport_handle;
{
struct inp_args input;
int *result;
int *sum();
switch (rq_struct->rq_proc){
case NULLPROC:
svc_sendreply(xport_handle, xdr_void, 0);
return;
case SUM_PROC:
if (svc_getargs(xport_handle, xdr_args, &input) == FALSE){
fprintf(stderr,"Error. Unable to decode arguments.\n");
return;
}
result = sum(&input);
if (svc_sendreply(xport_handle, xdr_int,result) ==
FALSE){
fprintf(stderr,"Error. Unable to send the reply.\n");
return;
}
break;
default:
svcerr_noproc(xport_handle);
break;
}
}
int *sum(input)
struct inp_args *input;
{
static int result;
result = input->number1 + input->number2;
return(&result);
}
The following is the client program which makes a remote procedure call.
#include <stdio.h>
#include <rpc/rpc.h>
#include "sum.h"
struct timeval timeout = { 25, 0 };
main(argc, argv)
int argc;
char *argv[];
{
CLIENT *clnt_handle;
int status;
struct inp_args input;
int result;
fprintf(stderr, "Input the two integers to be added: ");
fscanf(stdin,"%d %d", &(input.number1), &(input.number2));
/* 1. Get a client handle. */
clnt_handle = clnt_create(argv[1], SUM_PROG, SUM_VERS, "udp");
if (clnt_handle == NULL){
fprintf(stderr,"Error. Unable to create client handle.\n");
return(1);
}
/* 2. Make an RPC call. */
status = clnt_call(clnt_handle, SUM_PROC, xdr_args,
&input, xdr_int, &result, timeout);
if (status == RPC_SUCCESS)
fprintf(stderr,"Sum of the numbers is %d\n", result);
else
fprintf(stderr,"Error. RPC call failed.\n");
/* 3. Destroy the client handle when done. */
clnt_destroy(clnt_handle);
}
7.5.4 Additional Features of RPC
This subsection discusses additional features of Sun RPC, such as asynchronous RPC,
broadcast RPC, and batch-mode RPC.
Asynchronous RPC In a normal RPC discussed so far, a client sends a request
to the server and waits for a reply from the server. This is synchronous RPC. In
contrast, asynchronous RPC is one in which, after sending a request to the server,
a client does not wait for a reply, but continues execution. If it needs to obtain
a reply, it has to make some other arrangements. There are three mechanisms in
RPC Library to handle asynchronous RPC, viz. nonblocking RPC, callback RPC
and asynchronous broadcast RPC.
Nonblocking RPC can be used in simple situations where a one-way message passing
scheme is needed. For example, when synchronization messages need to be sent, it may
be acceptable even if some of the messages are lost; retransmissions and acknowledgements
would be unnecessary. This type of nonblocking RPC can be accomplished by setting the
timeout value in clnt_call() to zero, which causes the client to time out immediately after
making the call.
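The following is a minimal sketch of this idea (it assumes the sum.h definitions and
the xdr_args() routine from the earlier example); the RPC_TIMEDOUT status that the
zero-timeout call normally reports is simply ignored.

#include <rpc/rpc.h>
#include "sum.h"

extern bool_t xdr_args();                  /* from xdr.c */

/* Send the request as a one-way message: the zero timeout makes clnt_call()
 * return immediately, and the RPC_TIMEDOUT status it reports is not treated
 * as an error. */
void notify_sum(clnt_handle, input)
CLIENT *clnt_handle;
struct inp_args *input;
{
    static struct timeval no_wait = { 0, 0 };

    (void) clnt_call(clnt_handle, SUM_PROC, xdr_args, (char *)input,
                     xdr_void, (char *)NULL, no_wait);
}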
Callback RPC is the most powerful of the asynchronous methods. It allows fully
asynchronous RPC communication between clients and servers by enabling any application
to be both a client and a server. In order to initiate an RPC callback, the server needs a
program number on which to call the client back. Usually the client registers a callback
service, using a program number in the range 0x40000000-0x5FFFFFFF, with the local
portmap program and registers the dispatch routine. The program number is sent to the
server as part of the RPC request. The server uses this number when it is ready to call
back the client. The client must be waiting for the callback request. To improve
performance, the client can send the port number instead of the program number to the
server; then the server need not send a request to the client-side portmapper for the port
number. If the client calls svc_run() to wait for the callback requests, it will not be able
to do any other processing, so another process needs to be spawned that calls svc_run()
and waits for the requests.
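A common way for the client to obtain such a transient program number is to probe
the transient range with pmap_set() until an unused number is claimed; the following
is a small sketch of that technique (gettransient() is a hypothetical helper name, not
an RPC Library routine), after which the callback dispatch routine is registered under
the returned number.

#include <rpc/rpc.h>
#include <rpc/pmap_clnt.h>

/* Claim an unused transient program number by registering it, together with
 * the callback service's protocol and port, with the local portmapper.  The
 * number returned is then sent to the server inside the normal RPC request. */
u_long gettransient(proto, vers, port)
u_long proto, vers;
u_short port;
{
    u_long prognum;

    for (prognum = 0x40000000; prognum < 0x60000000; prognum++)
        if (pmap_set(prognum, vers, proto, port))
            return(prognum);        /* this number is now ours */
    return(0);                      /* no free transient number found */
}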
In broadcast RPC, the client sends a broadcast packet for a remote procedure to the
network and waits for numerous replies. Broadcast RPC treats all unsuccessful responses
as garbage and filters them out without passing the results from such responses to the
user. The remote procedures that support broadcast RPC typically respond only when the
request is successfully processed and remain silent when they detect an error. Broadcast
RPC requests are sent to the portmappers, so only services that register themselves with
their portmapper are accessible via the broadcast RPC mechanism. The routine
clnt_broadcast() is used to do broadcast RPC.
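As a sketch of how a broadcast call is issued (reusing the sum.h definitions and
xdr_args() from the low-level example; the reply-collecting routine and its behaviour
are illustrative assumptions), clnt_broadcast() hands each successful reply, together
with the responder's address, to a caller-supplied routine, and returning FALSE from
that routine asks the library to keep waiting for further replies.

#include <stdio.h>
#include <rpc/rpc.h>
#include <rpc/pmap_clnt.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include "sum.h"

extern bool_t xdr_args();                   /* from xdr.c */

/* Called once for every successful reply; return TRUE to stop waiting. */
bool_t collect_reply(out, addr)
char *out;
struct sockaddr_in *addr;
{
    printf("sum = %d from %s\n", *(int *)out, inet_ntoa(addr->sin_addr));
    return(FALSE);                          /* keep collecting replies */
}

void broadcast_sum(input)
struct inp_args *input;
{
    int result;

    (void) clnt_broadcast(SUM_PROG, SUM_VERS, SUM_PROC,
                          xdr_args, (char *)input,
                          xdr_int, (char *)&result, collect_reply);
}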
If a client has many requests to send but does not need to wait for a reply until the
server has received all the requests, the requests can be sent in a batch to reduce the
network overhead. Batch-mode RPC is suitable for streams of requests that are related
but make more sense structured as separate requests rather than as one large request.
The RPC requests to be queued must not, themselves, expect any replies. They are sent
with a timeout value of zero, as with nonblocking RPC.
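The following is a small sketch of the batching convention (it assumes a TCP client
handle and the sum.h definitions from the earlier examples; batching needs a reliable,
connection-oriented transport): each queued call passes a null result decoder and a
zero timeout, and the pipeline is flushed by issuing one ordinary, non-batched call at
the end.

#include <rpc/rpc.h>
#include "sum.h"

extern bool_t xdr_args();                  /* from xdr.c */

void send_batch(clnt_handle, reqs, n)
CLIENT *clnt_handle;
struct inp_args reqs[];
int n;
{
    static struct timeval no_wait = { 0, 0 };
    static struct timeval wait25  = { 25, 0 };
    int result, i;

    /* Queue the first n-1 requests; no reply is expected for these. */
    for (i = 0; i < n - 1; i++)
        (void) clnt_call(clnt_handle, SUM_PROC, xdr_args, (char *)&reqs[i],
                         (xdrproc_t)NULL, (char *)NULL, no_wait);

    /* The final, ordinary call flushes the whole pipeline to the server. */
    (void) clnt_call(clnt_handle, SUM_PROC, xdr_args, (char *)&reqs[n - 1],
                     xdr_int, (char *)&result, wait25);
}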
7.5.5 RPCGEN
Rpcgen is a program that assists in developing RPC-based distributed applications by
generating all the routines that interact with the RPC Library, thus relieving the
application developer of the network details. Rpcgen is a compiler which takes in code
written in an interface specification language, called RPC Language (RPCL), and generates
code for the client stubs, the server skeleton, and the XDR routines. Client stubs act as
interfaces between the actual clients and the network services. Similarly, the server
skeleton hides the network from the server procedures invoked by remote clients. Thus all
the user needs to do is to write the server procedures and link them with the server
skeleton and XDR routines generated by rpcgen to get an executable server program.
Similarly, for using a network service, the user has to write client programs that make
ordinary local procedure calls to the client stubs. Figure 7.15 shows how a client program
and a server program are obtained from the client application, the remote-program
protocol specification, and the server procedures. The rest of this section describes how a
simple RPC-based application can be constructed using rpcgen.
The protocol specification for the program is given below.
/************************************************************************
* This is the protocol specification file for a remote procedure sum
* which is a trivial procedure, taking in a structure of two integers and
* returning the sum of them.
************************************************************************/
/*****************************************************************
* The structure for passing the input arguments to the remote
* procedure.
*****************************************************************/
struct inp_args {
int number1;
int number2;
};
/****************************************************************
* The following is the specification of the remote procedure.
****************************************************************/
program SUM_PROG{
version SUM_VERS_1{
int SUM(inp_args) = 1;
} = 1;
} = 0x20000000;
It defines one procedure, SUM, for a version, SUM_VERS_1, of the remote program,
SUM_PROG. These three values uniquely identify a remote procedure. When this is
compiled using rpcgen, we obtain the following.
1. A header file sum.h that #defines SUM_PROG, SUM_VERS_1, and SUM. It also
contains the declaration of the XDR routine, in this case xdr_inp_args().
2. A file sum_clnt.c that has the client stub routines which interact with the RPC
Library.
3. A file sum_svc.c which has the server skeleton. The skeleton consists of the main()
routine and a dispatch routine sum_prog_1(). Notice that main tries to create
transport handles for both the UDP and TCP transports. If only one type of
transport needs to be created, command-line options for rpcgen should indicate
that. The dispatch routine dispatches the incoming remote procedure calls to
the appropriate procedure.
4. A file sum_xdr.c which has the XDR routine for encoding to and decoding from
the XDR representation.
The listings of all the files generated when sum.x is compiled by rpcgen are given
below. The following is the header file sum.h.
/*
* Please do not edit this file.
* It was generated using rpcgen.
*/
#include <rpc/types.h>
struct inp_args {
int number1;
int number2;
};
typedef struct inp_args inp_args;
bool_t xdr_inp_args();
#define SUM_PROG ((u_long)0x20000000)
#define SUM_VERS_1 ((u_long)1)
#define SUM ((u_long)1)
extern int *sum_1();
The following is the file sum_clnt.c.
/*
* Please do not edit this file.
* It was generated using rpcgen.
*/
#include <rpc/rpc.h>
#include "sum.h"
/* Default timeout can be changed using clnt_control() */
static struct timeval TIMEOUT = { 25, 0 };
int *
sum_1(argp, clnt)
inp_args *argp;
CLIENT *clnt;
{
static int res;
bzero((char *)&res, sizeof(res));
if (clnt_call(clnt, SUM, xdr_inp_args, argp, xdr_int,&res,TIMEOUT)
!= RPC_SUCCESS) {
return (NULL);
}
return (&res);
}
The following is the file sum_svc.c, which has the server skeleton and the dispatch
routine.
/*
* Please do not edit this file.
* It was generated using rpcgen.
*/
#include <stdio.h>
#include <rpc/rpc.h>
#include "sum.h"
static void sum_prog_1();
main()
{
register SVCXPRT *transp;
(void) pmap_unset(SUM_PROG, SUM_VERS_1);
transp = svcudp_create(RPC_ANYSOCK);
if (transp == NULL) {
fprintf(stderr, "cannot create udp service.");
exit(1);
}
if (!svc_register(transp, SUM_PROG, SUM_VERS_1, sum_prog_1, IPPROTO_UDP)) {
fprintf(stderr, "unable to register (SUM_PROG, SUM_VERS_1, udp).");
exit(1);
}
transp = svctcp_create(RPC_ANYSOCK, 0, 0);
if (transp == NULL) {
fprintf(stderr, "cannot create tcp service.");
exit(1);
}
if (!svc_register(transp, SUM_PROG, SUM_VERS_1, sum_prog_1, IPPROTO_TCP)) {
fprintf(stderr, "unable to register (SUM_PROG, SUM_VERS_1, tcp).");
exit(1);
}
svc_run();
fprintf(stderr, "svc_run returned");
exit(1);
/* NOTREACHED */
}
static void
sum_prog_1(rqstp, transp)
struct svc_req *rqstp;
register SVCXPRT *transp;
{
union {
inp_args sum_1_arg;
} argument;
char *result;
bool_t (*xdr_argument)(), (*xdr_result)();
char *(*local)();
switch (rqstp->rq_proc) {
case NULLPROC:
(void) svc_sendreply(transp, xdr_void, (char *)NULL);
return;
case SUM:
xdr_argument = xdr_inp_args;
xdr_result = xdr_int;
local = (char *(*)()) sum_1;
break;
default:
svcerr_noproc(transp);
return;
}
bzero((char *)&argument, sizeof(argument));
if (!svc_getargs(transp, xdr_argument, &argument)) {
svcerr_decode(transp);
return;
}
result = (*local)(&argument, rqstp);
if (result != NULL && !svc_sendreply(transp, xdr_result, result)) {
svcerr_systemerr(transp);
}
if (!svc_freeargs(transp, xdr_argument, &argument)) {
fprintf(stderr, "unable to free arguments");
exit(1);
}
return;
}
The last file generated by rpcgen is sum_xdr.c, whose listing is given below.
/*
* Please do not edit this file.
* It was generated using rpcgen.
*/
#include <rpc/rpc.h>
#include "sum.h"
bool_t
xdr_inp_args(xdrs, objp)
XDR *xdrs;
inp_args *objp;
{
if (!xdr_int(xdrs, &objp->number1)) {
return (FALSE);
}
if (!xdr_int(xdrs, &objp->number2)) {
return (FALSE);
}
return (TRUE);
}
As we have seen in the previous subsections, all these routines had to be written by the
user when the high-level or low-level RPC routines were used to develop an application.
The following gives the client main program, which makes an ordinary local procedure
call to the client stub sum_1(), and the sum procedure itself, which is called by the
server skeleton.
#include <stdio.h>
#include <rpc/rpc.h>
#include "sum.h"
main(argc, argv)
int argc;
char *argv[];
{
CLIENT *clnt_handle;
int *result;
inp_args input;
clnt_handle = clnt_create(argv[1], SUM_PROG, SUM_VERS_1, "udp");
if (clnt_handle == NULL){
printf("Unable to create client handle.\n");
return(1);
}
fprintf(stdout,"Input the integers to be added : ");
fscanf(stdin,"%d %d", &(input.number1), &(input.number2));
result = sum_1(&input, clnt_handle);
if (result == NULL)
fprintf(stdout,"Remote Procedure Call failed\n");
else
fprintf(stdout,"The sum of the numbers is %d\n", *result);
}
The following is the sum procedure itself.
#include <stdio.h>
#include <rpc/rpc.h>
#include "sum.h"
int *sum_1(input)
inp_args *input;
{
static int result;
result = input->number1 + input->number2;
return(&result);
}
Figure 7.15 shows how the executable client and server programs are constructed.
Bibliography
[1] A. D. Birrell and B. J. Nelson, "Implementing Remote Procedure Calls," ACM
Transactions on Computer Systems, Feb. 1984.
[2] S. Wilbur and B. Bacarisse, "Building Distributed Systems with Remote Procedure
Calls," Software Engineering Journal, Sept. 1987.
[3] B. N. Bershad, D. T. Ching, et al., "A Remote Procedure Call Facility for Interconnecting
Heterogeneous Computer Systems," IEEE Transactions on Software Engineering, Aug. 1987.
[4] G. Coulouris, "Distributed Systems: Concepts and Design."
[5] J. Bloomer, "Ticket to Ride: Remote Procedure Calls in a Network Environment,"
SunWorld, November 1991, pp. 39-55.
[6] S. K. Shrivastava and F. Panzieri, "The Design of a Reliable Remote Procedure Call
Mechanism," IEEE Transactions on Computers, July 1982.
[7] A. Tanenbaum, Modern Operating Systems.
[8] W. Richard Stevens, UNIX Network Programming, Ch. 18.
[9] M. Mallett, "A Look at Remote Procedure Calls," BYTE, May 1991.
[10] P. B. Gibbons, "A Stub Generator for Multilanguage RPC in Heterogeneous
Environments," IEEE Transactions on Software Engineering, Vol. SE-13, No. 1,
January 1987, pp. 77-86.
[11] B. N. Bershad, D. T. Ching, E. D. Lazowska, J. Sanislo, and M. Schwartz, "A Remote
Procedure Call Facility for Interconnecting Heterogeneous Computer Systems,"
IEEE Transactions on Software Engineering, Vol. SE-13, No. 8, August 1987, pp. 872-893.
[12] R. Hayes and R. D. Schlichting, "Facilitating Mixed Language Programming in
Distributed Systems," IEEE Transactions on Software Engineering, Vol. SE-13, No. 12,
December 1987.
[13] A. S. Tanenbaum, Modern Operating Systems, Prentice Hall, ISBN 0-13-588187-0.
[14] S. A. Yemini, G. S. Goldszmidt, A. D. Stoyenko, and Y. Wei, "CONCERT: A
High-Level-Language Approach to Heterogeneous Distributed Systems," IEEE Conference
on Distributed Computing Systems, 1989, pp. 162-171.
[15] J. Bloomer, "Ticket to Ride," SunWorld, November 1991, pp. 39-55.
[16] A. D. Birrell and B. J. Nelson, "Implementing Remote Procedure Calls," Xerox
CS 83-7, October 1983.
[17] G. F. Coulouris and J. Dollimore, Distributed Systems, Addison-Wesley Publishing
Co., 1988, pp. 81-113.
[18] C. McManis and V. Samar, "Solaris ONC, Design and Implementation of
Transport-Independent RPC," Sun Microsystems Inc., 1991.
[19] Sun Microsystems, Network Programming Guide, Part Number 800385010,
Revision A of 27 March, 1990, pp. 31-165.
Chapter 8
Interprocessor Communication: Message Passing and Distributed Shared Memory
8.1 Introduction
In the development of a distributed computing architecture, one of the primary issues
that a designer must address is how to coordinate the communication activity between
concurrently executing processes. In dealing with the various possible forms of
communication, it is possible to classify each form into one of two basic categories. The
first, known as synchronous communication, requires that the sending and receiving
processes maintain a one-to-one interaction with each other in order to coordinate their
activities. In this scenario, when a message is output, the sending process waits for the
receiving process to respond with an acknowledgement before continuing with its own
processing. This activity also occurs in the process which receives the message: at this
end of the communication, the process must explicitly wait for a message to be received
before sending out an acknowledgement for it. The second class of communication is
known as asynchronous communication. In asynchronous communication, the sender of a
message never has to wait for a receiver's response before proceeding with its execution.
If a sender transmits multiple messages to a receiver, however, the receiver must be
designed so as to have sufficient buffer space to store the incoming message(s), or the
message(s) may be lost.
The implementation of either synchronous or asynchronous communication depends upon
the type of model utilized to define the process interaction. There are three types of
communication models: remote procedure calls (RPC), message passing, and monitors. In
the previous chapter, we addressed the design issues for remote procedure call systems.
In this chapter we focus on message passing tools and distributed shared memory.
8.2 Message Passing
A message is simply a physical and logical piece of information that is passed from one
process to another. A typical message usually consists of a header and trailer, used to
define the source and destination addresses as well as error-checking information, and a
body that contains the information to be transmitted. In this context, the message value
is moved from one place to another, where it can be edited or modified. This is
equivalent to passing information by value.
The implementation of interprocess communication using the message passing scheme can
be characterized into two different categories. The first of these is known as the
"Dialogue Approach". In this technique, a user must first establish a temporary
connection with the desired resource and then gain permission to access that resource.
The user's request then initiates the response and activity of the resource. The resource,
however, is not controlled by the user. The main reason for choosing this technique is
that most of the communicating processes exist in different computing environments; it
is difficult in such environments to share global variables or make references to the
various possible environments. Furthermore, this scheme simplifies the task of allowing
a large number of processes to communicate with one another simultaneously.
The second category for implementing message passing is known as the "Mailbox System
Approach". In this approach messages are not sent directly between the sending and
receiving processes, but are instead redirected into an intermediate buffer called a
mailbox. This allows information to be transferred without concern for whether or not it
can be immediately processed, as it can be stored until a process is ready to access it.
By creating an environment based on this principle, neither the sender nor the receiver
is restricted to a specific number of messages output or processed in a given time
period, which allows greater freedom in their communication.
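The following is a small C sketch of the mailbox idea (the fixed-size message and
mailbox types are invented for illustration and are not tied to any particular tool):
the mailbox is a bounded queue that decouples the sender from the receiver; a real
implementation would additionally protect it with locking or place it in a separate
server process.

#define MBOX_SIZE 16                  /* maximum messages buffered in the mailbox */
#define MSG_LEN   64

struct message { char body[MSG_LEN]; };

struct mailbox {
    struct message slots[MBOX_SIZE];
    int head, tail, count;            /* circular-buffer bookkeeping */
};

/* Sender side: deposit a message if the mailbox is not full (non-blocking). */
int mbox_send(struct mailbox *mb, struct message *m)
{
    if (mb->count == MBOX_SIZE)
        return(-1);                   /* mailbox full */
    mb->slots[mb->tail] = *m;
    mb->tail = (mb->tail + 1) % MBOX_SIZE;
    mb->count++;
    return(0);
}

/* Receiver side: pick up a message whenever the process is ready for one. */
int mbox_recv(struct mailbox *mb, struct message *m)
{
    if (mb->count == 0)
        return(-1);                   /* nothing waiting */
    *m = mb->slots[mb->head];
    mb->head = (mb->head + 1) % MBOX_SIZE;
    mb->count--;
    return(0);
}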
This section describes and compares different software tools available to assist in
developing parallel programs on distributed systems. In particular, this section briefly
describes and compares Linda, ISIS, Express, and PVM. It also compares these software
tools with the RPC mechanism. There is also a significant amount of work being done on
changes to programming languages to develop a parallel FORTRAN or a parallel C++.
8.2.1 Linda
Linda is a programming tool (environment) developed at Yale University and is designed
to support parallel/distributed programming using a minimum number of simple operations
[1, 2]. The Linda language emphasizes the use of replication and distributed data
structures [3]. It utilizes a 'Tuple Space' as a method of passing data between processes
and as a method of starting new processes that are part of that application [4]. In what
follows, we describe the operators that are used to access the tuple space, how the tuple
space has been designed on different systems to pass tuples between processors, and how
tuples are used to generate new processes.

Operator    Definition
OUT(u)      Generates a tuple u and inserts it into the Tuple Space (TS)
            without blocking the calling process.
IN(e)       Finds a tuple in TS whose format matches the template e and
            returns that tuple, deleting it from TS. If no tuple matches
            the template, the calling process is blocked until another
            process generates a matching tuple.
READ(e)     Same as IN() except that the tuple remains in TS.
INP(e)      Same as IN() except that if no matching tuple is found, FALSE
            is returned and the process is not blocked (removed from
            Linda by 1989).
READP(e)    Same as INP() except that the tuple remains in TS (removed
            from Linda by 1989).
EVAL(u)     Used to create live tuples (added by 1989).

Table 8.1: Linda operators
Linda uses the tuple space to communicate between all processes and to create new
processes. In order to access the tuple space, Linda originally defined five operators;
by 1989 this list was reduced to four operators. All of the operators are listed in
Table 8.1. The tuple defined in an OUT() call can be thought of as a list of parameters
that are being sent to another process. The templates in the IN(), READ(), INP(), and
READP() calls are lists whose elements are actual or formal parameters. An actual
parameter contains a value of a given type, and a formal parameter consists of a defined
variable that can be assigned a value from a tuple.
In order for a tuple to match a template, they must have the same structure (the same
number, type, and order of parameters) and any actual parameters in the template must
match the corresponding values in the tuple. The result of a tuple matching a template
is that the formal variables in the template are assigned the values in the tuple. The
implementation of these operators and of the Tuple Space that handles these operations
is what is required to implement Linda on a system. The EVAL() call can be thought of
as the method of generating a new concurrent process. Tuple Space in general is a shared
data pool where tuples are stored, searched, retrieved, and deleted. Because all of the
processes must be able to search the tuple space, retrieve a tuple, and delete the tuple
as an atomic operation, implementing Linda on a shared-memory multiprocessor system is
simpler than on a distributed system. In this section, we focus on how to implement
Linda on a distributed computing environment.
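As a small illustration of tuple matching (a sketch in the C-Linda style of Figure 8.1,
not taken from any particular Linda implementation; the "result" tag is chosen for the
example and "? res" uses the C-Linda notation for a formal parameter):

/* Producer: deposit the result of task i into Tuple Space. */
out("result", i, value);

/* Consumer: block until a tuple of the form ("result", i, <int>) exists,
 * remove it from Tuple Space, and assign its last field to the formal
 * parameter res. */
int res;
in("result", i, ? res);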
Linda uses the tuple space to generate concurrently operating processes by generating a
"live tuple" [4]. N. Carriero and D. Gelernter [4] describe a live tuple as a tuple that
"carries out some specified computation on its own, independent of the process that
generated it, and then turns into an ordinary, data object tuple". The OUT() or EVAL()
call can manipulate a live tuple and start a concurrent process on another node. The
advantage of Linda is that the application programmer only needs to write the processes
that need to be run and to generate, in main, the live tuples that need to be started.
When the data tuples are available for the live tuple, it will be executed and will
generate a data tuple as its output. Figure 8.1 shows an example of the dining
philosophers problem in C-Linda, as written up by N. Carriero and D. Gelernter [4].
8.2.2 ISIS
ISIS is a message passing tool developed by K. Birman and others at Cornell University
to provide a fault-tolerant, simplified programming paradigm for distributed systems
[6, 4]. In this reference, an application is described that uses a client-server paradigm,
where the servers run on all of the workstations and the client dispatches jobs to the
servers and accumulates the results. Because of ISIS's fault tolerance, as long as the
client stays alive and at least one server is available at any time, the run will continue.
ISIS automatically keeps track of all of the processes currently running under it. If a
node crashes, the client is notified and the lost process is restarted. ISIS supports a
virtually synchronous paradigm that allows a process to infer the state and actions of
remote processes using local state information [6]. ISIS was designed to be a
fault-tolerant system and, due to its fault tolerance, was not optimized for performance.
This lack of performance in the initial release was the main area of concentration in
improving the system.
8.2.3 Express
Express is a message passing tool that was developed by Parasoft. Express consists of
compilers, libraries, and development tools. The development tools include a parallel
debugging environment, a performance analysis system, and an automatic parallelization
tool (ASPAR) [7].
phil(i)
int i;
{
    while (1) {
        think();
        in("room ticket");
        in("chopstick", i);
        in("chopstick", (i+1) % Num);
        eat();
        out("chopstick", i);
        out("chopstick", (i+1) % Num);
        out("room ticket");
    }
}

initialize()
{
    int i;

    for (i = 0; i < Num; i++) {
        out("chopstick", i);
        eval(phil(i));
        if (i < (Num - 1)) out("room ticket");
    }
}

Figure 8.1: C-Linda solution to the dining philosophers problem.
      PROGRAM EX1
C
C--- Start up Express
C
      CALL KXINIT
C
      WRITE (6,*) 'Hello World'
      STOP
      END

Figure 8.2: Express 'Hello world' program.
One of the main goals of Express [8] was that the host operating system would not change
and that it could be used with code already available. For example, if Express is loaded
onto a UNIX system, it does not change the system when running single-processor programs
or commands, unlike a parallel operating system. As for using code that is already
available, Express builds on top of languages such as FORTRAN and C rather than trying to
replace them. Some of the more useful features of Express are the different operating
modes and the ability to run both Host-Node applications and Cubix applications.
The operating modes supported in Express simplify the process of allowing a single
program to be run on all of the nodes. An example of using single mode is the first
example program in the user's guide [8], a simple 'Hello world' program (illustrated in
Figure 8.2). This is the same code that could be written and run on a serial system;
when run under Express it produces the same result as on the serial system (even if run
on multiple processors). This works because Express automatically synchronizes the
processes and only prints to the screen once when in single mode (KSINGL(UNIT), the
default start-up mode). Single mode is designed to allow the application to request data
once for all of the nodes, get the data, and send the data to all of the nodes.
In multiple mode (KMULTI(UNIT)), each processor can make its own requests; however, the
requests are buffered until a KFLUSH command is reached. The KFLUSH command becomes a
synchronization barrier for the application. Once all of the nodes have reached the
KFLUSH command, the outputs (and/or inputs) are performed in order from processor 0
upward. The use of multiple mode can be seen in example 2 in the user's guide [8]
(Figure 8.3).
Express also supports asynchronous I/O (KASYNC(UNIT)), which allows the outputs/inputs
to occur as soon as they are produced by a processor.
      PROGRAM EX2
C
      INTEGER ENV(4)
C
C--- Start up Express
      CALL KXINIT
C
C--- Get runtime parameters
      CALL KXPARA(ENV)
C--- Read a value
      WRITE(6,*) 'Enter a value'
      READ (5,*) IVAL
C--- Now have each processor execute independently
      CALL KMULTI(6)
      WRITE(6,10) ENV(1), ENV(1), IVAL, ENV(1)*IVAL
 10   FORMAT(1X,'I am node',I4,' and ',I4,' times',I4,' equals ',I4)
      CALL KFLUSH(6)
C
      STOP
      END

Figure 8.3: Express simple multiple-mode program.
By starting out in single mode, the program can get all of the 'common' runtime inputs.
Then, by switching into multiple mode and using the ENV variable (which is initialized by
Express when it starts the programs), it is possible to split up the calculations across
all of the processors working on the job. This can be done without requiring any message
passing by the application program. The ability to switch modes automatically simplifies
the synchronization barriers and thus simplifies parallel/distributed programming.
Furthermore, Express supports two types of programming models: the Host-Node and Cubix
models. In a Host-Node application, there is a host program that usually gets the user
inputs, distributes the inputs to all of the node processes, gathers the results from the
node processes, and outputs the final results. This style of parallel/distributed
programming is also called client-server. In a Cubix application, only one program is
written and it is executed on all of the processors working on the job. This does not
mean that all the processors execute the same instruction stream; they might branch
differently depending on the processor number assigned at start-up. It is in the Cubix
style of programming that the modes become powerful. Without the ability to distinguish
modes, an application in Cubix mode would end up running like a host-node application
with all of the code compiled into one program. This ability to switch between styles is
powerful because, when starting to write parallel programs (especially when using RPC),
it is usually easier to put programs into a host-node style. On the other hand, the Cubix
style can be very useful for porting serial code into Express because it allows the code
to request the input parameters as in the serial program.
8.2.4 Parallel Virtual Machine (PVM)
PVM is a message passing tool that supports the development of parallel/distributed
applications for a 'collection of heterogeneous computing elements' [9]. To enhance the
PVM programming environment, HeNCE (Heterogeneous Network Computing Environment) has
been developed [11]. HeNCE is an X-window based application that allows the user to
describe a parallel program by a task graph. In what follows, we briefly describe how
PVM selects a computing element to execute a process, how inter-process communication
is done in PVM, and how HeNCE helps the application programmer write a parallel
application. PVM selects a computing element to run a process on by using a component
description file. The term 'component' describes a part of a program that can be
executed remotely. The component description file is a list of component names,
locations, object file locations, and architectures (an example component description
file is shown in Figure 8.4, copied from [9]). When an application requests that a
component be executed, this table is used to look up the type of computer that it can
be run on and where to find the executable.
Name      Location   Object file              Architecture
factor    iPSC       /u0/host/factor1         ipsc
factor    msrsun     /usr/alg/math/factor     sun3
factor    msrsun     /usr/alg4/math/factor    sun4
chol      csvax2     /usr/matrix/chol         vax
chol      vmsvax     JOE:CHOL.OBJ             vms
tool      msrsun     /usr/view/graph/obj      sun3
factor2   iPSC       /u0/host/factor1         ipsc

Figure 8.4: PVM component description file
A point of interest is the way the component 'factor' is represented by three different
executables for three different architectures. This allows PVM to select the best
computer to start the process on based on what computers are available, what
architectures the component can execute on, and a load factor on the machines that are
available and in a designated architecture. Another point of the component description
file is that a specific architecture can be selected by creating a new component with
the same executable. For example, factor2 (in Figure 8.4) is the same executable as
factor for the ipsc architecture; however, if an instance of factor2 is requested, it
must execute on a computer defined to be in the ipsc architecture. The calling component
can then communicate with the new process by using the component name and an instance
number (the instance number is returned when the call is made to start a component as a
new process). PVM also allows an application program to 'register' a component name by
using the entercomp() function, which has four parameters: the component name, the
object filename, the object file location, and an architecture tag (the same information
that is stored in the component description file).
A simple example of how an application component starts a new process was presented by
V. Sunderam in [9] and is reproduced in Figure 8.5. In this example, ten copies of the
component 'factor' are started and the instance numbers are stored in an array. The
initiate() call asynchronously starts a process of the component type; however, if the
new process can only be started after another process has ended, the initiateP()
function will do this.
PVM communication is based on a simple message passing mechanism using the PVM send()
and recv() functions, or on shared memory constructs using shmget, shmat, shmdt, and
shmfree. In what follows, we first discuss the PVM message passing mechanisms and then
the shared memory constructs. The send() function requires three parameters: a component
name, the instance number, and a type. The recv() function requires one parameter: a
type.
enroll (\startup");
for (i = 0; i < 10; i++)
instance[i] = initiate (\factor");
Figure 8.5: PVM how to initiate multiple component instances
The type parameter is used to order the messages received by the receiving process. The
component name and instance number are translated automatically by PVM into the
processor that is running that process. An example of the use of the PVM send and
receive functions was given by V. Sunderam [9] and is reproduced in Figure 8.6. One
important difference between PVM and most heterogeneous communication systems (such as
SUN XDR) is that PVM does not translate into a machine-independent format on sending and
translate back to the machine-dependent format on receiving. According to G. Geist and
V. Sunderam [10], PVM chooses a 'standard' format based on the format that is common to
the majority of the computers in the pool, and all communication is done in this format.
This is based on the theory that in most environments there is one architecture that
most of the computers adhere to, and this method allows them to communicate with no
translations being done. PVM also contains two variations on the recv() function:
recv1() and recv2(). If recv() is used, the process is blocked until a message of that
type is received. If recv1() is used, the process is blocked until either a message of
the type is received or a maximum number of messages of other types have been received.
Finally, if recv2() is used, the process is blocked until either a message of the type is
received or a time-out has been reached. PVM also provides a broadcast primitive that
sends a message to all instances of a specified component.
Along with the message passing functions is a set of shared memory functions. The use of
the shared memory functions in PVM must be fully thought out before using them to
develop parallel/distributed applications [9]. This is because on distributed computing
systems the shared memory must be emulated by PVM, and this leads to a degradation in
performance. The PVM shared memory functions are similar to the UNIX shared memory
inter-process functions. First, the shared memory segment is allocated using the
function shmget() with two parameters: a name and a size. Next, the memory segment is
attached to using a shmat function. Here PVM provides three different versions: shmat(),
shmatfloat(), and shmatint(). The shmat() function attaches an untyped block of memory,
shmatfloat() attaches a block of memory that holds floating point values, and shmatint()
attaches an array of integers. An example of using PVM shared memory communication is
shown in Figure 8.7 [9].
/* Sending Process */
/* --------------- */
initsend();                         /* Initialize send buffer */
putstring("The square root of ");   /* Store values in        */
putint(2);                          /* machine-independent    */
putstring("is ");                   /* form                   */
putfloat(1.414);
send("receiver", 4, 99);            /* Instance 4; type 99    */

/* Receiving Process */
/* ----------------- */
char msg1[32], msg2[4];
int num; float sqnum;

recv(99);                           /* Receive msg of type 99 */
getstring(msg1);                    /* Extract values in      */
getint(&num);                       /* a machine-specific     */
getstring(msg2);                    /* manner                 */
getfloat(&sqnum);

Figure 8.6: PVM user data transfer.
/* Process A */
/* --------- */
if (shmget("matrx", 1024)) error();          /* Allocation failure     */
while (shmatfloat("matrx", fp, "RW", 5));    /* Try to lock & map seg  */
for (i = 0; i < 256; i++) *fp++ = a[i];      /* Fill in shmem segment  */
shmdtfloat("matrx");                         /* Unlock & unmap region  */

/* Process B */
/* --------- */
while (shmatfloat("matrx", fp, "RU", 5));    /* Lock & map; note: reader */
                                             /* may lock before writer   */
for (i = 0; i < 256; i++) a[i] = *fp++;      /* Read out values          */
shmdtfloat("matrx");                         /* Unlock & unmap region    */
shmfree("matrx");                            /* Deallocate mem segment   */

Figure 8.7: PVM shared memory communication
In this example, the shmatfloat() function is used because the segment stores an array
of floating point numbers. After the shared memory segment is attached, it is accessed
using the pointer passed to the shmat function. Once the process is done accessing the
shared memory, it must perform a detach operation to release the memory and allow
another process to access it. The detach function used is the equivalent of the attach
function; for example, if shmatint() is used, shmdtint() should be used. Finally, the
last process to access the shared memory segment should free up the memory by calling
the shmfree() function. PVM also has lock facilities that work in a similar manner. In
general, the PVM communication functions are similar to the standard UNIX library
functions for inter-process communication, and therefore should simplify porting
multiprocess single-workstation programs into a multiprocessor environment.
Finally, HeNCE is available to assist the application programmer in developing an
application for a group of heterogeneous computers on a network. HeNCE is based on
having the programmer explicitly specify the parallelism of the application. HeNCE uses
a graphical interface and a graph of how the functions are interrelated to define the
parallelism in the application.
The HeNCE graphs have constructs for loops, fans, pipes, and conditional execution. An
application consists of a graph of the functions and the source code for the functions.
HeNCE then obtains the parameters of the functions using HeNCE library routines (which
use PVM to initiate processes and for communication). HeNCE creates 'wrappers' for each
function and compiles the wrappers into the final executable. The wrappers perform all
of the process initiation and communication. HeNCE achieves fault tolerance through
checkpointing [11]. HeNCE is an interesting tool, and its proposed future enhancements
will allow writing hierarchical graphs (graphs containing graphs).
8.2.5 Discussion
It is interesting to see how diverse these programming environments are. Linda tries to
make the distributed system look like a shared memory system. ISIS concentrated on
reliability, detecting node failures, and recovering from them. Express and PVM
concentrated on getting the best performance possible and on migrating code already
written, but even Express and PVM went about that objective differently: Express uses
its modes to perform synchronization transparently, while PVM uses functions that have a
lot in common with the UNIX inter-process communication functions.
Due to these diverse goals, it is difficult to judge any one environment to be best for
all applications. When comparing any of the four to writing a parallel program using RPC
calls, all four provide valuable facilities that make writing the application easier.
When writing an application using RPC calls, the application programmer must figure out
a way to start up the remote processes (possibly using 'rsh' calls) and needs to handle
starting communication processes if asynchronous communication is to be used. Another
complication that all four of these environments help with is the problem of ending the
processes cleanly in an orderly fashion. This is complicated when using RPC because all
communication is synchronous; therefore, it is sometimes difficult to distinguish between
lost packets, processes that died prematurely, and processes that ended normally, without
writing a significant amount of code.
8.3 Distributed Shared Memory
A Distributed Shared Memory (DSM) system is a mechanism that provides a virtual address
space shared among processes running on loosely coupled computing systems. There are two
kinds of parallel computers: tightly-coupled shared-memory multiprocessors and
loosely-coupled distributed-memory multiprocessors. A shared-memory system simplifies
the programming task since it is a natural extension of a single-CPU system. However,
this type of multiprocessor has a serious bottleneck: main memory is usually accessed via
a common bus, a serialization point that limits system size to a few processors.
Distributed-memory systems, on the other hand, scale up very smoothly provided the
designers choose the network topology carefully. However, the programming model is
limited to the message-passing paradigm, since the processors communicate with each
other by exchanging messages via the communication network. This implies that the
application programmers must take care of the information exchange among the processes
by using the communication primitives explicitly, as described in the previous section.
The distributed shared memory scheme tries to combine the advantages of both systems by
providing a virtual address space shared among processes running on a distributed-memory
system. The shared-memory abstraction gives these systems the illusion of physically
shared memory and allows programmers to use the shared-memory paradigm.
In this section, we highlight the fundamental issues in designing and implementing a
distributed shared memory system and identify the parameters influencing the performance
of DSM systems. After discussing the key issues and algorithms for DSM systems, a few
existing DSM systems are discussed briefly.
Traditionally, communication among processes in a distributed system is based on the
data-passing model. Message-passing systems or systems that support remote procedure
calls adhere to this model. The data-passing model logically extends the underlying
communication mechanism of the system; port or mailbox abstractions along with
primitives such as Send and Receive are used for interprocess communication. This
functionality can also be hidden in language-level constructs, as with RPC mechanisms.
In either case, distributed processes pass shared information by value.
In contrast to the data-passing model, the shared memory model provides processes in a
system with a shared address space. Application programs can use this space in the same
way they use normal local memory; that is, data in the shared space is accessed through
Read and Write operations. As a result, applications can pass shared information by
reference. The shared memory model is natural for distributed computations running on
tightly-coupled systems. For loosely coupled systems, no physically shared memory is
available to support such a model (see Figure 8.8). However, a layer of software or a
modification of the hardware can provide a shared-memory abstraction to the applications.
The shared memory model applied to loosely coupled systems is referred to as Distributed
Shared Memory (DSM).
The advantages offered by DSM include the ease of programming and portability achieved
through the shared-memory programming paradigm, the low cost of distributed-memory
machines, and the scalability resulting from the absence of a hardware bottleneck. These
advantages have made distributed shared memory the focus of recent study and have
prompted the development of various algorithms for implementing the shared data model.
However, to be successful as a programming model on loosely coupled systems, the
performance of this scheme must be at least comparable to that of the message passing
paradigm.
DSM systems have goals similar to those of CPU cache memories in shared-memory
multiprocessors, local memories in shared-memory multiprocessors with nonuniform memory
access (NUMA) times, distributed caching in network file systems, and distributed
databases. In particular, they all attempt to minimize the access time to potentially
shared data that is to be kept consistent. Consequently, many of the algorithmic issues
that must be addressed in these systems are similar. Although these systems therefore
often use algorithms that appear similar from a distance, their details and
implementations can vary significantly because of differences in the cost parameters and
in the way they are used. For example, in NUMA multiprocessors, the memories are
physically shared and the time differential between accesses of local and remote memory
is lower than in distributed systems, as is the cost of transferring a block of data
between the local memories of two processors.
Distributed shared memory has been implemented on top of physically non-shared memory
architectures. The distributed shared memory described here is a layer of software
providing a logically shared memory. As Figure 8.8 shows, distributed shared memory
(DSM) provides a virtual address space shared among processes in a loosely coupled
distributed multiprocessor system.
8.3.1 DSM Advantages
The advantages offered by DSM are the ease and portability of programming. It is a
simple abstraction which provides the application programmer with a single virtual
address space. Hence a program written sequentially can be moved to a distributed system
without drastic changes. DSM hides the details of the remote communication mechanism
from processes, so that programmers do not have to worry about the data movements or
the marshalling and unmarshalling procedures in the program. This simplifies the
programming task substantially. Another important advantage is that complex data
structures, such as those containing pointers, can be passed by reference in DSM. On the
contrary, in the data-passing model, complex data structures have to be transformed to
fit the format of the communication primitives.
Figure 8.8: Distributed Shared Memory System (CPUs with local memories connected by a
network, presenting a single shared memory address space).
One can view DSM as a step toward enhancing the transparency provided by a distributed
system.
8.3.2 DSM Design Issues
A survey of current DSM systems was conducted by Nitzberg et al. [13], who identified the
important issues in designing and implementing DSM. In this subsection, we focus on these
issues, which must be addressed by the designers of DSM systems. A DSM designer must make
choices regarding memory structure, granularity, access, coherence semantics, scalability,
and heterogeneity.
Structure and granularity:
The structure and granularity of a DSM system are closely related. Structure refers to
the layout of the shared data in memory. Most DSM systems do not structure memory, but
some structure the data as objects, language types, or even an associative memory.
Granularity refers to the size of the unit of sharing: byte, word, page, or complex
structure. In systems implemented using the virtual memory hardware of the underlying
architecture, it is convenient to choose a multiple of the hardware page size as the unit
of sharing, which normally ranges from 256 bytes to 8 Kbytes. Ivy [14], one of the first
transparent DSM systems, chose a granularity of 1 Kbyte. Hardware implementations of DSM
typically support smaller grain sizes; for example, Dash used 16 bytes as the unit of
sharing. Because shared-memory programs exhibit locality of reference, a process is
likely to access a large region of its shared address space in a small amount of time.
Therefore, large page sizes reduce paging overhead. However, sharing may also cause
contention, and the larger the page size, the greater the likelihood that more than one
process will require access to a page. A smaller page reduces the possibility of false
sharing, which occurs when two unrelated variables are placed in the same page. To avoid
thrashing, which can result from false sharing, structuring the memory was adopted in
some of the DSM systems. Munin [17] and the Shared Data-Object Model structure the memory
as variables in the source languages. Emerald, Choice, and Clouds [18] structure the
memory as objects.
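As a concrete illustration of false sharing (a sketch with invented names; the 1-Kbyte
page size follows the Ivy example above), two logically unrelated counters that happen to
fall within the same shared page will cause that page to bounce between the two writing
nodes even though the processes never touch each other's data:

#define DSM_PAGE_SIZE 1024

/* Both counters end up inside the same 1-Kbyte shared page. */
struct shared_page {
    int  counter_a;     /* updated only by the process on node A */
    int  counter_b;     /* updated only by the process on node B */
    char pad[DSM_PAGE_SIZE - 2 * sizeof(int)];
};

/* Every increment by node A invalidates node B's copy of the page and vice
 * versa, so the page migrates back and forth (thrashing) even though the two
 * processes share no data at all. */
void work_a(struct shared_page *p) { p->counter_a++; }
void work_b(struct shared_page *p) { p->counter_b++; }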
Scalability:
A theoretical benefit of DSM systems is that they scale better, since the systems are
based on loosely coupled machines. However, the limits of scalability can be greatly
reduced by two factors: central bottlenecks and operations that require global common
knowledge and storage. To avoid these factors, the shared memory and the related
information should be distributed among the processors as evenly as possible, and a
central memory manager is not preferred in terms of scalability.
Heterogeneity:
Sharing memory between two different types of machines seems infeasible because of the
overhead of translating the internal data representations. It is somewhat easier if the
DSM system is structured as variables or objects in the source language. In Agora, memory
is structured as objects shared among heterogeneous machines. Mermaid explores another
approach: memory is shared in pages, and a page can contain only one type of data.
Although it works fairly well, requiring each page to contain only one type of data is
rather rigid. Generally speaking, the overhead of conversion seems to outweigh the
benefits gained from more computing power.
In distributed shared memory, memory mapping managers, sometimes called distributed shared memory controllers (DSMCs), implement the mapping between local memories and the shared virtual memory address space. A DSMC must automatically transform shared memory accesses into interprocess communication. Beyond mapping, its chief responsibility is to keep the address space coherent at all times; that is, the value returned by a read operation is always the same as the value written by the most recent write operation to the same address. A shared virtual memory address space is partitioned into pages. Pages that are marked read-only can have copies residing in the physical memories of many processors at the same time, but a page marked writable can reside in only one physical memory. The memory mapping manager views its local memory as a large cache of the shared virtual memory address space for its associated processor. As in traditional virtual memory, the shared memory itself exists only virtually. A memory reference causes a page fault when the page containing the memory location is not in the processor's current physical memory. When this happens, the memory mapping manager retrieves the page from either the disk or the memory of another processor. If the page of the faulting memory reference has copies on other processors, the memory mapping manager must do some work to keep the memory coherent before resuming the faulting instruction. Another key goal of distributed shared memory is to allow processes to execute on different processors in parallel; to do so, the appropriate process management must be integrated with the DSMC.
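The following C sketch outlines, in very simplified form, the fault handling just described: on a fault the mapping manager obtains a copy of the page and, for a write, makes sure no other copy survives. All names, the page-table layout, and the stubbed communication routines are illustrative assumptions rather than any particular system's interface.

/* A minimal sketch, assuming a small fixed page table and stubbed
 * "network" operations, of DSM page-fault handling. */
#include <stdio.h>

#define NUM_PAGES 16

typedef enum { NO_ACCESS, READ_ONLY, READ_WRITE } access_t;

typedef struct {
    access_t access;     /* local access right                   */
    int      resident;   /* nonzero if a copy is in local memory */
} page_entry_t;

static page_entry_t page_table[NUM_PAGES];

/* stand-ins for the real interprocess communication */
static void fetch_copy_from_owner(int page)    { printf("fetch page %d from its owner\n", page); }
static void invalidate_remote_copies(int page) { printf("invalidate remote copies of page %d\n", page); }

/* Called when a reference to 'page' faults (absent page or wrong access). */
static void dsm_page_fault(int page, int is_write)
{
    page_entry_t *pe = &page_table[page];

    if (!pe->resident)
        fetch_copy_from_owner(page);      /* from disk or another processor */

    if (is_write) {
        invalidate_remote_copies(page);   /* a writable page exists in only one memory */
        pe->access = READ_WRITE;
    } else {
        pe->access = READ_ONLY;           /* read-only pages may be replicated */
    }
    pe->resident = 1;
    /* the faulting instruction is then resumed */
}

int main(void)
{
    dsm_page_fault(3, 0);   /* read fault  */
    dsm_page_fault(3, 1);   /* write fault */
    return 0;
}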
Coherence Semantics
For programmers to write correct programs on a shared memory machine, they must understand how parallel memory updates are propagated throughout the system. The most intuitive semantics for memory coherence is strict consistency: a read operation returns the most recently written value. However, strict consistency implies that accesses to the same memory location must be serialized, which can become a bottleneck for the whole system. To improve performance, some systems provide only a reduced form of memory coherence.
Table 8.2: Four distributed shared memory algorithms.

                   Nonreplicated      Replicated
   Non-migrating   Central-server     Full-replication
   Migrating       Migration          Read-replication
Relaxed coherence semantics allow more efficient shared access because they require less synchronization and less data movement. However, programs that depend on a stronger form of consistency may not execute correctly on a system that supports only a weaker coherence. Figure 8.9 gives intuitive definitions of the various forms of coherence semantics.
A single-CPU multitasking machine supports strict consistency; a shared memory system whose data accesses are buffered can support only sequential consistency. As an illustration, consider the diagram shown in Figure 8.10 [17]. R1 through R3 and W0 through W4 represent successive reads and writes, respectively, of the same data, and A, B, and C are threads attempting to access that data. Strict consistency requires that thread C at R1 read the value written by thread B at W2, and that thread C at R2 and R3 read the value written by thread B at W4. Sequential consistency, on the other hand, requires only that thread C at R1 and R2 read values written at any of W0 through W4 such that the value read at R2 does not logically precede the value read at R1, and that thread C at R3 read either the value written by thread A at W3 or by thread B at W4.
DSM Algorithms
Among the factors that affect the performance of a DSM algorithm, migration and replication are the two most important parameters. The properties of these two factors were investigated by Zhou et al. [12], who used combinations of the two factors to develop four kinds of DSM algorithms: central-server, migration, read-replication, and full-replication. These algorithms all support strict consistency. Table 8.2 characterizes the four algorithms in terms of migration and replication.
Central-Server
The central-server strategy uses a central server that is responsible for servicing all accesses to shared data. Both read and write operations involve sending a request message to the data server by the process executing the operation, as depicted in Figure 8.11.
Figure 8.9: Memory coherence semantics.

Strict consistency: a read returns the most recently written value.

Sequential consistency: the result of any execution appears as some interleaving of the operations of the individual nodes when executed on a multithreaded sequential machine.

Processor consistency: writes issued by each individual node are never seen out of order, but the order of writes from two different nodes can be observed differently.

Weak consistency: the programmer enforces consistency using synchronization operators guaranteed to be sequentially consistent.

Release consistency: weak consistency with two types of synchronization operators, acquire and release; each type of operator is guaranteed to be processor consistent.
Figure 8.10: A timing diagram for multiple memory requests to shared data (writes W0 through W4, reads R1 through R3, threads A, B, and C, with two SYNC points).
The server receives each request and responds either with the data, in the case of a read request, or with an acknowledgment, in the case of a write request. This simple model can be realized with request-and-respond communication. For reliability, a request is retransmitted after a timeout period. The server keeps a sequence number for each request so that it can detect duplicates and acknowledge them correctly. A failure is reported if there is no response after several timeout periods.
The most important drawback of this strategy is that the central server itself can become a bottleneck, since it must handle requests sequentially. Even when referencing data that is stored locally, a client still must send a request to the server, which increases communication activity. To distribute the server load, the shared data can be spread over several data servers, and clients can use multicast to locate the right data server; in this case, however, the servers still have to handle the same total number of requests. A better solution is to partition the data by address and distribute the partitions to several hosts; clients then need only a simple mapping function to find the correct server.
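A minimal sketch of that last idea, partitioning data by address: each client maps a shared address to one of several data servers with a fixed function, so no multicast lookup is needed. The block size and server count below are illustrative assumptions.

/* Hedged sketch of address-partitioned data servers. */
#include <stdio.h>

#define BLOCK_SIZE   1024   /* assumed size of a shared data block */
#define NUM_SERVERS  4      /* assumed number of data servers      */

/* Which server is responsible for the block containing 'addr'? */
static int server_for(unsigned long addr)
{
    unsigned long block = addr / BLOCK_SIZE;
    return (int)(block % NUM_SERVERS);
}

int main(void)
{
    unsigned long addrs[] = { 0, 1023, 1024, 9000, 123456 };
    for (int i = 0; i < 5; i++)
        printf("address %lu -> server %d\n", addrs[i], server_for(addrs[i]));
    return 0;
}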
Migration
In this technique, the data is always migrated to the site where it is accessed, regardless of the type of operation, read or write (see Figure 8.12). This is a single-reader/single-writer (SRSW) protocol, and data is usually migrated in blocks. In Zhou et al. [12], a manager is assigned statically to each block and is responsible for recording the current location of that block. A client queries the manager of the block both to determine the current location of the data and to inform the manager that it will be the new owner of that block. If an application exhibits high locality of reference, the cost of block migration is amortized over multiple accesses. Another advantage of this scheme is that it can be integrated with the virtual memory system of the host.
Figure 8.11: Central server. (1) Client A requests data from the central server; (2) the central server sends the data to client A.
Figure 8.12: Migration. (1) Client A determines the location of the desired data and sends a migration request; (2) client C migrates the data to client A as requested.
Figure 8.13: Read replication (write operation). (1) Client A determines the location of the desired data and sends a request for that page; (2) client C sends the requested data; (3) client A multicasts an invalidation for that block.
Read Replication
Neither of the previous approaches takes advantage of parallel processing. Replication can be added to the migration algorithm by allowing either one site to hold a read/write copy of a block or multiple sites to hold read-only copies. This type of replication is referred to as multiple-reader/single-writer (MRSW) replication. Replication can reduce the average cost of read operations, since it allows read operations to be executed simultaneously and locally at multiple hosts. However, some write operations may become more expensive, since many replicas may have to be invalidated or updated to preserve data consistency. Replication is worthwhile if the ratio of read operations to write operations is large.
For a read operation on data that is not local, it is necessary to acquire a readable copy of the block containing that data and to inform the site holding the writable copy to change it to read-only status before the read operation can execute. For a write fault, caused either by not having the block or by not having write access to it, all copies of the same block must be invalidated before the write can proceed. Figure 8.13 shows the write-fault operation of this algorithm. The main advantage of this method is that it reduces the average cost of read operations; however, the overhead of invalidation on write faults is not justified if the ratio of reads to writes is small. This strategy resembles the write-invalidate algorithm for cache consistency implemented in hardware in some multiprocessors.
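The following sketch illustrates the MRSW write-fault path in simplified form: the faulting site obtains the block if necessary, invalidates every other copy, and only then proceeds with the write. The copy-set representation and the messaging stubs are assumptions made for illustration.

/* A minimal sketch of MRSW write-fault handling. */
#include <stdio.h>

#define MAX_SITES 8

typedef struct {
    int has_copy[MAX_SITES];   /* which sites hold a copy of the block */
    int owner;                 /* site holding the writable copy       */
} block_state_t;

static void send_invalidate(int site) { printf("invalidate sent to site %d\n", site); }
static void fetch_block(int from)     { printf("block fetched from site %d\n", from); }

/* Called at 'self' when a write fault occurs on a block. */
static void write_fault(block_state_t *b, int self)
{
    if (!b->has_copy[self])
        fetch_block(b->owner);                   /* obtain a copy first */

    for (int s = 0; s < MAX_SITES; s++)          /* invalidate all other copies */
        if (s != self && b->has_copy[s]) {
            send_invalidate(s);
            b->has_copy[s] = 0;
        }

    b->owner = self;                             /* single writer from now on */
    b->has_copy[self] = 1;
    printf("site %d may now write\n", self);
}

int main(void)
{
    block_state_t b = { .has_copy = {1, 1, 1, 0, 0, 0, 0, 0}, .owner = 0 };
    write_fault(&b, 3);
    return 0;
}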
Figure 8.14: Full replication (write operation). (1) Client A wants to write and sends the data to the sequencer; (2) the sequencer orders the write request and multicasts the write data; (3) the local memory of all clients is updated.
Full Replication
The full-replication algorithm allows data blocks to be replicated even while they are being written; in other words, it is a multiple-reader/multiple-writer (MRMW) protocol. One possible way to keep the replicated data consistent is to globally sequence the write operations. In one implementation, when a process attempts to write, its intention is sent to a sequencer. The sequencer assigns the next sequence number to the request and multicasts the request to all the replicated sites, and each site processes write operations in sequence-number order. If the sequence number of an incoming request is not the one a site expects, the site asks the sender for a retransmission; this is implemented with negative acknowledgments and a log of previous requests. Figure 8.14 shows the write operation of this method.
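A small sketch of the sequencing idea, under assumed names and data structures: the sequencer hands out consecutive sequence numbers, and a replica applies a write only when it carries the number the replica expects, otherwise answering with a negative acknowledgment so the write can be retransmitted.

/* Hedged sketch of globally sequenced writes for full replication. */
#include <stdio.h>

static int next_seq = 0;                 /* state kept by the sequencer       */

static int sequencer_assign(void)        /* sequencer: order a write request  */
{
    return next_seq++;                   /* then multicast (seq, data) to all */
}

typedef struct { int expected_seq; int value; } replica_t;

/* Replica side: apply the write only if it carries the expected number. */
static int replica_apply(replica_t *r, int seq, int value)
{
    if (seq != r->expected_seq) {
        printf("replica expected %d, got %d: request retransmission\n",
               r->expected_seq, seq);
        return -1;                        /* negative acknowledgment */
    }
    r->value = value;
    r->expected_seq++;
    return 0;
}

int main(void)
{
    replica_t r = { 0, 0 };
    int s1 = sequencer_assign();          /* write #0 */
    int s2 = sequencer_assign();          /* write #1 */
    replica_apply(&r, s2, 42);            /* arrives out of order -> NAK      */
    replica_apply(&r, s1, 17);            /* applied                          */
    replica_apply(&r, s2, 42);            /* retransmitted copy, now applied  */
    return 0;
}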
Discussion
The performance of a parallel program on distributed shared memory depends primarily on two factors: the number of parallel processes in the program and how frequently the shared data is updated.
The central-server algorithm is the simplest implementation and may be sufficient for infrequent accesses to shared data, especially if the read/write ratio is low (that is, a high percentage of accesses are writes). This is often the case with locks, as will be discussed further below. In fact, locality of reference and a high block-hit ratio are present in a wide range of applications, making block migration and replication favorable. Although block migration is advantageous in such cases, the cost of the simple migration algorithm can be high. A potentially serious
performance problem with these algorithms is block thrashing. For migration, thrashing takes the form of data moving back and forth in quick succession when interleaved accesses are made by two or more sites; the algorithm thus fails to exploit the merits of parallel processing. In contrast, the read-replication algorithm offers a good compromise for many applications. However, if read and write accesses interleave frequently, thrashing takes the form of blocks with read-only permission being repeatedly invalidated soon after they are replicated, so the algorithm does not exploit locality to its full extent. The full-replication algorithm is suitable for small-scale replication and infrequent updates. Such situations indicate poor (site) locality in references. For many applications, shared data can be allocated and the computation partitioned such that thrashing is minimized. Application-controlled locks can also be used to suppress thrashing. In either case, the complete transparency of the distributed shared memory is somewhat compromised.
Zhou et al. [12] conducted a series of theoretical comparative analyses of the above four algorithms under some assumptions about the environment. Below is a summary of this comparative study:
Migration vs. read replication: Typically, read replication effectively reduces the block fault rate because, in contrast to the migration algorithm, interleaved read accesses to the same block no longer cause faults. Therefore, one can expect read replication to outperform migration for the vast majority of applications.
Read replication vs. full replication: The relative performance of these two algorithms depends on a number of factors, including the degree of replication, the read/write ratio, and the degree of locality achievable in read accesses. Generally speaking, full replication performs poorly for large systems and when the update frequency is high.
Central server vs. full replication: When the number of sites is small and the read/write ratio is fairly high, full replication performs better; but as the number of sites increases, the update costs of full replication catch up, and the preference turns to the central server.
8.3.3 DSM Implementation Issues
Having been studied extensively since the early 1980s, DSM systems have been implemented using three basic approaches (some systems use more than one approach):
1. hardware implementations that extend traditional caching techniques to scalable architectures;
2. operating system and library implementations that achieve sharing and coherence through virtual memory management mechanisms;
3. compiler implementations in which shared accesses are automatically converted into synchronization and coherence primitives.
For example, the Dash system, developed at Stanford University, is a hardware implementation based on a two-level mesh structure, and MemNet, developed at the University of Delaware, was built on a 200-Mbps token ring network. Plus, developed at Carnegie Mellon University, is a mixed hardware and software implementation of DSM.
Most distributed systems under development or already developed, such as Amoeba, Clouds, Mirage, and the V system, use operating system and library implementations. Others, such as Mermaid and Linda, combine approaches (2) and (3).
Heterogeneity
Some designers are quite ambitious in trying to integrate several heterogeneous machines in a distributed system. The major hindrance here is that each machine may use a different representation for basic data types. It is better to handle this at the DSM compiler level, which can convert between the representations used on different machines. However, the overhead of conversion often seems to outweigh the benefits.
Dynamic centralized manager algorithm
The simplest way to locate data is to have a centralized server that keeps track of all shared data. In this scheme, a page does not have a fixed owner, and only the manager (the centralized server) knows who the owner is. The centralized manager resides on one node and maintains an information table with one entry per page, each entry having three fields:
1. Owner field: indicates which node owns this page.
2. Copy list field: lists all nodes that have copies of this page.
3. Lock field: indicates whether the requested page is currently being accessed.
As long as a read copy exists, the page is not writable without an invalidation operation. It is beneficial for a node that successfully writes to a page to take over ownership of that page. When a node finishes a read or write request, a confirmation message is sent to the manager to indicate the completion of the request. The centralized method suffers from two drawbacks: (1) the server serializes location queries, reducing parallelism, and (2) the server may become heavily loaded and slow down the entire system.
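The sketch below shows one plausible shape for the manager's information table and its handling of a read request; the fixed table size, the bitmask copy list, and the printed messages are illustrative assumptions, not the actual structure of any system mentioned here.

/* Hedged sketch of a centralized manager's per-page information table. */
#include <stdio.h>

#define NUM_PAGES 64

typedef struct {
    int          owner;      /* node that owns (last wrote) this page  */
    unsigned int copy_list;  /* bit i set => node i holds a read copy  */
    int          locked;     /* nonzero while a request is in progress */
} info_entry_t;

static info_entry_t info[NUM_PAGES];

/* Manager-side handling of a read request for 'page' from 'node'. */
static void handle_read_request(int page, int node)
{
    info_entry_t *e = &info[page];
    if (e->locked) { printf("page %d busy; request queued\n", page); return; }
    e->locked = 1;
    printf("ask owner %d to send page %d to node %d\n", e->owner, page, node);
    e->copy_list |= (1u << node);   /* node now holds a read copy */
    /* the lock is cleared when the node's confirmation message arrives */
}

int main(void)
{
    info[5].owner = 2;
    handle_read_request(5, 7);
    return 0;
}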
Fixed distributed manager algorithm
For large N, the manager processor might become a bottleneck because it must respond to every page fault. Instead of using a centralized manager, a system can distribute the manager load evenly over several nodes in a fixed manner, for example with the mapping

    H(p) = (p / s) mod N                                               (8.1)

where p is the page number, N is the number of processors, and s is the number of pages per segment. With this approach, when a fault occurs on page p, the faulting processor asks processor H(p) where the page owner is, and then proceeds with the same sequence of steps as in the centralized manager algorithm.
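Equation (8.1) written out as code, under the assumption that it maps page p to manager (p / s) mod N; the segment size and processor count used below are arbitrary illustrative values.

/* Sketch of the fixed distribution function H(p) = (p / s) mod N. */
#include <stdio.h>

static int manager_of(int page, int pages_per_segment, int num_processors)
{
    return (page / pages_per_segment) % num_processors;
}

int main(void)
{
    for (int p = 0; p < 12; p++)
        printf("page %2d -> manager %d\n", p, manager_of(p, 4, 3));
    return 0;
}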
Broadcast distributed manager algorithm
With this strategy, each processor manages the pages that it owns. When a read fault occurs, the faulting processor p sends a broadcast read request, and the true owner responds by adding p to its copy list and sending a copy of the page to p. Similarly, when a write fault occurs, the faulting processor sends a broadcast write request, and the owner gives up ownership and migrates the page and the copy list to the requester. When the faulting processor receives ownership, it invalidates all the copies.
Dynamic distributed manager algorithm
The heart of the dynamic distributed manager algorithm is keeping track of the ownership of all pages in each processor's local page table. The owner field of each entry is replaced by a probable-owner field, which is initially set to the initial owner of the page. In this algorithm, when a processor has a page fault, it sends a request to the processor indicated by the probable-owner field. If that processor is not the true owner, it forwards the request to the processor indicated by its own probable-owner field. This procedure continues until the true owner is found.
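A compact sketch of the probable-owner chase: a fault follows the probOwner links from processor to processor until a processor that points to itself, taken here as the owner, is reached. The array representation stands in for the per-processor page tables and is an assumption for illustration only.

/* Hedged sketch of chasing probable-owner links for one page. */
#include <stdio.h>

#define NUM_PROCS 6

/* prob_owner[i] = processor that processor i believes owns the page */
static int prob_owner[NUM_PROCS] = { 1, 3, 3, 3, 3, 0 };

static int find_owner(int start, int page)
{
    int p = start;
    while (prob_owner[p] != p) {        /* convention: the owner points to itself */
        printf("processor %d forwards request for page %d to %d\n",
               p, page, prob_owner[p]);
        p = prob_owner[p];
    }
    return p;                           /* the true owner */
}

int main(void)
{
    printf("true owner: %d\n", find_owner(5, 42));
    return 0;
}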
Coherence Protocol
All systems support some level of coherence. From the programmer's point of view, it would be best for the system to support strict consistency. However, that would require all accesses to the same shared data to be serialized, which would degrade the performance of the system. Moreover, a parallel program specifies only a partial order, not a linear order, on the events within the program. A relaxed coherence
would be sufficient for parallel applications. For example, Munin realizes only weak consistency, while Dash supports release consistency. To further increase parallelism, virtually all DSM systems replicate data. Two types of protocols for handling write faults, write-invalidate and write-update, can be used to enforce coherence. The write-invalidate method is the same as that used in read replication, and write-update is the same as that used in full replication. Most DSM systems use write-invalidate coherence protocols; the main reason, as suggested by Li [14], is the lack of hardware support and the inefficiency caused by network latency. However, a hardware implementation of DSM, such as Plus, can use write-update freely. Munin uses type-specific coherence, a scheme tailored to the different access patterns of data; for example, Munin uses write-update to keep coherent data that is read much more frequently than it is written.
Page Replacement Policy
In systems that allow data to migrate around the system, two problems arise when the space available for "caching" shared data fills up: which data should be replaced to free space, and where should it go? In choosing the data item to be replaced, a DSM system works much like the caching system of a shared-memory multiprocessor. However, unlike most caching systems, which use a simple least-recently-used or random replacement strategy, most DSM systems differentiate the status of data items and prioritize them. For example, priority is given to shared items over exclusively owned items, because the latter have to be transferred over the network; a read-only shared copy of a data item can simply be deleted, since no data is lost.
Once a piece of data is to be replaced, the system must make sure it is not lost. In the caching system of a multiprocessor, the item would simply be placed in main memory. Some DSM systems, such as MemNet, use an equivalent scheme: the system transfers the data item to a "home node" that has statically allocated space to store a copy of the item when it is not needed elsewhere in the system. This method is simple to implement, but it wastes a lot of memory. An improvement is to have the node that wants to delete the item simply page it out to disk. Although this does not waste any memory space, it is time consuming. Because it may be faster to transfer something over the network than to a disk, a better solution is to keep track of free memory in the system and simply page the item out to a node with space available for it.
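The following sketch captures the replacement preference described above: a read-only copy that exists elsewhere can simply be dropped, whereas an exclusively owned block must first be transferred somewhere. The data layout and helper names are illustrative assumptions.

/* Hedged sketch of a prioritized replacement choice. */
#include <stdio.h>

typedef enum { READ_COPY, EXCLUSIVE } status_t;

typedef struct { int id; status_t status; } cached_block_t;

static void transfer_to_other_node(int id) { printf("block %d paged out to another node\n", id); }

/* Pick a victim among 'n' cached blocks, preferring droppable read copies. */
static int evict_one(cached_block_t *blocks, int n)
{
    for (int i = 0; i < n; i++)
        if (blocks[i].status == READ_COPY) {
            printf("block %d: read copy dropped, no data lost\n", blocks[i].id);
            return i;
        }
    /* no read copy available: must move an exclusively owned block */
    transfer_to_other_node(blocks[0].id);
    return 0;
}

int main(void)
{
    cached_block_t cache[] = { {10, EXCLUSIVE}, {11, READ_COPY}, {12, EXCLUSIVE} };
    evict_one(cache, 3);
    return 0;
}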
Thrashing
DSM systems are particularly prone to thrashing. For example, if two nodes compete for write access to a single data item, it may be transferred back and forth at such a high rate that no real work can get done (a ping-pong effect). To avoid thrashing
with two competing writers, a programmer could specify the type as write-many, and the system would then use a delayed write policy.
When nodes compete for access to the same page, the ping-pong effect can be stopped by adding a dynamically tunable parameter to the coherence protocol. This parameter determines the minimum amount of time a page will remain at one node.
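A minimal sketch of such a tunable parameter, here called delta and expressed in seconds: a node refuses to give up a page until it has held it for at least that long. The time source and the value of delta are assumptions for illustration.

/* Hedged sketch of a minimum-residence-time check against ping-ponging. */
#include <stdio.h>
#include <time.h>

#define DELTA_SECONDS 2   /* assumed minimum time a page stays at one node */

typedef struct { time_t acquired_at; } page_hold_t;

/* Should a remote transfer request for this page be honored now? */
static int may_release(const page_hold_t *h, time_t now)
{
    return (now - h->acquired_at) >= DELTA_SECONDS;
}

int main(void)
{
    page_hold_t h = { .acquired_at = time(NULL) };
    time_t now = time(NULL);
    printf("release now? %s\n", may_release(&h, now) ? "yes" : "no (defer request)");
    return 0;
}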
Related algorithms
To support a DSM system, synchronization operations and memory management must be specially tuned. Semaphores, for example, are typically implemented on shared memory systems using spin locks. In a DSM system, a spin lock can easily cause thrashing, because multiple nodes may heavily access the shared data it protects. For better performance, some systems provide specialized synchronization primitives along with DSM. Clouds provides semaphore operations by grouping semaphores into centrally managed segments. Munin supports a synchronization memory type with distributed locks. Plus supplies a variety of synchronization instructions and supports delayed execution, in which a synchronization operation can be initiated and later tested for successful completion.
Memory management can also be restructured for DSM. A typical memory allocation scheme (as in the C library malloc()) allocates memory out of a common pool, where a search of all shared memory can be expensive. A better approach is to partition the available memory into private buffers on each node and to allocate memory from the global buffer space only when the private buffer is full.
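A hedged sketch of that allocation strategy: requests are served from a per-node private buffer and fall back to the (expensive, shared) global pool only when the private buffer is exhausted. Sizes and helper names are illustrative assumptions.

/* Sketch of per-node allocation with a global-pool fallback. */
#include <stdio.h>
#include <stdlib.h>

#define PRIVATE_BUF_SIZE 4096

static char   private_buf[PRIVATE_BUF_SIZE];   /* this node's private buffer */
static size_t private_used = 0;

static void *alloc_from_global_pool(size_t n)  /* stand-in for the shared pool */
{
    printf("falling back to global shared pool for %zu bytes\n", n);
    return malloc(n);
}

static void *dsm_alloc(size_t n)
{
    if (private_used + n <= PRIVATE_BUF_SIZE) {   /* cheap, purely local path */
        void *p = &private_buf[private_used];
        private_used += n;
        return p;
    }
    return alloc_from_global_pool(n);             /* expensive, involves sharing */
}

int main(void)
{
    void *a = dsm_alloc(1000);
    void *b = dsm_alloc(5000);                    /* too big: goes to the pool */
    printf("a=%p b=%p\n", a, b);
    free(b);
    return 0;
}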
The implementation issues discussed in this section are by no means complete. A good survey of these issues is given by Nitzberg and Lo [13], which presents the various options for the design parameters. Table 8.3 summarizes the design parameters and the options adopted in several DSM projects.
Table 8.3: Survey of DSM design parameters.

Dash
  Current implementation: hardware; modified Silicon Graphics Iris 4D/340 workstations, mesh
  Structure and granularity: 16 bytes
  Coherence semantics: release
  Coherence protocol: write-invalidate
  Sources of improved performance: relaxed coherence, prefetching
  Support for synchronization: queued locks, atomic incrementation and decrementation
  Heterogeneous support: no

Ivy
  Current implementation: software; Apollo workstations, Apollo ring, modified Aegis
  Structure and granularity: 1-Kbyte pages
  Coherence semantics: strict
  Coherence protocol: write-invalidate
  Sources of improved performance: pointer chain collapse, selective broadcast
  Support for synchronization: synchronized pages, semaphores, event counts
  Heterogeneous support: no

Linda
  Current implementation: software; variety of environments
  Structure and granularity: tuples
  Coherence semantics: no mutable data
  Coherence protocol: varied
  Sources of improved performance: hashing
  Support for synchronization: ?
  Heterogeneous support: no

Memnet
  Current implementation: hardware; token ring
  Structure and granularity: 32 bytes
  Coherence semantics: strict
  Coherence protocol: write-invalidate
  Sources of improved performance: vectored interrupt support of control flow
  Support for synchronization: ?
  Heterogeneous support: no

Mermaid
  Current implementation: software; Sun workstations, DEC Firefly multiprocessors, Mermaid/native operating system
  Structure and granularity: 8-Kbyte pages (Sun), 1-Kbyte pages (Firefly)
  Coherence semantics: strict
  Coherence protocol: write-invalidate
  Sources of improved performance: ?
  Support for synchronization: messages for semaphores and signal/wait
  Heterogeneous support: yes

Mirage
  Current implementation: software; VAX 11/750, Ethernet, Locus distributed operating system, Unix System V interface
  Structure and granularity: 512-byte pages
  Coherence semantics: strict
  Coherence protocol: write-invalidate
  Sources of improved performance: kernel-level implementation, time window coherence protocol
  Support for synchronization: Unix System V semaphores
  Heterogeneous support: no

Munin
  Current implementation: software; Sun workstations, Ethernet, Unix System V kernel and Presto parallel programming environment
  Structure and granularity: objects
  Coherence semantics: weak
  Coherence protocol: type-specific (delayed write-update for read-mostly protocol)
  Sources of improved performance: delayed update queue
  Support for synchronization: synchronized objects
  Heterogeneous support: no

Plus
  Current implementation: hardware and software; Motorola 88000, Caltech mesh, Plus kernel
  Structure and granularity: page for sharing, word for coherence
  Coherence semantics: processor
  Coherence protocol: nondemand write-update
  Sources of improved performance: delayed operations
  Support for synchronization: complex synchronization instructions
  Heterogeneous support: no

Shiva
  Current implementation: software; Intel iPSC/2 hypercube, Shiva/native operating system
  Structure and granularity: 4-Kbyte pages
  Coherence semantics: strict
  Coherence protocol: write-invalidate
  Sources of improved performance: data structure compaction, memory as backing store
  Support for synchronization: messages for semaphores and signal/wait
  Heterogeneous support: no
8.4 DSM Existing Systems
The design and implementation techniques discussed in the previous section are common to all DSM systems, but different systems implement them in a variety of ways. For instance, IVY is a software implementation of a DSM system. Another system, Clouds, uses concepts such as passive objects and threads of execution in its object-oriented approach. Hardware implementations of DSM also exist, as in the MemNet system. These systems are discussed briefly in the following subsections.
8.4.1 Ivy System
The Ivy system is one of the first DSM implementations; it provides strict consistency and a write-invalidate protocol. Li [14] presented two classes of algorithms, centralized and distributed, for solving the coherence problem, all of which use replication to enhance performance. A prototype based on these algorithms was implemented on an Apollo ring, and many existing DSM systems are modifications of Ivy.
Ivy supports strict consistency and a write-invalidate (MRSW) coherence protocol; the shared memory is partitioned into pages of 1 Kbyte. The memory mapping managers implement the mapping between local memories and the shared virtual memory address space. Beyond mapping, their chief responsibility is to keep the address space coherent at all times.
Li classified the algorithms along two dimensions: page synchronization and page ownership. The approaches to page synchronization are write-invalidate and write-update. As mentioned before, the authors argued that write-update is not feasible, since it requires special hardware support and network latency is high. The ownership of a page can be fixed or dynamic; the fixed strategy corresponds to algorithms that do not migrate the data, while the dynamic methods are those that do. The authors argued that the non-migration method is an expensive solution for existing loosely coupled multiprocessors and that it constrains desired modes of parallel computation. Thus they considered only algorithms with dynamic ownership and write invalidation, which correspond to the class of read-replication algorithms discussed previously. The authors further categorized read-replication algorithms into three sets: centralized manager, fixed distributed manager, and dynamic distributed manager. All of the above algorithms are described in terms of fault handlers, their servers, and the data structure on which they operate. The data structure, referred to as the page table, common to all the algorithms contains three items for each page:
1. access: indicates the accessibility of the page;
2. copy set: contains the numbers of the processors that have read copies of the page; and
3. lock: synchronizes multiple page faults by different processes on the same processor and synchronizes remote page requests.
Centralized Manager
The centralized manager is similar to a monitor. It resides on a single processor and maintains a table called Info, which has one entry for each page; each entry has three fields:
1. The owner field contains the single processor that owns the page, namely the most recent processor to have had write access to it.
2. The copy set field lists all processors that have copies of the page.
3. The lock field is used for synchronizing requests to the page.
Each processor keeps only two fields, access and lock, for each page. For a read fault, the processor asks the manager for read access to the page and for a copy of it. The manager is responsible for asking the page's owner to send a copy to the requesting node. Before the manager is ready to process the next request, it must receive a confirmation message from the current requesting node; the confirmation indicates completion of the request, so that the manager can give the page to someone else. Write faults are processed in the same manner, except that the manager must also have the read copies invalidated on behalf of the new owner. This algorithm can be improved by moving the synchronization of page ownership to the owners, thus eliminating the confirmation message to the manager. The copy set is then maintained by the current owner, and the locking mechanism on each processor deals not only with multiple local requests but also with remote requests.
Fixed Distributed Managers
In this method, every processor is given a predetermined subset of the pages to manage. The primary difficulty in such a scheme is choosing an appropriate mapping from pages to processors. The most straightforward approach is to distribute the pages evenly in a fixed manner to all processors. For example, given M pages and N processors, the hashing function H(p) = p mod N, where 0 < p <= M, can be used to distribute the pages. To realize this method, some input from the programmer or the compiler might be needed to indicate the properties of suitable mapping functions.
Dynamic Distributed Managers
The heart of a dynamic distributed manager algorithm is keeping track of the ownership of all pages in each processor's local page table. To do this, the owner field is replaced with another field, probOwner, whose value can be either the current owner or an old owner. As explained in the section on implementation issues, when a processor has a page fault, it sends a request to the processor indicated by the probOwner field for that page. If the receiver is the current owner, it replies to the requesting node with a copy of the page, as well as with the copy set if it is a write request; otherwise, it forwards the request to the processor indicated by its own probOwner field for that page.
8.4.2 Mirage
Mirage was implemented in the kernel of an early version of Locus that is compatible with the UNIX System V interface specifications [16]. The approach used in Mirage differs from Ivy in the following ways:
1. The model is based on paged segmentation; the page size is 512 bytes.
2. The unit of sharing is a segment, while the unit of coherence is a page.
3. A tuning parameter, delta, is provided to avoid thrashing.
4. It adopts the fixed distributed manager algorithm by assigning the creator of a segment as its manager or, in Mirage's terminology, the library site.
5. As mentioned above, it was implemented in the kernel of the operating system to improve performance.
6. The environment consists of three VAX 11/750s networked together using Ethernet, which is smaller than Ivy's environment.
8.4.3 Clouds
Clouds is an object-oriented DSM system that may seem unconventional in comparison to most other software or hardware implementations. It employs the concepts of threads and passive objects. Threads are flows of control that execute on Clouds objects; at any given time during its existence, a Clouds thread executes within the object it most recently invoked. Objects are logically cohesive groupings of data composed of variable-size segments. The Clouds system supports the mobility of segments, and therefore indirectly supports the mobility of objects.
When a thread on one machine invokes an object on another machine, one of two things can occur, depending on the specific implementation. One possibility is that the thread is reconstructed at the host containing the desired object; the reconstructed thread is then executed within the desired object, and the result is finally passed back to the calling thread (see Figure 8.15). The second possibility involves the migration of the invoked object to the host where the calling thread resides, after which the invocation is executed locally (see Figure 8.16).
Figure 8.15: Object invocation through thread reconstruction. (1) Thread P's information is sent to host B; (2) host B reconstructs thread P; (3) the reconstructed thread executes within the desired object; (4) the results are sent back to host A.
Figure 8.16: Object invocation through passing objects. (1) The request for the object is sent to host B; (2) host B sends the object back to host A; (3) thread P is executed locally within the object.
Clouds supports variable object granularity, since objects are composed of variable-size segments; often, the page sizes of the host machines place a lower limit on segment size.
Figure 8.17: MemNet architecture (hosts connected to the ring through bus interfaces and MemNet caches).
The cache coherence protocol of Clouds applies to segments. It dictates that a
segment must be disposed of when access to it has been completed. This eliminates
the need for invalidations when a segment is written to, but it forces the re-fetching
of the object when it is invoked again.
8.4.4 MemNet
The goal of the designers of the MemNet system was to improve the poor data/overhead ratio of the interprocess communication (IPC) that is common in software implementations. They chose a token-ring-based hardware implementation. The computers in the MemNet system are not connected to the token ring directly but instead are connected via a "MemNet device" interface (Figure 8.17). When a shared memory reference to a specific "chunk" (a 32-byte block) of memory is made, it is passed to the MemNet device. If the device indicates that the request can be satisfied locally, then no network activity is required. If the chunk is currently resident elsewhere, a request message for the chunk is sent. What happens next is determined by the type of the request.
On a read request, a message requesting the desired chunk specifically for reading is composed and sent. It travels around the token ring and is inspected by each MemNet device in turn. When the request reaches the MemNet device that actually has the desired chunk, the request message is converted into a data message, and the chunk
data is copied into it. The message then continues around the ring back to the requesting device; the remaining devices on the ring ignore the message because they recognize that the data has already been filled in.
When a particular host wants to write to a non-resident chunk, it not only must receive the data from the current owner but also is responsible for invalidating any other copies of the chunk that may exist at other MemNet devices. A write request is sent onto the network, and each MemNet device examines the request as it passes by. In a "snoopy cache" fashion, the ID of the desired chunk is checked against the write request message, and if a copy of the chunk exists locally, the MemNet device invalidates it. The message then proceeds to the next host. Note that even after the real owner has filled the data into the message, any MemNet device between the owner and the faulting host that has a copy of the chunk still invalidates its copy.
The final case is an invalidation message. This type of message is sent when the faulting MemNet device already contains the chunk it is about to write to; the message exists strictly to invalidate duplicates that may exist in the system. Its behavior is very similar to that of a write request, except that the data field of the message is never filled in by remote devices.
The replacement policy of MemNet is random. Each MemNet device has a large amount of memory that is used to store replaced chunks, in a similar fashion to the way main memory is used to store replaced cache lines in a caching system.
8.4.5 System Comparison
The most distinctive of the three systems is Clouds, whose object-oriented approach led to differences in the implementation of DSM basics. For instance, the memory structure of both IVY and MemNet is quite regular: a flat address space composed of fixed-size sections of memory (pages or chunks). The memory structure of the Clouds system, on the other hand, is quite irregular, consisting of variable-size segments. Keeping track of Clouds objects therefore seems more difficult, in that regular "page table" type constructs cannot be used.
The Clouds system also differs from the other two with regard to its page location and invalidation requirements. Clouds objects are essentially always associated with their owners, so no real page location mechanism is necessary. Furthermore, since each object is relinquished to its owner when it is discarded, the problem of invalidation is also eliminated. However, the Clouds system cannot take advantage of locality of reference, since each remote object must be migrated every time it is referenced.
With its 32-byte chunks, MemNet exhibits the highest degree of parallelism of the three systems. Furthermore, because the hardware-based MemNet IPC can be orders of magnitude faster than the other schemes' software-based IPC, the expected overhead of keeping track of the myriad chunks in the system is minimized. This speedup is further supported by MemNet's efficient technique of
quickly modifying request messages into data messages.
All three systems utilize single-writer/multiple-reader protocols in which the most recent value written to a page is returned. Clouds, however, has a weak-read consistency option that returns the value of the object at the time of the read, with no guarantee of the atomicity of the operation. This gives the system the potential for more concurrency, but at the same time it places some of the burden of data consistency on the programmer.
Bibliography
[1] Ahuja, N. Carriero, and D. Gelernter, "Linda and Friends", IEEE Computer, vol. 19, August 1986, pp. 26-34.
[2] N. Carriero and D. Gelernter, "The S/Net's Linda Kernel", Proceedings of the Symposium on Operating System Principles, December 1985.
[3] L. Dorrmann and M. Herdieckerhoff, "Parallel Processing Performance in a Linda System", 1989 International Conference on Parallel Processing, pp. 151-158.
[4] N. Carriero and D. Gelernter, "Linda in Context", Communications of the ACM, vol. 32, no. 4, April 1989, pp. 444-458.
[5] R. Finch and S. Kao, "Coarse-Grain Parallel Computing Using the ISIS Tool Kit", Journal of Computing in Civil Engineering, vol. 6, no. 2, April 1992, pp. 233-244.
[6] K. Birman and R. Cooper, "The ISIS Project: Real Experience with a Fault Tolerant Programming System", Operating Systems Review, ACM, vol. 25, no. 2, April 1991, pp. 103-107.
[7] J. Flower, A. Kolawa, and S. Bharadwaj, "The Express Way to Distributed Processing", Supercomputing Review, May 1991, pp. 54-55.
[8] Express Fortran User's Guide, Version 3.0, ParaSoft Corporation, 1990.
[9] V. Sunderam, "PVM: A Framework for Parallel Distributed Computing", Concurrency: Practice and Experience, December 1990, pp. 315-339.
[10] G. Geist and V. Sunderam, "Experiences with Network-Based Concurrent Computing on the PVM System", Technical Report ORNL/TM-11760, Oak Ridge National Laboratory, January 1991.
[11] A. Beguelin, J. Dongarra, G. Geist, R. Manchek, and V. Sunderam, "Graphical Development Tools for Network-Based Concurrent Supercomputing", Proceedings of Supercomputing 1991, pp. 435-444.
[12] M. Stumm and S. Zhou, "Algorithms Implementing Distributed Shared Memory", Computer, vol. 23, no. 5, May 1990, pp. 54-64.
[13] B. Nitzberg and V. Lo, "Distributed Shared Memory: A Survey of Issues and Algorithms", Computer, Aug. 1991, pp. 52-60.
[14] K. Li and P. Hudak, "Memory Coherence in Shared Virtual Memory Systems", ACM Transactions on Computer Systems, vol. 7, no. 4, Nov. 1989, pp. 321-359.
[15] K. Li and R. Schaefer, "A Hypercube Shared Virtual Memory System", 1989 International Conference on Parallel Processing, pp. 125-132.
[16] B. Fleisch and G. Popek, "Mirage: A Coherent Distributed Shared Memory Design", Proc. 14th ACM Symposium on Operating System Principles, ACM, New York, 1989, pp. 211-223.
[17] J. Bennett, J. Carter, and W. Zwaenepoel, "Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence", Proc. 1990 Conference on Principles and Practice of Parallel Programming, ACM Press, New York, 1990, pp. 168-176.
[18] U. Ramachandran and M. Y. A. Khalidi, "An Implementation of Distributed Shared Memory", First Workshop on Experiences with Building Distributed and Multiprocessor Systems, Usenix Association, Berkeley, Calif., 1989, pp. 21-38.
[19] M. Dubois, C. Scheurich, and F. A. Briggs, "Synchronization, Coherence, and Event Ordering in Multiprocessors", Computer, vol. 21, no. 2, Feb. 1988, pp. 9-21.
[20] J. K. Bennett, "The Design and Implementation of Distributed Smalltalk", Proc. of the Second ACM Conference on Object-Oriented Programming Systems, Languages and Applications, Oct. 1987, pp. 318-330.
[21] R. Katz, S. Eggers, D. Wood, C. L. Perkins, and R. Sheldon, "Implementing a Cache Consistency Protocol", Proc. of the 12th Annual International Symposium on Computer Architecture, June 1985, pp. 276-283.
[22] K. Li and P. Hudak, "Memory Coherence in Shared Virtual Memory Systems", ACM Transactions on Computer Systems, vol. 7, no. 4, Nov. 1989, pp. 321-359.
[23] P. Dasgupta, R. J. LeBlanc, M. Ahamad, and U. Ramachandran, "The Clouds Distributed Operating System", IEEE Computer, 1991, pp. 34-44.
[24] B. Fleisch and G. Popek, "Mirage: A Coherent Distributed Shared Memory Design", Proc. 14th ACM Symposium on Operating System Principles, ACM, New York, 1989, pp. 211-223.
[25] D. Lenoski et al., "The Directory-Based Cache Coherence Protocol for the Dash Multiprocessor", Proc. 17th International Symposium on Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 2047, 1990, pp. 148-159.
[26] R. Bisiani and M. Ravishankar, "Plus: A Distributed Shared-Memory System", Proc. 17th International Symposium on Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 2047, 1990, pp. 115-124.
[27] J. Bennett, J. Carter, and W. Zwaenepoel, "Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence", Proc. 1990 Conference on Principles and Practice of Parallel Programming, ACM Press, New York, 1990, pp. 168-176.
[28] D. R. Cheriton, "Problem-Oriented Shared Memory: A Decentralized Approach to Distributed Systems Design", Proceedings of the 6th International Conference on Distributed Computing Systems, May 1986, pp. 190-197.
[29] J. M. Bernabeu Auban, P. W. Hutto, M. Y. A. Khalidi, M. Ahamad, W. F. Appelbe, P. Dasgupta, R. J. LeBlanc, and U. Ramachandran, "Clouds - a Distributed, Object-Based Operating System: Architecture and Kernel Implementation", European UNIX Systems User Group Autumn Conference, EUUG, October 1988, pp. 25-38.
[30] F. Armand, F. Herrmann, M. Gien, and M. Rozier, "Chorus, a New Technology for Building UNIX Systems", European UNIX Systems User Group Autumn Conference, EUUG, October 1988, pp. 1-18.
[31] G. Delp, "The Architecture and Implementation of Memnet: A High-Speed Shared Memory Computer Communication Network", doctoral dissertation, University of Delaware, Newark, Del., 1988.
[32] Zhou et al., "A Heterogeneous Distributed Shared Memory", to be published in IEEE Transactions on Parallel and Distributed Systems.
Chapter 9
Load Balancing in Distributed Systems
9.1 Introduction
A general purpose distributed computer system can be characterized as a networked group of heterogeneous autonomous components that communicate through message passing. While these systems have the potential to deliver enormous amounts of computing capacity, much of this capacity is untapped because of the inability to share computing resources efficiently. For example, in experiments carried out on a cluster of workstations in a campus environment, the average utilization was found to be as low as 10% [Kreug90]. A high degree of reliability and overall performance can be achieved if computers share the network load in a manner that better utilizes the available resources. Efficient resource allocation strategies should be incorporated into any distributed system so that a transparent and fair distribution of the load is achieved. These strategies can be implemented either at the user level, with the support of a high-level library interface, or at the microkernel level; this decision is left to the designers of the system.
Livny and Melman [13] showed that as the size of a distributed system grows, the probability of having at least one idle processor also grows. In addition, due to the random submission of tasks to the hosts and the random distribution of the service times associated with these tasks, it is often the case that some hosts are extremely loaded while others are lightly loaded. This load imbalance causes severe delays for the tasks scheduled on the busy hosts and therefore degrades the overall performance of the system. Load balancing strategies seek to rectify this problem by migrating tasks from heavily loaded machines to less loaded ones. Although the objectives of load balancing schemes seem fairly simple, the implementation of efficient schemes is not.
Load balancing involves migrating a process (a task) to another node, with the objective of minimizing its response time as well as optimizing the overall system performance and utilization. It also refers to cooperation among a number of nodes in processing the units of one meta-job, where the set of nodes is chosen in a manner that results in a better response time. An example of a situation where load balancing is obviously a necessity is running one of today's software tools, such as a CAD package with sophisticated graphics and windowing requirements [?]. These tools are CPU intensive and have high memory requirements. In a CAD application, a user needs fast responses when he or she modifies the design. It is evident that the CPU-intensive calculations would be better performed on a remote idle or lightly loaded workstation, while the full power of the user's workstation is dedicated to the graphics and display functions. Load balancing is also clearly needed when performing parallel computations on a distributed system.
In this chapter we present some general classifications and definitions that are often encountered when dealing with load balancing. We discuss static, adaptive,
dynamic, and probabilistic load balancing, and present some case studies, followed by a discussion of the important issues and properties of load balancing systems. We finally present some related work.
Figure 9.1: Casavant's classification of load balancing algorithms (load balancing divides into static and dynamic; static into optimal and sub-optimal, with sub-optimal further split into approximate and heuristic; dynamic into centralized and distributed, with distributed further split into cooperative and non-cooperative).
9.2 Classifications and Definitions
In this section we mainly follow [21] to clarify some of the terms and definitions often used in the load balancing literature and to present a taxonomy of the different schemes employed (see Figure 9.1). Load balancing falls under the general problem of scheduling, or resource allocation and management. Scheduling formulates the problem from the consumers' (users', tasks') point of view, whereas resource allocation views the problem from the resources' (hosts', computers') side; it is a matter of terminology, and each is a different side of the same coin. In addition, the term global scheduling is often used to distinguish scheduling from local scheduling, which is concerned with activities within one node or processor. Our focus here is on global scheduling, which we will refer to simply as scheduling.
In the literature we often encounter the term load sharing in reference to the scheduling problem. In load sharing, the system load is shared by redistributing it among the hosts in the hope of achieving better overall system performance (in terms of response time, throughput, or accessibility to remote services). In performing load sharing, it is intuitive to make better use of idle, less busy, and more powerful hosts; the end result is better utilization of the resources, faster response time, and less load imbalance. Therefore, one can think of load balancing as a subset of load sharing schemes. We will be concerned with scheduling strategies that are designed with the goal of reducing load imbalance to achieve better performance.
A survey of the literature uncovers a multitude of algorithms proposed as solutions to the load balancing problem. A first-level classification includes static and
dynamic solutions. Static solutions require complete knowledge of the behavior of the application and the state of the system (nodes, network, tasks), which is not always available. Therefore, a static solution designed for a specific application might ignore the needs of other application types and the state of the environment, which may result in unpredictable performance. It is clear that dynamic load balancing policies, which assume very little a priori knowledge about the system and applications and which respond to the continuously changing environment and to the diverse requirements of applications, have a better chance of making correct decisions and of providing better performance than static solutions. While early work focused on static placement techniques, recent work has evolved toward adaptive load balancing. However, static solutions may be exploited for dynamic load balancing, especially when the system is assumed to be in steady state, or if the system has the capabilities or the tools to determine ahead of time the effectiveness of a certain static solution on a given architecture [Manish's dissertation]. Given long-term load conditions and the network state, static strategies can be applied, while dynamic policies are used to react to short-term changes in the state of the system.
In performing either static or dynamic load balancing, optimal solutions, such as those based on graph theory, mathematical programming, or space enumeration and search, are possible but are usually computationally infeasible. In general, achieving optimal schedules is an NP-complete problem; it becomes tractable only when certain restrictions are imposed on the program behavior or the system, such as restricting the weights of the tasks to be the same and the number of nodes (processors) to be two [40], or restricting the node weights to be mutually commensurable (i.e., all node weights are an integer multiple of some weight t). Alternatively, a suboptimal algorithm can be devised to allocate the resources according to the observed state of the system. Such an algorithm attempts to approach optimality at a fraction of the cost and with less knowledge than is required by an optimal algorithm. Most algorithms proposed in the literature are of the latter kind, as it has been found that for practical purposes suboptimal algorithms offer satisfactory performance.
Suboptimal solutions, which include both approximate and heuristic solutions, are less time consuming. Approximate solutions use the same models as the optimal solutions but require a metric that evaluates the solution space, stopping when a good (though not necessarily optimal) solution is reached. Heuristic solutions, on the other hand, use rules of thumb to make realistic decisions using whatever information is available on the state of the system.
Specific to dynamic load balancing is the issue of who is responsible for making and carrying out the decisions. In a distributed load balancing policy, this work is physically distributed among the nodes, whereas in a non-distributed policy this responsibility rests on the shoulders of a single processor. This brings up
the issue of the level of cooperation and the degree of autonomy in distributed dynamic scheduling. Nodes can either fully cooperate to achieve a system-wide goal or perform load balancing independently of each other. In addition, a distinction can be drawn between decentralized and distributed scheduling policies, where one (decentralized) is concerned with the authority and the other (distributed) with the responsibility for making and carrying out the decisions. Any system that has decentralized authority must also have distributed responsibility, but the opposite is not true. In centralized scheduling, on the other hand, a single processor holds the authority of control. In large-scale systems especially, both centralized and decentralized control can be used. Clusters are a common way for computer systems to exist; each cluster could be managed by a single centralized node. This creates a hierarchy of centralized managers across the whole system, where control is centralized within each cluster but decentralized across the whole system.
Another distinguishing characteristic of a scheduling system is how adaptive it is. In an adaptive system, the scheduling policy and the parameters used in modeling the decision-making process are modified depending on the previous and current responses of the system to the load balancing decisions. In contrast, a non-adaptive system does not change its control mechanism on the basis of the history of system behavior. Dynamic systems, on the other hand, simply consider the current state of the system when making decisions. If a dynamic system modifies the scheduling policy itself, then it is also adaptive. In general, any adaptive system is automatically dynamic, but the reverse is not necessarily true.
9.3 Load Balancing
In the next sections we discuss the different ways of performing load balancing: statically, dynamically, adaptively, and probabilistically. The sections on static load balancing and dynamic/adaptive load balancing are based on a survey conducted by [Harget and Johnson].
9.3.1 Static Load Balancing
Static load balancing is concerned with finding an allocation of processes to processors that minimizes the execution cost as well as the communication cost incurred by the load balancing strategy. General optimal solutions are NP-complete, so heuristic approaches are often used. It is assumed that the program or job consists of a number of modules and that the cost of executing a module on a processor and the volume of data flowing between modules are known, which is not necessarily the case; accurate estimation of task execution times and communication delays is difficult. The problem formulation involves assigning modules to processors in an optimal manner within given cost constraints. However, a static assignment does not consider the current state of the system when making the placement decisions. Solution methods include the graph theoretic approach, the 0-1 integer programming approach, and the heuristic approach.
The Graph Theoretic Approach:
In this method, modules are represented as nodes in a directed graph. Edges denote the flow between modules, and their weights denote the cost of sending data. Intraprocessor communication costs are assumed to be zero. [Stone (1977)] showed that this problem is similar to that of commodity flow networks, where a commodity flows from source to destination and the weights represent the maximum flow. A feasible flow through the network has the property that, at each node,

    sum of flows in = sum of flows out,

where flows into sinks and out of sources are non-negative and flows do not exceed capacities. What is required is to find the flow with the maximum value among all feasible flows; the min-cut algorithm achieves that (see the figure below), where the letters denote the program modules and the weights of the edges between modules and processors (S1 and S2) denote the cost of running these modules on each processor. An optimal assignment can be found by calculating the minimum-weight cutset. However, for an n-processor network, we need to determine n cutsets, which is computationally expensive. An efficient implementation can be achieved if a program has a tree-structured call graph.
An algorithm that finds a least costly path through the assignment graph can execute in time O(mn^2), where m is the number of modules and n is the number of processors [Bokhari, 1981].
(Figure: an example assignment graph with program modules A through F, processors S1 and S2, and weighted edges representing intermodule communication costs and module execution costs.)
The 0-1 Integer Programming Approach:
For this approach, the following is defined:
• c_ij: coupling factor = the number of data units transferred from module i to module j.
• d_kl: inter-processor distance = the cost of transferring one data unit from processor k to processor l.
• q_ik: execution cost = the cost of processing module i on processor k.
If modules i and j are resident on processors k and l respectively, then their total communication cost can be expressed as c_ij d_kl. In addition to these quantities, the assignment variable is defined as:

X_ik = 1 if module i is assigned to processor k, and 0 otherwise.
Draft: v.1, April 26, 1994
9.3. LOAD BALANCING
393
Using the above notation, the total cost of processing a number of user
modules is given as:
Total cost = Σ_i Σ_k ( q_ik X_ik + Σ_j Σ_l c_ij d_kl X_ik X_jl )
In this scheme, constraints can be added easily to the problem; for example, memory constraints can be expressed as:

Σ_i M_i X_ik ≤ S_k

where M_i is the memory requirement of module i and S_k is the memory capacity of processor k. Non-linear programming techniques or branch and bound techniques can be used to solve this problem. The complexity of such algorithms is NP-complete, and the main disadvantage lies in the need to specify the values of a large number of parameters.
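To make the cost function above concrete, the following minimal Python sketch evaluates the total cost of a given 0-1 assignment matrix X and checks the memory constraint. The function names and the way the data is represented are illustrative assumptions, not part of the formulation in the text.

# Sketch: evaluating the 0-1 integer-programming objective for a given assignment.
def assignment_cost(q, c, d, X):
    """q[i][k]: cost of executing module i on processor k
       c[i][j]: data units transferred from module i to module j
       d[k][l]: cost of moving one data unit from processor k to processor l
       X[i][k]: 1 if module i is assigned to processor k, else 0"""
    m, n = len(q), len(q[0])
    total = 0.0
    for i in range(m):
        for k in range(n):
            if X[i][k]:
                total += q[i][k]                        # execution cost
                for j in range(m):
                    for l in range(n):
                        if X[j][l]:
                            total += c[i][j] * d[k][l]  # communication cost
    return total

def memory_feasible(M, S, X):
    # Memory constraint: sum_i M[i] * X[i][k] <= S[k] for every processor k.
    return all(sum(M[i] * X[i][k] for i in range(len(M))) <= S[k] for k in range(len(S)))

A branch and bound or non-linear programming solver would search over candidate X matrices; the sketch only shows how one candidate assignment is costed and checked.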
The Heuristic Approach:
In the previous approaches, finding an optimal solution is computationally expensive. Therefore, heuristic methods are used to perform load balancing. Heuristics are rules of thumb: decisions are based on parameters which have an indirect correlation with system performance, and these parameters should be simple to calculate and to monitor. An example of such a heuristic is to identify a cluster of modules which pass a large volume of data between them and to place them on the same processor. This heuristic can be formulated as follows (a sketch of the first formulation follows this list):
1. Find the module pair with the most intermodule communication and assign it to one processor if the constraints are satisfied. Repeat this for all possible pairs.
2. Define a distance function between modules that measures communication between modules i and j relative to communication between i and all other modules and between j and all other modules. This function is then used to cluster modules with the highest valued distance function.
Additional constraints can be added so that execution and communication costs are considered.
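The first formulation above can be illustrated by a small greedy sketch in Python: repeatedly merge the module pair with the largest communication volume, subject to a simple per-processor capacity limit. The data structures, the capacity model and the example volumes are hypothetical assumptions.

# Sketch of heuristic 1: co-locate the module pair with the largest
# intermodule communication volume, subject to a cluster capacity limit.
def cluster_by_communication(comm, capacity):
    """comm: dict mapping frozenset({i, j}) -> data volume between modules i and j
       capacity: maximum number of modules one cluster (processor) may hold"""
    clusters = {}                                  # module -> set of co-located modules
    for pair in sorted(comm, key=comm.get, reverse=True):
        i, j = tuple(pair)
        ci = clusters.setdefault(i, {i})
        cj = clusters.setdefault(j, {j})
        if ci is not cj and len(ci) + len(cj) <= capacity:
            merged = ci | cj                       # merge the two clusters
            for m in merged:
                clusters[m] = merged
    return {frozenset(c) for c in clusters.values()}

# Hypothetical volumes between modules A-D:
volumes = {frozenset({"A", "B"}): 12, frozenset({"B", "C"}): 4, frozenset({"C", "D"}): 8}
print(cluster_by_communication(volumes, capacity=2))   # two clusters: {A, B} and {C, D}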
9.3.2 Adaptive / Dynamic Load Balancing
Computer systems and software are developing at a high rate, creating a very diverse community of users and applications with varying requirements. Optimal fixed solutions are no longer feasible in many areas; instead, adaptive and flexible solutions are
required. We do feel that it is not possible to find, for a wide range of applications, an optimal architecture, an optimal operating system or an optimal communication protocol.
In general, any adaptive or dynamic load balancing algorithm is composed of the following components: the processor load measurement, the system load information exchange policy, the transfer policy, and the cooperation and location policy. These algorithms differ in the strategy used to implement each of the above components; some might require more information than others or might consume fewer computation cycles in the process of making a decision.
A distributed environment is highly complex and many factors affect its performance. However, three factors are of utmost importance to the load balancing activities. These factors are the following:
1. System load: in making load balancing decisions, a node requires knowledge of its local state (load) as well as the global state of the different processors on the network.
2. Network traffic conditions: the underlying network is a key member of our environment. The state of the network can either be measured by performing experiments or predicted by estimating its state from some partial information. Network traffic conditions determine whether more cooperation can be carried out among the different nodes in making the load balancing decisions (which requires more message passing or communication) or whether the load balancing activities should rely more on stochastic, prediction and learning techniques (which are discussed in Section 9.3.3).
3. Characteristics of the tasks: these characteristics involve the size of the task, which is summarized by its execution and migration time. Also, tasks can be CPU bound, I/O bound or a combination of the two. The task description includes both quantitative parameters (e.g. the number of processors required for a parallel application) and qualitative parameters (e.g. the level of precision of the result or some hardware or software requirement). Estimating the task execution time is helpful in making the load balancing decision. This estimate can be passed on by the users using their knowledge of the task. Profile-based estimates generated by monitoring previous runs of similar applications are also possible. In addition, the program behavior can be simulated at run time by executing those instructions which affect the execution of the program the most and gathering statistics about the behavior. Probabilistic estimates have also been used.
Once the task characteristics are determined, and given knowledge of the state of the other nodes and the network, the effect of running this task locally or on
any other node can be determined.
The components of dynamic load balancing algorithms are discussed below to give
some insight into what is required of the solutions of the problem at hand.
The processor load measurement: a reasonable indicator of the processor's load is needed. It should reflect our qualitative view of the load, it should be stable (high frequency fluctuations discarded), it should be computed efficiently since it will be calculated frequently, and it should have a direct correlation with the performance measures used. Many methods designed to estimate the state of the processor have been suggested (e.g. Bryan81). The question to be answered is how to measure a processor's load.
Some measurements readily available in a UNIX environment include the following (a sketch of reading a few of them appears after this list):
1. The number of tasks in the run queue.
2. The rate of system calls.
3. The rate of CPU context switching.
4. The amount of free CPU time.
5. The size of the free available memory.
6. The 1-minute load average.
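As an illustration, a few of these measures can be read directly on a UNIX-like host. The sketch below is a minimal Python example; the /proc paths are Linux-specific and the exact field layout is an assumption to verify on the target system.

# Sketch: sampling some of the load measures listed above (Linux/UNIX).
import os

def one_minute_load_average():
    return os.getloadavg()[0]                 # 1-minute load average

def runnable_tasks_linux():
    # The 4th field of /proc/loadavg is "running/total"; subtract this process itself.
    with open("/proc/loadavg") as f:
        running, _total = f.read().split()[3].split("/")
    return max(int(running) - 1, 0)

def free_memory_kb_linux():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])    # value in kB

if __name__ == "__main__":
    print(one_minute_load_average(), runnable_tasks_linux(), free_memory_kb_linux())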
In addition, keyboard activity can be an indication of whether someone is physically logged on the machine or not. The question is which of the above is the most appropriate measure of the load. A weighted function of these measures can be calculated. An obvious deciding factor is the type of tasks residing at a node. For example, if the average execution time of the tasks is much less than one minute, then the one-minute load average measure is of little importance. Or, if the tasks currently scheduled on a machine cause a lot of memory paging activity, then the size of the free available memory might be a good indication of the load. Ferrari (1985) proposed a linear combination of all main resource queues as a measure of load. This technique determines the response time of a Unix command in terms of resource queue lengths. The analysis assumes a steady-state system and certain queueing disciplines for resources. Studies have shown, however, that the number of tasks in the run queue is the best of all these measures as an estimate of the node's load, and even better than a combination of all the measures available [Kunz91]. Still, as discussed above, one might conceive of situations where other measures make sense. Bryant and Finkel (1981) used the remaining service time to estimate the response time
Draft: v.1, April 26, 1994
396
CHAPTER 9. LOAD BALANCING IN DISTRIBUTED SYSTEMS
of a process arriving at one processor (which is used as an indication of the
processor's load) as follows:
RE(t) = t    (the remaining service time of a process that has already received t units of service)
R = RE(t_K)
for all J ∈ J(P) do
    if RE(t_J) < RE(t_K)
        then R = R + RE(t_J)
        else R = R + RE(t_K)
end
RSPE(K, J(P)) = R
where J(P) is the set of jobs resident on processor P.
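A direct Python rendering of this estimate is shown below; the example service times are hypothetical.

# Sketch of the Bryant and Finkel (1981) estimate: the remaining service time of
# a job is estimated by the service it has already received, and the expected
# response of an arrival at processor P adds the smaller of its own estimate and
# each resident job's estimate.
def remaining_estimate(service_received):
    return service_received                    # RE(t) = t

def estimated_response(t_K, resident_service_times):
    """t_K: service already received by the arriving job K
       resident_service_times: service received so far by each job in J(P)"""
    R = remaining_estimate(t_K)
    for t_J in resident_service_times:
        R += min(remaining_estimate(t_J), remaining_estimate(t_K))
    return R

# Hypothetical example: a job with 2.0s of service arrives where the resident
# jobs have received 1.0s and 5.0s of service.
print(estimated_response(2.0, [1.0, 5.0]))     # 2.0 + 1.0 + 2.0 = 5.0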
Workload characterization is essential for this component. In addition, stability needs to be maintained so that the cost of load balancing does not outweigh its benefits over a system using no load balancing. Means of establishing the age of the information are also essential so that stale information, which can cause instability, is avoided. A load measure with reduced fluctuations is obtained if the load value is averaged over a period at least as long as the time to migrate an average process [Kru84]. Additional stability is also introduced by using the idea of a virtual load, where the virtual load is the sum of the actual load plus the processes that are being migrated to that processor.
The system information exchange policy: it defines the periodicity and the manner of collecting information about the state of the system (e.g. short intervals, long intervals, or when drastic changes occur in the system state); i.e., it answers the question of when a processor should communicate its state to the rest of its community. Some heuristics are required to determine the periodicity of information collection, which is a function of the network state and the variance of the load measurement. In situations of heavy traffic, the nodes should refrain from information exchange activity and resort to estimating the system state using local information. So, one aspect of this policy involves estimating locally the state of the traffic on the network to determine the level of communication activity that would be reasonable. This can be done by performing experiments, such as sending a probe around the network and measuring the network delay, or by predicting the state of the network using previous information about the traffic conditions on the network. Another aspect is stability, which can be
maintained by recognizing transient information, keeping variance information and averaging over meaningful periods.
Different types of information can be exchanged. One type is status information such as busy, idle, or low, neutral and heavy load [Ni85]. This is high level information from which it would be hard to estimate the future state of a node. On the other hand, data representing the instantaneous low level or short term load information can be exchanged. However, this type of information varies at a high rate. Extracted long term trends, which usually do not fluctuate at a high frequency, are good candidates.
Different approaches have been adopted for the state information policy; we summarize some of them:
• The Limited Approach: when one processor is overloaded, a number of random processors are probed to find a processor to which processes can be off-loaded. Simulation results have shown that this approach improves performance for a simple environment of independent, short-lived processes [Eager et al. 1986].
• The Pairing Approach: in this approach [Bryant and Finkel, 1981], each processor cyclically sends load information to each of its neighbors in order to pair with a processor whose load differs greatly from its own. The load information consists of a list of all local jobs, together with jobs that can be migrated to the sender of the load message.
• The Load Vector Approach: in this approach, a load vector is maintained which gives the most recently received load value for a limited number of processors. The load balancing decisions are based on the relative difference between a processor's own load and the loads held in the load vector. The load of a processor can be in one of three states: light (L), normal (N) or heavy (H). The load vector of a processor's neighbors is maintained and updated when a state transition occurs [Ni, 1985]: L→N, N→L, N→H, H→N. To reduce the number of messages sent, the load message for an N→L transition is sent only if the previous state was heavy. In addition, broadcasting only the N→H transitions, and notifying neighbors of an H→N transition only when process migration is negotiated, further reduces the number of messages sent.
• The Broadcast Approach: in the Maitre d' system [Bershad 1985], one daemon process examines the Unix five-minute load average. If the processor can handle more processes, it broadcasts this availability. Every time the processor state changes, this information is broadcast. This method improves performance if the number of processors is small. Another alternative is to broadcast a message when the processor becomes
idle. This approach works efficiently if the network uses a broadcast communications medium.
• The Global System Load Approach: processors calculate the load on the whole system and adjust their own load relative to this global value. When the processor load differs significantly from the average load, load balancing is activated; the difference should not be too small and also not too large. In the Gradient Model algorithm [Lin and Keller, 1987], the global load is viewed in terms of a collection of distances from a lightly loaded processor. The proximity W_i of processor i is its minimum distance from a lightly loaded processor, determined as follows:

W_i = min_K { d_iK : g_K = 0 }    if ∃K such that g_K = 0
W_i = W_max                       otherwise, where
W_max = D(N) + 1
D(N) = max { d_iJ : i, J ∈ N }

The global load is then represented by a gradient surface:

GS = (W_1, W_2, ..., W_n)

This method gives a route to a lightly loaded processor with a minimum cost.
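Under the reconstruction above, the gradient surface can be computed from a distance matrix and the set of lightly loaded nodes, as in the following Python sketch; the distance matrix and load flags are hypothetical.

# Sketch: proximity W_i of each node, i.e. its minimum distance to a lightly
# loaded node, or W_max = D(N) + 1 when no node is lightly loaded.
def gradient_surface(dist, lightly_loaded):
    """dist[i][j]: hop distance between nodes i and j
       lightly_loaded[i]: True if node i reports a light load"""
    n = len(dist)
    diameter = max(dist[i][j] for i in range(n) for j in range(n))
    w_max = diameter + 1
    light = [k for k in range(n) if lightly_loaded[k]]
    return [min((dist[i][k] for k in light), default=w_max) for i in range(n)]

# Example: a 3-node line network 0 - 1 - 2 where only node 2 is lightly loaded.
dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
print(gradient_surface(dist, [False, False, True]))   # [2, 1, 0]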
The transfer policy: this component deals with the questions of deciding under what conditions the migration of a process is to be considered and which processes (large, small or newly arriving) are best fit for migration.
Static threshold values have been used, where processes are off-loaded to other processors when the load goes beyond a certain threshold that is chosen experimentally. On the other hand, an underloaded processor could seek to accept processes from other peer processors either when the processor becomes idle or when a normal-to-light load transition occurs.
It is common to consider only newly arriving processes; however, other processes might benefit more from the migration. [Kreuger and Finkel (1984)] proposed the following:
Draft: v.1, April 26, 1994
9.3. LOAD BALANCING
399
1. Migration of a blocked process may not prove useful, since this may not affect the local processor load.
2. Extra overhead will be incurred by migrating the currently scheduled process.
3. The process with the best current response ratio can better afford the cost of migration.
4. Smaller processes put less load on the communications network.
5. The process with the highest remaining service time will benefit most in the long term from migration.
6. Processes which communicate frequently with the intended destination processor will reduce communications load if they migrate.
7. Migrating the most locally demanding process will be of greatest benefit to local load reduction.
Some of the above considerations might be conflicting. For example, a small process (in terms of size and computation time) might put less load on the communication network, while at the same time a big process will benefit most in the long term from migration; resolving these conflicts requires developing rules that incorporate the transfer policy criteria.
Maintaining stability is also an important issue under this policy. One scenario involves a task being migrated from one node to another many times. However, limiting the number of migrations to a predefined value could be a conservative approach if a task can afford these migrations.
The cooperation and location policy: it involves choosing methods of cooperation between processors to find the best location for migrating a process. The level of cooperation can be adjusted according to the network conditions. A reasonable approach is to rely on cooperation and controlled state information exchange in situations of low and medium traffic and load and, on the other hand, to refrain from information exchange in conditions of high load and traffic and rely more on making as many decisions as possible locally, by applying methods of prediction and inference using uncertain or not up-to-date information [Pasqu86].
Methods of cooperation include the sender-initiated approach (an overloaded processor initiates locating a suitable underloaded processor, among neighbours or across the whole network), the receiver-initiated approach (the reverse of the previous method), or the symmetric approach (which has both sender-initiated and receiver-initiated components). Many studies were carried out to compare these different schemes.
For example, it was found that under low to moderate loading conditions the sender-initiated approach performed better, whereas the receiver-initiated approach performed better under heavy load [Eage85]. One major disadvantage of receiver-initiated algorithms is that they require preemptive task transfers, which are expensive because they usually require saving and communicating the current state of the process. One might note that under low load conditions, the difference between the performance of a variety of strategies is not important. In addition, low load conditions allow us the freedom to attempt some interesting strategies, for instance running the same job on many idle machines and simply waiting for the fastest response.
1. Sender-Initiated approaches: initiating load balancing from an overloaded processor is widely studied. [Eager (1986)] studied three simple algorithms where the transfer policy is a simple static threshold (a sketch of variant (b) follows this discussion):
(a) choose a destination processor at random for a process migrating from a heavily loaded processor (the number of transfers is limited to only one).
(b) choose a processor at random and then probe its load. If it exceeds a static threshold, another processor is probed, and so on until one is found in less than a given number of probes. Otherwise, the process is executed locally.
(c) poll a fixed number of processors, requesting their current queue lengths and selecting the one with the shortest queue.
Performance of these algorithms was evaluated using K independent M/M/1 queues to model the no load balancing case and an M/M/K queue to model the load balancing case. All algorithms provided an improvement in performance. The threshold and shortest queue policies provided extra improvement when the system load is beyond 0.5. This study showed that simple policies are adequate for dynamic load balancing.
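Variant (b) of the threshold policy can be sketched in a few lines of Python. The probing function is an assumed stand-in for a real remote query, and the threshold and probe limit are hypothetical parameters.

# Sketch of a sender-initiated threshold policy: probe up to probe_limit random
# processors and transfer to the first whose queue length is below the threshold;
# otherwise execute the process locally.
import random

def threshold_placement(local_id, peers, queue_length_of, threshold=2, probe_limit=3):
    """peers: identifiers of the other processors
       queue_length_of: callable returning the current queue length of a peer"""
    for peer in random.sample(peers, min(probe_limit, len(peers))):
        if queue_length_of(peer) < threshold:
            return peer                        # transfer the job to this processor
    return local_id                            # run locally after unsuccessful probes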
[Stankovic (1984)] proposed three algorithms which are based on the relative difference between processor loads. The information exchange is through periodically broadcasting local values:
(a) choose the least-loaded processor if the load difference is larger than a given bias:
• if the difference > bias 1, migrate one process.
• if the difference > bias 2, migrate two processes.
(b) similar to (a), except there is no further migration to that processor for a given period 2t.
Estimating the parameters (bias 1, bias 2 and 2t) is a difficult problem.
[Kreuger and Finkel, 1984] used a globally agreed average load value where, when a processor becomes overloaded, it broadcasts this fact. An underloaded processor responds, indicates the number of processes that it can accept, and adjusts its load by the number of processes which it believes it will receive. If the overloaded processor receives no response, it assumes that the average value is too low, increases this global value and then broadcasts it. This algorithm adapts quickly to fluctuations in the system load.
In the Gradient Model load balancing algorithm [Lin and Keller, 1987], each processor calculates its own local load. If the local load is light, it propagates a pressure of 0 to its neighbors; if the load is moderate, the propagated pressure is set to one greater than the smallest value received from all its neighbors. In this algorithm, global load balancing is achieved by a series of local migration decisions. Migration occurs only at heavy loads, and processes migrate toward lightly loaded processors.
2. Receiver-Initiated approaches: [Eager et al., 1985] have studied these variations:
(a) When the load on one processor falls below some static threshold T, it polls random processors to find one whose load would remain above T if one of its processes were migrated.
(b) similar to (a), but instead of migrating a currently running process, a reservation is made to migrate the next newly arriving process, provided there are no other reservations. Simulation results showed that this does not perform as well as (a).
For broadcast networks, [Livny and Melman, 1982] proposed two receiver-initiated policies:
(a) A node broadcasts a status message when it becomes idle, and receivers of this message carry out the following actions (ni denotes the number of processes executing on processor i):
i. If ni > 1 continue to step ii, else terminate the algorithm.
ii. Wait D/ni time units, where D is a parameter depending on the speed of the communications subsystem; by making this value dependent on the processor load, more heavily loaded processors will respond more quickly.
iii. Broadcast a reservation message if no other processor has already done so (if one has, terminate the algorithm).
iv. Wait for a reply.
Draft: v.1, April 26, 1994
402
CHAPTER 9. LOAD BALANCING IN DISTRIBUTED SYSTEMS
v. If reply is positive, and ni > 1, migrate a process to the idle
processor.
(b) This broadcast method might overload the communication medium, so a second method is to replace broadcasting by polling when idle (see the sketch after these steps). The following steps are taken when a processor's queue length reaches zero:
i. Select a random set of R processors (a1, ..., aR) and set a counter j = 1.
ii. Send a message to processor aj and wait for a reply.
iii. The reply from aj will either be a migrating process or an indication that it has no processes.
iv. If the processor is still idle and j < R, increment j and go to step ii; else stop polling.
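The polling-when-idle policy above can be rendered directly in Python; request_process stands in for the real message exchange and is an assumption of this illustration.

# Sketch of the second Livny and Melman policy: when the local queue empties,
# poll up to R randomly chosen processors and stop as soon as one of them
# migrates a process to the idle node.
import random

def poll_when_idle(processors, request_process, R=3):
    for peer in random.sample(processors, min(R, len(processors))):
        migrated = request_process(peer)       # reply: a migrating process or None
        if migrated is not None:
            return migrated                    # the processor is no longer idle
    return None                                # still idle after polling R peers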
3. Symmetrically Initiated Approach: at low loads, if receiver-initiated policies are used, the load balancing activity is delayed since the policy is not initiated as soon as a node becomes a sender. Under heavy loads, the sender-initiated policy is ineffective since resources are wasted in trying to find an underloaded processor. A symmetrically initiated approach takes advantage of both policies at their corresponding peak performance and therefore performs well under all load conditions.
Unstable conditions might result under the location policy. Under active load fluctuations, stable adaptive symmetric algorithms are needed, where system information is used to the fullest to adapt swiftly to the changing conditions and to prevent instability. In addition, a situation might arise where a number of nodes make the same choice of where to send their jobs. This is an example of optimal local decisions rendered non-optimal when measured on a global scale. This can be avoided in many ways; one involves making more than one choice of comparable quality and then randomly choosing one of them to avoid conflicts.
From the above discussion of the four load balancing components, we can determine some required characteristics of the possible solutions. These include the ability to make inferences and decisions under uncertainty, the ability to hypothesize about the state of the system and to perform tests and experiments to verify a hypothesis, the ability to resolve conflicting solutions and judge what is possibly the best solution, the ability to reason with time and to check the lifetime of information, the ability to detect trends and learn from past experience, and the ability to make complex decisions using available and predicted information under stringent
time limits. This leads us to the next section where we discuss probabilistic load
balancing.
9.3.3 Probabilistic Load Balancing
In performing load balancing, a node makes local decisions which depend on its own local state and the states of the other nodes. In the distributed environment considered, the state information is shared by message passing, which cannot be achieved instantaneously. In addition, the states of the nodes are constantly changing. Therefore, acquiring up-to-date state information about all components of the network (the global system state) can be expensive in terms of communication overhead and of maintaining stability on the network. A reasonable approach is to rely on cooperation and controlled state information exchange in situations of low and medium traffic and load conditions. On the other hand, nodes should refrain from information exchange in conditions of high load and traffic and rely more on making as many decisions as possible locally, by applying methods of prediction and inference using uncertain or not up-to-date information.
In general, for a typical local area network or cluster of nodes, a large amount of data reflecting the state of the network and its components is available. In such systems, procedural programming becomes too cumbersome and too complex, since the programmer must foresee every possible combination of inputs and data values in order to write code for these different states. The overhead of performing the task of load balancing can be tolerated, and the quality of service requirements of the applications can be met, only if the process of decision making and data monitoring is performed at a high rate. Therefore, the decision making process is fairly complex, mainly due to the uncertainty about the global state, to the large amounts of data available, and to the diverse state scenarios that can arise, which require different possible load balancing schemes. Analytically examining all the possible solutions is a fairly time consuming task. Probabilistic scheduling, where a number of different schedules are generated probabilistically, has been used. Team decision theory can also be applied to the load balancing problem, where it is possible to cooperate and to make decisions concerning the best schedules in collaboration with other similar entities.
Uncertainty handling and quantification are at the heart of decentralized decision theory as applied to the load balancing problem. One approach is Bayesian probability [Stank81], where a priori probabilities and the costs associated with each possible course of action are used to calculate a risk function or a utility function. Minimizing the risk function or maximizing the utility function results in what is often referred to as a likelihood ratio test, where a random variable is compared to a threshold and a decision is made. In Bayesian methods, point probabilities are applied to qualify information with confidence measures and to determine the most likely state of the system. From the inferred
state, a scheduling decision can be made. This approach has its merits, since decision theory and utility theory are based on point probabilities. In addition, the probability functions of the random variables needed in the decision making process can be built by observing the system in real time. Other possibilities for representing uncertainty include the intervals of uncertainty used in the Dempster-Shafer theory of belief functions and the linguistic truth values used by fuzzy logic.
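To make the likelihood ratio test concrete, the following Python sketch decides whether to off-load to a peer by comparing the likelihood of an observed queue length under a "lightly loaded" and a "heavily loaded" hypothesis against a threshold derived from priors and costs. The Poisson load model, the priors and the costs are all hypothetical illustrations, not part of the text's formulation.

# Sketch of a Bayesian likelihood-ratio decision for one load balancing choice.
from math import exp, factorial

def poisson(k, mean):
    return exp(-mean) * mean ** k / factorial(k)

def migrate_decision(observed_queue, mean_light=1.0, mean_heavy=5.0,
                     prior_light=0.5, cost_false_migrate=2.0, cost_missed=1.0):
    ratio = poisson(observed_queue, mean_light) / poisson(observed_queue, mean_heavy)
    # Bayes threshold: (cost of migrating to a heavy peer) * P(heavy)
    #                  / ((cost of missing a light peer)  * P(light))
    threshold = (cost_false_migrate * (1 - prior_light)) / (cost_missed * prior_light)
    return ratio > threshold                   # True -> treat the peer as light, migrate

print(migrate_decision(0), migrate_decision(6))   # True, False with these numbers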
Once the uncertainty about the system state information is quantified and qualified, parametric models can be built to represent the decision making process. However, there is no single model that can represent such a complex system in all cases. Instead, multiple models need to be constructed, each applied in specific settings. Parameters for those models should adapt to the system state and can be learned by applying neural network technology. In addition, some information is unstructured and cannot be generalized in such models; heuristics and special-case rules can be applied to express such information. These facts have led to research into the use of knowledge-based techniques, including expert systems where the case-specific knowledge is represented in the form of rules. A rule-based approach supports modularity, which adds more flexibility when modifying or replacing rules. Another advantage of rule-based programming is that the knowledge and the control strategy used to solve the problem can be explicitly separated in the rules, in contrast to procedural programming where the control is buried in the code.
Some of the characteristics of knowledge-based systems include the following [NASA91]:
1. Continuous operation even when a problem arises. Diagnosis is done in conjunction with continued monitoring and analysis.
2. The ability to make inferences and to form hypotheses from simple knowledge, and to attach degrees of belief to them by testing them.
3. Context focusing, where, depending on the situation at hand, a context is defined to expedite decision making, so that only certain rules and data are considered.
4. Predictability, especially for complex situations, where it is difficult to predict the amount of processing required to reach a conclusion. Therefore, in expert systems, heuristics are available to ensure an acceptable conclusion, not necessarily the optimal one, in a reasonable time.
5. Temporal reasoning, which is the ability to reason with a distinction between the past and the present.
6. Uncertainty handling, where the expert system can validate data and reason with uncertainty.
7. Learning capabilities, where the expert system incorporates mechanisms for learning about the system from historical data and for making decisions based on trends.
8. The ability to cooperate and make decisions in collaboration with other similar entities (team decision theory in distributed artificial intelligence).
9.4 Special Issues/ Properties
Load balancing utilities must possess the following properties:
Efficiency: the overhead of decision making, information exchange and job migration should not exceed the benefits of load balancing.
Stability: as discussed in previous sections, maintaining stability is an important consideration in all of the load balancing components.
Scalability: the overhead of increasing the pool of available processors should be tolerable and should not affect the efficiency of the system.
Configurability: a system administrator should be able to reconfigure the system to the current needs of users and applications.
General purpose: the system should be unrestricted so that it can serve a diverse community of users and applications.
In addition to the above properties, we will discuss certain issues that are of importance to load balancing.
Heterogeneity: as discussed in [Utopia], heterogeneity takes many forms: configurational heterogeneity, where processors may differ in their processing power, disk and memory space; architectural heterogeneity, which restricts code from being executed on different machines; and operating system heterogeneity. Across a computer network, all three types of heterogeneity might exist. A load balancing system should be able to deal with this property and to take advantage of it. Taking advantage of configurational heterogeneity is at the heart of the load balancing activity. Architectural heterogeneity should be made transparent to the user, who should be able to use all the different applications whether or not they are available on her/his machine. And finally, the load balancing system should have a uniform interface with the operating system irrespective of the operating systems' heterogeneity.
Migration: migration is defined as the relocation of an already executing job (preemptive, as opposed to non-preemptive transfer where only newly arriving processes are considered). There is an overhead involved in that process, and it includes saving the state of the process (open files, virtual memory contents, machine state, buffer contents, current working directory, process ID, etc.), communicating this state to the new processor, and migrating the actual process and initiating it on the new processor. Many techniques are applied and they
include entire virtual memory transfers, pre-copying, lazy copying (copy on reference), etc. So when is migration feasible? This question has been addressed in the literature [Krueger and Livny 1988], and the answer depends largely on what kind of file system is involved (a shared file system, local secondary storage and/or replicated files). In general, however, in a loosely coupled system where communication is through message passing, the process state has to be transferred via the network, and in most cases it is a very expensive task.
Decentralized vs. Centralized: an important aspect of the load balancing problem is the issue of using either centralized or decentralized solutions. Both approaches have been studied in the literature. Decentralized solutions are more popular for the following reasons:
1. An inherent drawback of any centralized scheme is that of reliability, since a single central agent represents a critical point of failure in the network. This is opposed to decentralized resource managers, where if one fails, only the portion of the resources on that specific machine becomes unavailable. Also, centralized but replicated servers are more expensive to maintain if they are to survive individual crashes.
2. The load balancing problem itself may be a computationally complex task. A centralized approach towards optimization ignores the distributed computational power inherent in the network itself and instead utilizes only the computing power of a single central agent.
3. The information required at each step of an application may itself be distributed throughout the system. With a high rate of load balancing activity, a centralized agent will become a bottleneck.
From the above, it is clear that in order to avoid the problems endemic to centralized approaches, we should focus our design efforts on solutions that are as decentralized as possible. Although decentralized solutions are usually more complex, this complexity can be overcome. A reasonable approach, which makes use of the natural clusters that exist in computer networks, is to build semi-decentralized systems in which one node in each cluster of nodes acts as a central manager. This off-loads the nodes in the cluster from the tasks of monitoring, information collection and distribution, while at the same time achieving a relatively high degree of decentralization across the whole network (many clusters).
Predictability: in a distributed system, the task of load balancing is more complicated than in other types of systems because of the high degree of predictability required in the quality of service that is provided to the users.
For one thing, the process of harnessing all the computing power in a distributed environment should be transparent to the user, to whom the system would appear as a single time-shared machine. In addition, the user should not be expected to make many changes, if any, to his/her code in order to be able to use the load balancing facilities.
The CPU is considered the most contended resource. The primary memory is another resource to be considered. I/O configurations differ from one system to another, and therefore it is difficult to capture I/O requirements in a load balancing model. A node owner (whoever is working on the console of the node) should not suffer from less access to the resources (CPU and memory) than was available on the autonomous node before load balancing was performed. Conservative approaches are possible to preserve predictability of service, where a node is considered available for foreign processes (those executing at a node but not initiated by the owner of that node) only if no one has been logged on the console for a period of time. Therefore, any foreign process that is executing when an owner begins using a node is preempted automatically and transferred from the node to a holding area, from which it is transferred to an idle node when one is located. This conservative approach is costly and is not efficient in making use of the computing capacity of the distributed system.
The goal here is to allocate local resources so that processes belonging to the owner of the node get whatever resources they need, and foreign processes get whatever is left. Therefore, a more flexible approach to preserving the predictability of service is to prioritize resource allocation locally on the owner's node by performing local scheduling for the CPU and possibly the main memory. Such schemes have been studied in [The Stealth] and were shown to be very effective at insulating owner processes from the effects of foreign processes; at the same time, such schemes make better use of nearly all unused capacity and also reduce overhead by avoiding unnecessary preemptive transfers. Local scheduling is necessary for the efficient preservation of the predictability of service and for the better use of unused capacity. Very little work has been done on local scheduling where foreign and local processes are distinguished from one another. The local scheduling technology of time-sharing uniprocessor systems can be applied in this area. Mechanisms for controlling the priority of a process dynamically will depend on factors such as whether it is executing on a local or a remote site and whether it is part of a parallel synchronous application.
9.4.1 Knowledge-based Load Balancing
In the previous section we discussed the characteristics of knowledge-based systems, or expert systems, which offer formidable solutions to the difficulties encountered in performing the load balancing tasks. In this subsection, we will discuss some of our research ideas. We are investigating rule-based techniques for developing an expert system environment in which heuristic knowledge and real-time decision making are used to manage complex and dynamic systems in a decentralized fashion. An expert system manager integrates all the different schemes available for implementing the different load balancing components in a knowledge base. By using the state information contained in the data base (operating systems are full of readily available statistics about the state of the system's resources, queues, etc.), decisions can be inferred, in an adaptive and dynamic manner, about the best possible schemes. These selected schemes become the building blocks of the current load balancing strategy.
In general, we would like our environment to support the following: (1) monitoring the state of the system, (2) constructing load balancing schemes dynamically from already existing parts, (3) mapping the load balancing schemes onto the most efficient and available architectures, and (4) performing diagnosis on these schemes and initiating any necessary corrective measures. Having determined the features required of our environment, it is now possible to design an architecture with the following main components (see the figure below):
The data base, which contains information about the distributed system in the form of facts.
The knowledge base, which has the rules that implement the four components of dynamic load balancing schemes.
The load balancer, which contains the system control.
The learner, which observes past trends and formulates new facts and hypotheses about the system; it can also perform experiments to test them.
The predictor, which uses state information contained in the data base and estimates current states using analytical models of the system. In addition, neural networks can be used to perform predictions. The model or neural network parameters can be determined experimentally and can be dynamically changed to adapt to the varying conditions.
The critic, which is developed to improve the collected knowledge, and to doubt, monitor and repair expert judgment about the system.
Draft: v.1, April 26, 1994
410
CHAPTER 9. LOAD BALANCING IN DISTRIBUTED SYSTEMS
The explanation facility, which explains the reasoning of the system, and the knowledge acquisition facility, which provides an automatic way to enter knowledge into the system. These facilities are accessed, for example by a system administrator, via the external interface.
The main modes of functionality of our environment are load balancing scheme composition and execution. In the composition phase, the system state parameters are obtained from the expert system data base. In addition, the application description is passed to the expert manager for each application or job. This description includes both quantitative and qualitative parameters. Some applications might require a specific algorithm that is requested by an established name; our system should be able to synthesize this algorithm using libraries that describe it. Using the application characterization and the local and global system state description, the different load balancing components are reconfigured by establishing their respective thresholds and the levels of their functionality. Our goal is to devise methods to select dynamically, at run time, the best scheme to implement each component according to the application requirements and the system state information. Each selected scheme becomes a building block from which a final load balancing strategy that suits the application and network state is composed (see the second figure below).
As described before, these policies include the method of measuring the load (how), the periodicity of information collection (when), the type of processes or applications to migrate (which), and how to cooperate and choose the execution locations (where). The policies adapt to the environment by firing the appropriate rules for each of the above components.
Some of these rules are as simple as initiating a state broadcast to other nodes if the node decides it needs to and the conditions on the network allow for it. Other rules are more complex and are designed to resolve conflict by attaching weights or priorities to the conflicting solutions, where the weights reflect the dynamic situation on the network.
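As a minimal illustration of this rule-based style, the sketch below encodes a few such rules as (condition, action) pairs over a small fact base and fires the ones that apply. The fact names and the rules themselves are hypothetical, not the actual knowledge base.

# Sketch: firing simple load balancing rules against the current fact base.
RULES = [
    (lambda f: f["local_load"] == "heavy" and f["network_traffic"] == "low",
     "broadcast local state and search for a receiver"),
    (lambda f: f["network_traffic"] == "high",
     "suppress information exchange; predict peer states locally"),
    (lambda f: f["local_load"] == "light",
     "announce availability to neighbouring nodes"),
]

def fire_rules(facts):
    return [action for condition, action in RULES if condition(facts)]

facts = {"local_load": "heavy", "network_traffic": "low"}
print(fire_rules(facts))    # ['broadcast local state and search for a receiver']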
In the execution phase, the resulting composed or requested scheme is executed. In addition, with each job/metajob, a job manager process is started to monitor the execution and report the results to the application initiator. This process ends with the completion of the application.
[Figure: Architecture of the knowledge-based load balancing environment at a node. The knowledge base holds the load measurement, information exchange, transfer and location policies; around it sit the data base, learner, predictor, critic, explanation facility, knowledge acquisition facility, external interface, load optimizer, dispatcher and the network job queue, all connected to the node and the network.]
[Figure: Composing load balancing schemes over time. S_# denotes strategy #, a building block chosen for each component (load measurement, information exchange policy, transfer policy, cooperation policy); different combinations of building blocks form the first, second and third load balancing schemes.]
9.5 Related Work
9.5.1 J. Pasqual's Model (theoretical state space model)
In this work, a decentralized control system is modeled as a directed graph with nodes representing agents and links representing inter-agent influences. The model of an agent Ai is given by an 8-tuple consisting of Ai's state, Ai's generated work, the work distribution, Ai's transferred work, Ai's global state information, the rest of the agents' influence on Ai (which is the work transferred from the rest of the agents to Ai and Ai's global state information), Ai's decision space, Ai's decision rules, and Ai's means of establishing the next state or the state transition. These tuples will be defined in a more rigorous mathematical manner as needed in our discussion.
For a load balancing problem, the decision space D = {d1, d2, ..., dK} (assuming for simplicity that each agent has the same decision space) will contain such decisions as transfer a job or not, communicate information or not, request information or not, etc.
In general, it is desirable to make decisions that minimize a general stepwise loss function L(t), which consists of the loss due to decision quality degradation (due to aging of information, for instance), the loss due to communication overhead, the loss due to the time spent evaluating the decision rule, and the loss due to random effects arising from the stochastic nature of the system.
The low-level state information Xi is much too large and its level of detail is too cumbersome to deal with (store, communicate, etc.) and is unnecessary. Therefore, an agent uses an indicator I(xi), which is a readily accessible portion of the low-level state, such as the value of a single memory location in a computer (like the CPU queue length) or a small set of instructions which compute a value. Hence the abstract state Yi can be defined as a function of this indicator (like an average queue length, for instance).
Indicator values that change rapidly would not work well. However, a common scenario is that the indicator values fluctuate rapidly about a more slowly changing fundamental variation. If a sampled sequence of the indicator values is considered and treated as a time series, the high frequency components can be filtered out, leaving only the fundamental components, by the use of moving average or autoregressive techniques (short-term averaging). These sampled, filtered and rounded values are called Bi (a discrete version of Yi). Bi is limited, because of the finiteness of the memory of real machines, to at most Bmax. Homogeneous machines are assumed, so they all have the same limitation. Also, it is assumed that the load does not change in a continuous fashion; rather, it remains constant at some load level for an unpredictable interval of time, after which it changes to a new load level. By taking a moving average of Bi over a certain number of periods, a long term average or
a load level Li is obtained. A measure of the degree of variability Vi of Bi about Li is calculated by taking the moving average of the absolute difference between the two over a certain number of past load values.
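The filtering described here can be sketched as two nested moving averages plus a mean absolute deviation; the window lengths and run-queue samples below are hypothetical.

# Sketch: a short-term average of indicator samples gives B_i, a longer average
# of B_i gives the load level L_i, and the mean absolute deviation of B_i about
# L_i gives the variability V_i.
def moving_average(values, window):
    recent = values[-window:]
    return sum(recent) / len(recent)

def filtered_load(indicator_samples, b_history, short_window=5, long_window=8):
    b_now = round(moving_average(indicator_samples, short_window))   # B_i (discrete)
    b_history.append(b_now)
    level = moving_average(b_history, long_window)                   # load level L_i
    deviations = [abs(b - level) for b in b_history[-long_window:]]
    variability = moving_average(deviations, long_window)            # V_i
    return b_now, level, variability

history = []
samples = [2, 3, 2, 8, 2, 3, 2, 2, 9, 2, 3, 2]          # hypothetical run-queue samples
for t in range(5, len(samples) + 1):
    b_now, level, variability = filtered_load(samples[:t], history)
print(b_now, round(level, 2), round(variability, 2))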
The model used has the Markovian property: the probability distribution of the next state, given the past states and the current state, depends only on the value of the current state, not on those of past states. A first order Markovian model can be conveniently represented as a matrix of one-step transition probabilities P, with the n-step transition matrix given by P^n.
The model should allow an agent Ai to predict the possible states of agent Aj, based on Ai's most recent reception of information about Aj. Agents exchange their load levels and their degree of variability, which is more valuable than simply the value of Bi, as that value may change very quickly. The state transition model is given by:

p(Bj(n) = β | Lj(n − k) = λ, Vj(n − k) = v) = [P_v^k]_{λβ}

where [P_v^k]_{λβ} is the element in row λ and column β of the k-step state transition probability matrix P_v^k, and P_v is the one-step transition matrix of size Bmax × Bmax, in which the probability of remaining in the same state after one transition is α_v (except for states 0 and Bmax, for which this probability is (1 + α_v)/2); α_v is a decreasing function of Vj (hence the subscript v) which is determined experimentally.
This model is called the steady state model, in that in making any predictions it assumes that the remote agent's load has remained at the same load level since the reception of the load level information.
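Given a one-step matrix P_v and the last reported load level, the k-step prediction reduces to a matrix power, as in the Python sketch below; the small three-state matrix is a hypothetical stand-in for P_v.

# Sketch: predicting a remote agent's load distribution k steps after its last
# report by raising the one-step transition matrix to the k-th power and reading
# the row of the reported load level.
def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mat_pow(P, k):
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, P)
    return result

P_v = [[0.6, 0.4, 0.0],        # hypothetical one-step transition probabilities
       [0.2, 0.6, 0.2],        # between three load states
       [0.0, 0.4, 0.6]]

reported_level = 1             # last load level received from the remote agent
k = 3                          # reporting periods elapsed since then
print([round(p, 3) for p in mat_pow(P_v, k)[reported_level]])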
The model was validated using simulation. The parameters used were determined experimentally by observing a distributed system. The results show an improvement in job response time of up to 67%.
9.5.2 Dynamic Load Balancing Algorithms
The dynamic load balancing algorithm investigated in this subsection was proposed by Anna Hac and Theodore J. Johnson. This load balancing algorithm operates in a distributed system consisting of several hosts connected by a local area network, whose file system is modeled on the LOCUS distributed system. This file system allows replication of files on many hosts. A multiple reader, single writer synchronization policy with a centralized synchronization site is provided to update remote copies of files. Their algorithm uses information collected in the system for optimal process and read site placement.
The model of a host and of a distributed system with N hosts is shown in Figure 9.2. The system is modelled as an open queuing local area network that consists of
several interconnected queued servers. Each CPU uses a round-robin scheduling algorithm, and the disks and network service tasks as they arrive. The service time distribution for the CPU and disks is uniform.
This algorithm bases its process placement decisions upon information about the distribution of work in the system. Periodically, a token is entered into the system, collecting workload measurements on every host and distributing these measurements to every host. The measurements it collects are the queue length of each server, the percentage of use, and the number of jobs using each resource. When a request for execution of a job arrives in the distributed system, the execution site is selected to balance the load on all hosts. If a remote host is selected as the execution site, a message is sent to the remote host telling it to start the job. Otherwise, the job is started on the local host.
The Workload Model
For the simulation used to determine the effectiveness of the load balancing algorithm, three types of workloads were used to cover the range of system service requirements of a real system. The three workload types simulated were I/O bound, CPU bound, and a combination of CPU and I/O bound. There were eight process types in the simulation. Each process utilized some portion of the CPU, disk and network. Some of the processes were, like the workload models, CPU bound, I/O bound or some combination of both. The system was simulated by determining the sequence of servers that each job type would visit and specifying the servers to accommodate the chains.
The algorithm that chooses the execution site and the read site uses vectors of workloads and host characteristics for each host. For each possible selection site, a vector is constructed. Vectors are constructed such that the longer the vector, the worse the selection choice; the host with the shortest vector is chosen. All hosts are included in the selection of an execution site. When choosing a selection site for the process, the load balancing algorithm considers the workload characterization and seeks to minimize system work. The workload characterization is determined by the system data collected by the token which was passed through the system. For process placement, the load balancing algorithm uses CPU utilization information from each processor. There are three considerations for process placement used by the load balancing algorithm: "is the host being considered for job placement the host that requested the job?", "is the job accessing a file, and is the file stored at the host being considered?" and "is the job interactive, and is the terminal at the host being considered?". All these considerations go towards determining optimum process placement. A weight is associated with each parameter, CPU utilization and each of the three process placement considerations, when calculating the optimal site placement.
[Figure 9.2: The model of a host and of a distributed system. (a) The model of a host: a CPU with high- and low-priority network request queues, a disk and terminals 1..n, with job arrivals and departures. (b) The model of a distributed system: hosts 1..n connected by a network.]
Some characteristics weigh more heavily in determining the optimal process placement [32].
The algorithm that chooses the read site placement considers the workload characterization and seeks to minimize system work in a way similar to the algorithm that chooses the selection site. The workload characteristics for read site placement are the disk queue length, the disk utilization and the number of jobs accessing a file on the disk. The only system work minimization characteristic used is "is the needed file stored locally?".
The dimensions of the vectors correspond to factors that indicate the optimality of that site for selection. These dimensions are scaled with weights used to tune the algorithm, to reflect the relative importance of the dimensions and to allow for differences in the ranges of the measurements. The tunable vector dimensions make it possible to explore the factors and their significance in the process placement decision.
Given all the information collected about the system and the availability and utilization of local resources for each processor, the host selection algorithm is as follows [32]:
1. For every host h being considered as a placement choice:
2.   For every workload characteristic being considered:
3.     w(h) = w(h) + (weight for workload characteristic) × (workload characteristic)^2
4.   For every work minimization characteristic being considered:
5.     If the host being considered does not meet the work minimization condition, then
6.       w(h) = w(h) + (weight for work minimization condition)^2
7. Choose the host k such that w(k) = min over h of w(h).
Given the tunable nature of this algorithm, the authors experimented heavily to determine the best parameters for a balanced load.
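A small Python sketch of this selection rule is given below; the workload metrics, condition names and weights are hypothetical placeholders for the tunable parameters discussed above.

# Sketch: accumulate weighted squared workload terms plus penalties for unmet
# work-minimization conditions, and pick the host with the smallest total.
def select_host(hosts, workload_weights, condition_weights):
    """hosts: list of dicts with 'name', 'workload' (metric -> value) and
       'conditions' (condition -> True when the condition is met)."""
    scores = {}
    for h in hosts:
        w = 0.0
        for metric, weight in workload_weights.items():
            w += weight * h["workload"][metric] ** 2
        for cond, weight in condition_weights.items():
            if not h["conditions"][cond]:
                w += weight ** 2
        scores[h["name"]] = w
    return min(scores, key=scores.get)

hosts = [
    {"name": "h1", "workload": {"cpu_utilization": 0.9, "disk_queue": 2},
     "conditions": {"requested_here": True, "file_is_local": False}},
    {"name": "h2", "workload": {"cpu_utilization": 0.3, "disk_queue": 1},
     "conditions": {"requested_here": False, "file_is_local": True}},
]
print(select_host(hosts, {"cpu_utilization": 2.0, "disk_queue": 1.0},
                  {"requested_here": 1.5, "file_is_local": 2.0}))   # "h2" here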
An Adaptive Distributed Load Balancing Model
A multicomputer is represented by n processor nodes Ni, 0 ≤ i < n, interconnected by a network characterized by a distance matrix D = [dij], where dij gives the number of hops between nodes Ni and Nj. It is assumed that dii = 0 and dij = dji for all i and j. The immediate neighborhood of node Ni is defined by the subset Γi = { Nj | dij = 1 } (see Figure 9.3).
An adaptive model of distributed load balancing is shown in Figure 9.3. The host processor is connected to all nodes. The load index li of Ni is passed from each node Ni to the host, and the system load distribution L = { li | 0 ≤ i < n } is broadcast to all nodes on a periodic basis. All nodes maintain their own load balancing operations independently and report their load indices to the host on a regular basis.
At each node Ni, a sender-initiated load balancing method is used, where heavily loaded nodes initiate process migration.

[Figure 9.3: An adaptive load balancing model for a multicomputer with n processor nodes N0, ..., Nn-1 and a host processor. Each node Ni has an input Ii, an output Oi, an arrival rate λi and a service rate µi, and the nodes are connected by a message-passing interconnection network.]

The sender-initiated method has the
advantage of faster process migration, as soon as the load index of a processor node exceeds a certain threshold that is updated periodically according to the variation of the system load distribution. The distributed load balancing for each node Ni is represented by the queuing model shown in Figure 9.5.
Heuristic Process Migration Methods
Four heuristic methods for migrating processes are formally introduced below. These heuristics are used to invoke process migration, to update the threshold, and to choose the destination nodes for process migration. These methods are based on using the load distribution Lt, which varies from time to time. Two attributes (decision range and process migration heuristic) are used to distinguish the four process migration methods.
1. Localized Round-Robin (LRR) Method: each node Ni uses the average load among its immediate neighboring nodes to update the threshold and only migrates processes to the neighboring nodes. The Round-Robin discipline is used to select a candidate node for process migration.
2. Global Round-Robin (GRR) Method: each node Ni uses a globally determined threshold and migrates processes to any appropriate node in the system. The selection from the candidate list for process migration is based on the Round-Robin discipline. After receiving the load distribution Lt from the host, the global threshold is set to the system average load among all the nodes.
[Figure 9.4 depicts a node N1 with input I1, a service queue Q1, a migration queue Qm, a decision maker, and a server with arrival rate λ1 and service rate µ1, producing output O1.]
Figure 9.4: A queueing model for the dynamic load balancing at each distributed node.
[Figure 9.5 depicts four nodes N0–N3, each with a service queue and a migration queue managed by a decision maker (D) and a server (S).]
Figure 9.5: An example of the open network load balancing model (D: decision maker,
S: server)
the host, the global threshold is set to the system average load among all the nodes.
3. Localized Minimum Load (LML) Method The way to determine the threshold and to set up migration ports is the same as that in LRR. The difference between LML and LRR is in the policy used to select a destination node. At node Ni, a load table stores the load indices of its immediate neighbors, and the node with the minimum load index in the load table is selected as the destination node. After a process is migrated to the selected node, its load index in the load table is incremented accordingly.
4. Global Minimum Load (GML) Method The setup of the threshold and migration ports is the same as that in GRR, but the destination node is selected according to the LML method. That is, the node with the minimum load index in the global load table is chosen as the destination node.
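The four heuristics differ only in their decision range (immediate neighbors versus all nodes) and their selection discipline (Round-Robin versus minimum load). The sketch below illustrates the destination-selection step in Python; the node identifiers, load table, and function interface are illustrative assumptions, not the authors' implementation.

import itertools

def make_selector(method, node, neighbors, all_nodes):
    """Return a function that picks a destination for one process migration.

    method    : one of "LRR", "GRR", "LML", "GML"
    node      : identifier of the local node Ni
    neighbors : identifiers of the immediate neighbors of Ni (d_ij = 1)
    all_nodes : identifiers of every node in the system
    """
    # Decision range: local methods consider neighbors, global methods all other nodes.
    candidates = list(neighbors) if method in ("LRR", "LML") else [n for n in all_nodes if n != node]
    rr = itertools.cycle(candidates)            # round-robin iterator for LRR/GRR

    def select(load_table):
        """load_table maps node id -> load index known at Ni (from L_t)."""
        if method in ("LRR", "GRR"):
            return next(rr)                     # round-robin discipline
        # LML/GML: pick the candidate with the minimum load index,
        # then increment its entry to account for the migrated process.
        dest = min(candidates, key=lambda n: load_table[n])
        load_table[dest] += 1
        return dest

    return select

# Hypothetical use: node 0 with neighbors 1 and 2 in a 4-node system.
select = make_selector("LML", 0, [1, 2], [0, 1, 2, 3])
loads = {1: 5, 2: 3, 3: 1}
print(select(loads))   # -> 2 (the minimum-loaded neighbor)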
These heuristic load balancing methods perform well compared with no load balancing. The relative merits of the four methods are as follows. The LRR and LML methods are based on locality and have short migration distances among immediate neighbors. The GRR and GML methods are based on global state and may experience much longer migration distances. GRR and GML have better performance when the mean service time is long, say around one second, whereas LRR and LML are better when the mean service time is small. The communication and migration overhead does not affect performance much when the mean service time is large, but it does when the mean service time is small. Thus, a system with a long mean service time can use the global methods, while a system with short service-time jobs should use the local methods.
9.6 Existing Load Balancing Systems
9.6.1 Utopia or Load Sharing Facility (LSF) (experimental/practical system)
LSF is a general-purpose, dynamic, efficient, and scalable load sharing system in which heterogeneous hosts are harnessed together into a single system that makes the best use of the available resources.
The main properties of this system include its transparency to the user: the user does not suffer from the fact that his/her resources are being shared. In addition, users do not have to change their programs in order to use LSF or to use any resource that does not reside on their own machines.
Hierarchies of clusters are managed by master load information managers, which collect information and make it available to the hosts in their clusters. In addition, a load information manager (LIM) and a remote execution server (RES) reside on each machine and provide the user with a uniform, machine-independent interface to the operating system.
LSF supports running sequential or parallel jobs, which can be either interactive or batch jobs. The load sharing tool (lstool) allows users to write their own load sharing applications as shell scripts. In addition, LSF contains an application load sharing library (LSLIB) that users can use in developing their compiled programs and applications. LSF allows new distributed applications to be developed in which users either find the best machines for their jobs using the comprehensive load and resource information available through LSF, or leave it to LSF to perform the load balancing automatically and transparently for them. In addition, configuration parameters are available to system administrators to control the LSF environment.
The system has been used commercially and has proven to be very efficient and useful.
V-system
The V system uses a state-change-driven information policy. Each node broadcasts its state whenever its state changes significantly. State information consists of expected CPU and memory utilization and particulars about the machine itself, such as its processor type. The broadcast state information is cached by all the nodes.
The V system's selection policy selects only newly arrived tasks for transfer. Its relative transfer policy defines a node as a receiver if it is one of the M most lightly loaded nodes in the system, and as a sender if it is not. The decentralized location policy locates receivers as follows: when a task arrives, the local cache is consulted to construct the set containing the M most lightly loaded machines that can satisfy the task's requirements. If the local machine is one of the M machines, then the task is
scheduled locally. Otherwise a machine is chosen randomly from the set and is polled
to verify the correctness of the cached data. This random selection reduces the chance
that multiple machines will select the same remote machine for task execution. If the
cached data matches the machine's state (within a degree of accuracy), the polled
machine is selected for executing the task. Otherwise, the entry for the polled machine
is updated and the selection procedure is repeated. In practice the cache entries are
quite accurate, and more than three polls are seldom required. The V System's
load index is CPU utilization at a node. To measure CPU utilization, a background
process that periodically increments a counter is run at the lowest priority possible.
The counter is then polled to see what proportion of the CPU has been idle.
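A minimal sketch of this location policy is shown below. The poll() and satisfies() callbacks, the cache layout, and the values of M and the accuracy tolerance are assumptions made for illustration.

import random

M = 4                      # number of lightly loaded machines considered (assumed)
TOLERANCE = 0.10           # acceptable gap between cached and polled utilization (assumed)

def locate_receiver(cache, local, poll, satisfies):
    """Pick a machine for a newly arrived task, in the style of the V system.

    cache     : dict machine -> cached CPU utilization (from broadcast state)
    local     : name of the local machine
    poll      : function machine -> current CPU utilization (network poll)
    satisfies : function machine -> True if the machine meets the task's requirements
    """
    while True:
        # Build the set of the M most lightly loaded machines that can run the task.
        lightest = sorted((m for m in cache if satisfies(m)), key=cache.get)[:M]
        if local in lightest:
            return local                       # schedule the task locally
        m = random.choice(lightest)            # random choice reduces collisions
        current = poll(m)
        if abs(current - cache[m]) <= TOLERANCE:
            return m                           # cached data verified: use this machine
        cache[m] = current                     # stale entry: update it and retry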
Sprite
The Sprite system is targeted towards a workstation environment. Sprite uses a centralized, state-change-driven information policy. Each workstation, on becoming a receiver, notifies a central coordinator process. The location policy is also centralized: to locate a receiver, a workstation contacts the central coordinator process.
Sprite's selection policy is primarily manual. Tasks must be chosen by users for remote execution, and the workstation on which these tasks reside is identified as a sender. Since the Sprite system is targeted towards an environment in which workstations are individually owned, it must guarantee the availability of a workstation's resources to the workstation owner. Consequently, it evicts foreign tasks from a workstation whenever the owner wishes to use the workstation. During eviction, the selection policy is automatic, and Sprite selects only foreign tasks for eviction. The evicted tasks are returned to their home workstations.
In keeping with its selection policy, the transfer policy used in Sprite is not completely automated; it works as follows:
1. A workstation is automatically identified as a sender only when foreign tasks executing at that workstation must be evicted; for normal transfers, a node is identified as a sender manually and implicitly when the transfer is requested.
2. Workstations are identified as receivers only for transfers of tasks chosen by the users. A threshold policy decides that a workstation is a receiver when the workstation has had no keyboard or mouse input for at least thirty seconds and the number of active tasks is less than the number of processors at the workstation.
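A sketch of this receiver test, with hypothetical names for the workstation state, might look as follows.

IDLE_SECONDS = 30   # no keyboard or mouse input for at least thirty seconds

def is_receiver(seconds_since_input, active_tasks, num_processors):
    """Sprite-style threshold test: a workstation can receive foreign tasks
    only when its owner appears idle and it has spare processors."""
    return (seconds_since_input >= IDLE_SECONDS
            and active_tasks < num_processors)

# Example: idle for 45 s, 1 active task on a 2-processor workstation -> receiver.
print(is_receiver(45, 1, 2))   # True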
To promote fair allocation of computing resources, Sprite can evict a foreign process from a workstation to allow the workstation to be allocated to another foreign process under the following conditions: if the central coordinator cannot find an idle workstation for a remote execution request and it finds that a user has been allocated more than his fair share of workstations, then one of the heavy user's processes is evicted from a workstation. The evicted process may be transferred elsewhere if an idle workstation becomes available. For a parallelized version of the UNIX "make" utility, Sprite's designers have observed a speedup factor of five for a system containing 12 workstations.
Condor
Condor is concerned with scheduling long-running, CPU-intensive tasks (background tasks) only. Condor is designed for a workstation environment in which the total availability of a workstation's resources is guaranteed to the user logged in at the workstation console (the owner). Condor's selection and transfer policies are similar to Sprite's in that most transfers are manually initiated by users. Unlike Sprite,
however, Condor is centralized, with a workstation designated as a controller. To transfer a task, a user links it with a special system call library and places it in a local queue of background jobs. The controller's duty is to find idle workstations for these tasks. To accomplish this, Condor uses a periodic information policy. The controller polls each workstation at two-minute intervals. A workstation is considered idle only when its owner has not been active for at least 12.5 minutes. The controller queues information about background tasks. If it finds an idle workstation, it transfers a background task to that workstation. If a foreign background task is being served at a workstation, a local scheduler at that workstation checks for local activity from the owner every 30 seconds; if the owner has been active since the previous check, the local scheduler preempts the foreign task and saves its state. The task may be transferred later to an idle workstation if one is located by the controller.
Condor's scheduling scheme provides fair access to computing resources for both
heavy and light users. Fair allocation is managed by the "up-down" algorithm, under
which the controller maintains an index for each workstation. Initially the indexes
are set to zero. They are updated periodically in the following manner: whenever a
task submitted by a workstation is assigned to an idle workstation, the index of the
submitting workstation is increased. If, on the other hand, the task is not assigned
to an idle workstation, the index is decreased. The controller periodically checks to
see if any new foreign task is waiting for an idle workstation. If a task is waiting,
but no idle workstation is available and some foreign task from the lowest priority
workstation is running, then that foreign task is preempted and the freed workstation
is assigned to the new foreign task. The preempted foreign task is transferred back
to the workstation at which it originated.
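The sketch below illustrates the bookkeeping behind the up-down algorithm. The step size and the data structures are assumptions, since the text does not specify them; only the direction of the index updates and the choice of the preemption victim follow the description above.

class UpDown:
    """Per-workstation indices maintained by Condor's controller for fair allocation.
    A higher index means the workstation has consumed more remote capacity and
    therefore has lower priority when machines become scarce."""

    def __init__(self, workstations, step=1):
        self.index = {w: 0 for w in workstations}   # initially all indices are zero
        self.step = step                            # assumed increment/decrement amount

    def update(self, workstation, got_idle_machine):
        # Index rises when the workstation's task is assigned an idle machine,
        # and falls when its task waits without being assigned.
        if got_idle_machine:
            self.index[workstation] += self.step
        else:
            self.index[workstation] -= self.step

    def preemption_victim(self, running_remote):
        # Among workstations currently running foreign tasks, the one with the
        # highest index (lowest priority) has its foreign task preempted first.
        return max(running_remote, key=self.index.get)

# Example: A has been receiving idle machines, B has been waiting.
sched = UpDown(["A", "B"])
sched.update("A", got_idle_machine=True)
sched.update("B", got_idle_machine=False)
print(sched.preemption_victim(["A", "B"]))   # -> "A"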
Bibliography
[1] Barak A. and Shiloh A. A distributed load balancing policy for a multicomputer. Software - Practice and Experience, 15(09):901-913, Sep 1985.
[2] Tantawi A.N. and Towsley D. Optimal static load balancing on distributed computing systems. IEEE Transactions on Software Engineering, 17(2):133-140, Feb 1991.
[3] Rommel C.G. The probability of load balancing success in a homogeneous network. IEEE Transactions on Software Engineering, 17(09):922-933, Sep 1991.
[4] Eager D.L., Lazowska E.D., and Zahorjan J. Adaptive load sharing in homogeneous distributed systems. IEEE Transactions on Software Engineering, 12(05):662-675, May 1986.
[5] Chao-Wei Ou, Sanjay Ranka, and Geoffrey Fox. Fast mapping and remapping algorithm for irregular and adaptive problems. Technical report, July 1993.
[6] A. Pothen, H. Simon and K-P Liou 1990. Partitioning sparse matrices with
eigenvectors of graphs. SIAM J. Matrix Anal. Appl., 11, 3(July), 430-452.
[7] B. Hendrickson and R. Leland 1992. An Improved Spectral Graph Partitioning Algorithm for Mapping Parallel Computations. Sandia National Labs. SAND92-1460.
[8] Nashat Mansour. Physical Optimization Algorithms for Mapping Data to
Distributed-Memory Multiprocessors. PhD Thesis.
[9] Mansour, N., and Fox, G.C. 1991. A hybrid genetic algorithm for task allocation.
Proc. Int. Conf. Genetic Algorithms (July), 466-473.
[10] R. D. Williams 1991. Performance of dynamic load balancing algorithms for
unstructured mesh calculations. Concurrency: Practice and Experience, 3(5),
457-481
[11] Rahul Bhargava, Virinder Singh and Sanjay Ranka. A Modified Mean Field Annealing Algorithm for Task Graph Partitioning. Technical Report, under preparation.
[12] Douglis F. and Ousterhout J. Transparent process migration: design alternatives and the Sprite implementation. Software - Practice and Experience, 21(08):757-785, Aug 1991.
[13] Livny M. and Melman M. Load balancing in homogeneous broadcast distributed systems. Proceedings of the ACM Computer Network Performance Symposium, 11(01):47-55, Apr 1982.
[14] Kremien O. and Kramer J. Methodical analysis of adaptive load sharing algorithms. IEEE Transactions on Parallel and Distributed Systems, 3(06):747-760, Nov 1992.
[15] Krueger P. and Livny M. The diverse objectives of distributed scheduling policies. Proc. of Seventh Int'l Conf. on Distributed Computing Systems, 801:242-249, 1987.
[16] Krueger P. and Finkel R. An adaptive load balancing algorithm for a multicomputer. CS Dept., University of Wisconsin, Madison, Technical Report 539, Apr 1984.
[17] Zhou S. and Ferrari D. A trace driven simulation study of dynamic load balancing. IEEE Transactions on Software Engineering, 14(09):1327-1341, Sep 1988.
[18] Bokhari S.H. Dual processor scheduling with dynamic reassignment. IEEE Transactions on Software Engineering, 05(07):341-349, July 1979.
[19] Shivaratri N., Krueger P., and Singhal M. Load distributing for locally distributed systems. IEEE Computer, pages 33-44, Dec 1992.
[20] Kunz T. The influence of different workload descriptions on a heuristic load balancing scheme. IEEE Transactions on Software Engineering, 17(07):725-730, July 1991.
[21] Casavant T.L. and Kuhl J.G. Effects of response and stability on scheduling in distributed computing systems. IEEE Transactions on Software Engineering, 14(11):1578-1587, Nov 1988.
[22] Casavant T.L. and Kuhl J.G. A taxonomy of scheduling in general-purpose distributed computing systems. IEEE Transactions on Software Engineering, 14(02):141-154, Feb 1988.
[23] C. Gary Rommel, "The Probability of Load Balancing Success in a Homogeneous Network", IEEE Transactions on Software Engineering, vol. 17, pp. 922.
[24] K.W. Ross and D.D. Yao, "Optimal Load Balancing and Scheduling in a Distributed Computer System", Journal of the Association for Computing Machinery, vol. 38, pp. 676.
[25] R. K. Boel and J.H. Van Schuppen, "Distributed Routing for Load Balancing", Proceedings of the IEEE, vol. 77, pp. 212.
[26] A. Hac and T. J. Johnson, "Dynamic Load Balancing Through Process and Read-Site Placement in a Distributed System", AT&T Technical Journal, Sept./Oct. 1988, pp. 72.
[27] M. J. Berger and S. A. Bokhari, "A Partitioning Strategy for Non-uniform Problems on Multiprocessors", IEEE Transactions on Computers, vol. C-36, pp. 570-580, 1987.
[28] D. P. Bertsekas and J. N. Tsitsiklis, "Parallel and Distributed Algorithms", Prentice-Hall, Englewood Cliffs, NJ, 1989.
[29] Timothy Chou and Jacob Abraham, "Distributed Control of Computer Systems", IEEE Transactions on Computers, pp. 564-567, June 1986.
[30] George Cybenko, "Dynamic Load Balancing for Distributed Memory Multiprocessors", Journal of Parallel and Distributed Computing, pp. 279-301, October 1989.
[31] Kemal Efe and Bojan Groslej, "Minimizing Control Overheads in Adaptive Load Sharing", IEEE International Conference on Distributed Computing, pp. 307-315, 1989.
[32] Anna Hac and Theodore Johnson, "A Study of Dynamic Load Balancing in a Distributed System", ACM, pp. 348-356, February 1986.
[33] A. J. Harget and I. D. Johnson, "Load Balancing Algorithms in Loosely-Coupled Distributed Systems: a Survey", Distributed Computer Systems, pp. 85-107.
[34] F. Lin and R. Keller, "Gradient Model: A Demand-driven Load Balancing Algorithm", IEEE Proceedings of the 6th Conference on Distributed Computing, pp. 329-336, August 1986.
[35] Lionel Ni and Kai Hwang, "Optimal Load Balancing in a Multiple Processor System with Many Job Classes", IEEE Transactions on Software Engineering, pp. 491-496, May 1985.
[36] E. de Souza e Silva and M. Gerla, "Load Balancing in Distributed Systems with Multiple Classes and Site Constraints", Proc. Performance, pp. 17-34, 1984.
[37] Yang-Terng Wang and Robert Morris, "Load Sharing in Distributed Systems", IEEE Transactions on Computers, pp. 204-217, March 1985.
[38] A. N. Tantawi and D. Towsley, "Optimal Static Load Balancing in Distributed Computer Systems", Journal of the ACM, vol. 32, no. 2, pp. 445-465.
[39] Jian Xu and Kai Hwang, "Heuristic Methods for Dynamic Load Balancing in a Message-Passing Supercomputer", IEEE, pp. 888-897, May 1990.
[40] J. Xu and K. Hwang, "Heuristic Methods for Dynamic Load Balancing in a Message-Passing Supercomputer".
[41] G. Skinner, J.M. Wrabetz, and L. Schreier, "Resource Management in a Distributed Internetwork Environment".
[42] A. Hac and T. J. Johnson, "A Study of Dynamic Load Balancing in a Distributed System".
[43] T.L. Casavant and J. G. Kuhl, "A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems".
Chapter 6
Distributed File Systems
Chapter Objectives
A file system is a subsystem of an operating system whose purpose is to organize,
retrieve, store and allow sharing of data files. A distributed file system is a distributed
implementation of the classical time-sharing model of a file system, where multiple users
who are geographically dispersed share files and storage resources. Accordingly, the file
service activity in a distributed system has to be carried out across the network, and
instead of a single centralized data repository there are multiple and independent storage
devices. The objectives of this chapter are to study the design issues and the different
implementations of distributed file systems. In addition, we give an overview of the
architecture and implementation techniques of some well known distributed file systems
such as SUN Network File System (NFS), Andrew File System, and Coda.
Keywords:
NFS, AFS, CACHE, FILE CACHING, TRANSPARENCY, CONCURRENCY CONTROL, LOCUS, ETC…
6.1 Introduction
The file system is part of an operating system that provides the computing system with
the ability to permanently store, retrieve, share, and manipulate stored data. In addition,
the file system might provide other important features such as automatic backup and
recovery, user mobility, and diskless workstations. The file system can be viewed as a
system that provides users (clients) with a set of services. A service is a software entity
running on a single machine [levy, 1990]. A server is the machine that runs the service.
Consequently, the file system service is accessed by clients or users through a well-defined set of file operations (e.g., create, delete, read, and write). The server is the
computer system and its storage devices (disks and tapes) on which files are stored and
from which they are retrieved according to the client requests. The UNIX time-sharing file system is an example of a conventional centralized file system. A Distributed File System (DFS) is a distributed implementation of the traditional time-sharing model of a file system that enables users to store and access remote files in a similar way to local files. Consequently,
the clients, servers, and storage devices of a distributed file system are geographically
dispersed among the machines of a distributed system.
The file system design issues have experienced changes similar to the changes observed
in operating system design issues. These changes were mainly with respect to the number
of processes and users that can be supported by the system. Based on the number of
processes and users, file systems can be classified into four types [Mullender, 1990]: 1)
single-user/single-process file system; 2) single-user/multiple-processes file system; 3)
multiple-users/multiple-processes centralized time-sharing file system; and 4) multiple-users/multiple-processes geographically distributed file system.
The design issues in single-user/single-process file system include how to name files,
how to allocate files to physical storage, how to perform file operations, and how to
maintain the file system consistency against hardware and software failures.
When we move to a single-user/multiple-processes file system, we need to address
concurrency control issues and how to detect and avoid deadlock situations that result
from sharing resources. These issues become even more involved when we move to a
multiple-users/multiple-processes file system. In this system, we do need to address all
the issues related to multiple concurrent processes as well as those related to how to
protect and secure user processes (security). The main security issues include user
identification and authentication. In the most general type (multiple-users/multiple-processes geographically distributed file system), the file system is implemented using a
set of geographically dispersed file servers. The design issues here are more complex and
challenging because the servers, clients, and network(s) that connect them are typically
heterogeneous and operate asynchronously. In this type, which we refer to as a distributed file system, the file system services need to provide access and name transparency, fault tolerance, high availability, security, and high performance. The design of a distributed file system that supports all of these features in an efficient and cost-effective way is a challenging research problem.
6.2 File System Characteristics and Requirements
The client applications and their file system requirements vary from one application to
another. Some applications run on only one type of computer, while others run on a cluster of computers. Each application type places different requirements on the file system. One could characterize the application requirements for a file system in terms of the file system role, file access granularity, file type, protection, fault-tolerance and recovery
[svobodova].
• File System Role. The file system role can be viewed in terms of two extremes: 1)
Storing Device, and 2) Full-scale filing system. The Storing device appears to the
users as a virtual disk that is mainly concerned with storage allocation, maintenance
of data objects on storage medium, and data transfer between the network and the
medium. The full-scale filing system provides all the services offered by the storing
device and additional functions such as controlling concurrent file accesses,
protecting user files, enforcing required access control, and directory service that
maps textual file names into file identifiers recognized by the system software.
• File Access Granularity. There are three main granularities to access data from the file system: 1) File-Level Storage and Retrieval, 2) Page (Block)-Level Access, and 3) Byte-Level Access. The file access granularity significantly affects the latency and sustainable file access rate. The appropriate access granularity depends on the type of applications and system requirements; some applications need the file system to support bulk transfer of whole files, and thus the appropriate access mode is the file-level mode, while other applications require efficient random access to small parts within a file, and thus the appropriate access mode could be byte-level.
• File Access Type. The file access mode can be broadly defined in terms of two access models: the Upload/Download Model and the Remote Access Model. In the Upload/Download Model, when a client reads a file, the entire file is moved to the client's host before the read operation is carried out. When the client is done with the file, it is sent back to the server for storage. The advantage of this approach is that once the file is loaded in the client's memory, all file accesses are performed locally without the need to access the network. However, the disadvantages of this approach are twofold: 1) it increases the load on the network due to downloading/uploading entire files, and 2) the client computer might not have enough memory to hold large files. Consequently, this approach limits the size of the files that can be accessed. Furthermore, experimental results showed that the majority of file accesses read only a few bytes and then close the file; the life cycle of most files is within a few seconds [zip parallel file system]. In the Remote Access Model, each file access is carried out by a request sent through the network to the appropriate server. The advantages of this approach are that 1) users do not need large local storage in order to access the required files, and 2) the messages are small and can be handled efficiently by the network.
• Transparency. Ideally, a distributed file system (DFS) should look to its clients like a conventional, centralized file system. That is, the multiplicity and dispersion of servers and storage devices should be transparent to the clients. Transparency measures the system's ability to hide the geographic separation and heterogeneity of resources from the user and the application programmer so that the system is perceived as a whole rather than as a collection of independent resources. The cost of implementing full transparency is prohibitive and challenging. Instead, several weaker types of transparency have been introduced, such as network transparency and mobile transparency.
• Network transparency: Network transparency allows clients/users to access remote files using the same set of operations used to access local files. That means accessing remote and local files becomes indistinguishable to users and applications. However, the time it takes to access remote files is longer because of the network delay.
• Mobile Transparency: This transparency defines the ability of the distributed file system to allow users to log in to any machine available in the system, regardless of the users' locations; that means the system does not force users to log in to specific machines. This transparency facilitates user mobility by bringing the users' environment (e.g., home directory) to wherever they log in.
• Performance. In a centralized file system, the time it takes to access a file depends on the disk access time and the CPU processing time. In a distributed file system, a remote file access has two more factors to be considered: the time it takes to transfer and process the file request at the remote server and the time it takes to deliver the requested data from the server to the client/user. Furthermore, there is also the overhead associated with running the communication protocol on the client and server computers. The performance of a distributed file system can be interpreted as another dimension of its transparency; the performance of remote file access should be comparable to local file access [levy, 1990].
• Fault-tolerance. A distributed file system is considered fault-tolerant if it can continue providing its services in a degraded mode when one or more of its components experience failures. The failures could be due to communication faults, machine failures, storage device crashes, and decays of storage media. The degradation can be in performance, functionality, or both. The fault-tolerance property is achieved by using redundant resources and complex data structures (transactions). In addition to redundancy, atomic operations and immutable files (files that can only be read but not written) have been used to guarantee the integrity of the file system and facilitate fault recovery.
• Scalability. It measures the ability of the system to adapt to increased load and/or to the addition of heterogeneous resources. The distributed file system performance should degrade gracefully (moderately) as the system load and network traffic increase. In addition, the addition of new resources should be smooth, with little overhead (e.g., adding new machines should not clog the network and increase file access time).
It is important to emphasize that it is the distribution property of a distributed file system, with its inherent multiplicity of redundant resources, that makes the system fault-tolerant and scalable. Furthermore, the geographic dispersion of the system resources and its activities must be hidden from the users and made transparent. Because of these characteristics, the design of a distributed file system is more complicated than the design of a file system for a single-processor system. Consequently, the main issues to be emphasized in this chapter are related to transparency, fault-tolerance, and scalability.
6.3 File Model And Organization
The file model addresses the issues related to how the file should be represented and what
types of operations can be performed on the file. The types of files range from
unstructured byte stream files to highly structured files. For example, a file can be
structured as a sequence of records or just simply a sequence of byte streams.
Consequently, different file systems have different models and operations that can be
performed on their files. Some file systems provide a single file model, such as a byte
stream as in the UNIX file system. Other file systems provide several file types (e.g.,
Indexed Sequential Access Method (ISAM) and record files in the VMS file system). Hence, a bitmap image would be stored as a sequence of bytes in the UNIX file system, while it might be stored as a one- or two-record file in the VMS file system.
The organization of the file systems can be described in terms of three modules or
services (see Figure 6.1): 1) Directory Service, 2) File Service, and 3) Block Service.
These services can be implemented as individually independent co-operative components
or all integrated into one software component. In what follows, we review the design issues related to each of these three modules.
[Figure 6.1 depicts a client computer running application programs and a client module, which communicates with a server computer that provides the directory service and the flat file service.]
Figure 6.1. File service architecture
6.3.1 Directory Service
The naming and mapping of files are provided by the directory service, which is an
abstraction used for mapping between text names and file identifiers. The structure of the
directory module is system dependent. Some systems combine directory and file
services into a single server that handles all the directory operations and file calls. Others
keep them separate, hence opening a file requires going to the directory server to map its
symbolic name onto its binary name and then passing the binary name to the file server to
actually read or write the file. In addition to naming files, the directory service controls
file access using two techniques: capability-based and identity-based techniques.
1. Capability-based approach: It is based on using a reference or name that acts as a
token or a key to access each file. A process is allowed to access a file when it
possesses the appropriate capability. For example, a Unique File Identifier (UFID)
can be used as a key or capability to maintain the protection against any
unauthorized access.
2. Identity-based approach: This approach requires each file to have an associated list that shows the users and the operations they are entitled to perform on that file.
The file server checks the identity of each entity requesting a file access to
determine if the requested file operation can be granted based on the user's access
rights.
6.3.2 File Service
The file service is concerned with issues related to file operations, access modes, file state
and how it is maintained during file operations, file caching techniques, file sharing, and
file replication.
• File Operation: The file operations include open, close, read, write, delete, etc. The properties of these operations define and characterize the file service type. Certain applications might require all file operations to be atomic; an atomic operation either completes successfully from beginning to end or is aborted without any side effect on the system state. Other applications require files to be immutable; that means files cannot be modified once they are created, and therefore the set of allowed operations includes read and delete but not write. Immutable files are typically used to simplify fault recovery and consistency algorithms.
• File State: This issue is concerned with whether or not file, directory, and other servers should maintain state information about clients. Based on this issue, the file service can be classified into two types: Stateful and Stateless File Service.
1. Stateful File Service: A file server is stateful when it keeps information on its clients' state and then uses this information to process client file requests. This service can be characterized by the existence of a virtual circuit between the client and the server during a file access session. The advantage of stateful service is performance; file information is cached in main memory and can be easily accessed, thereby saving disk accesses. The disadvantage of stateful service is that state information is lost when the server crashes, and this complicates the fault recovery of the file server.
2. Stateless File Service: The server is stateless when it does not maintain any information about a client once it has finished processing its file requests. A stateless server avoids keeping state information by making each request self-contained. That is, each request identifies the file and the position of the read/write operation, so there is no need for the Open/Close operations that are required in a stateful file service.
The distinction between stateful and stateless service becomes evident when considering the effects of a crash during a service activity. When the server crashes, a stateful server usually restores its state by following an appropriate recovery protocol. A stateless server avoids this problem altogether by making every file request self-contained, so that it can be processed by any newly reincarnated server. In the other direction, when a client fails, a stateful server needs to become aware of such failures in order to reclaim the space allocated to record the state of the crashed clients. On the contrary, no obsolete state needs to be cleaned up on the server side in a stateless file service. The penalty for using the stateless file service is longer request messages and slower processing of file requests, since there is no in-core information to speed up the handling of these requests.
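To make the contrast concrete, the sketch below shows what a self-contained (stateless) read request might carry; the field names and request shape are illustrative assumptions, not the protocol of any particular system.

from dataclasses import dataclass

@dataclass
class StatelessRead:
    # Every request names the file and position explicitly, so any (re)incarnated
    # server can process it with no prior per-client context.
    file_id: int      # e.g., a UFID
    offset: int       # byte position of the read
    length: int       # number of bytes requested

def stateless_server_read(storage, req: StatelessRead) -> bytes:
    data = storage[req.file_id]          # storage: dict file_id -> file contents
    return data[req.offset:req.offset + req.length]

# A stateful design would instead keep a per-client table of open files
# (file handle -> current position) that is lost if the server crashes.
storage = {42: b"hello, distributed file systems"}
print(stateless_server_read(storage, StatelessRead(file_id=42, offset=7, length=11)))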
The file service issues related to caching, replication, sharing, and tolerance will be
discussed in further detail next.
6.3.3 Block Service
The block service addresses the issues related to disk block operations and the allocation
techniques. The block operations can be implemented either as a software module
embedded within the file service or as a separate service. In some systems a network disk server (e.g., the Sun UNIX operating system) provides access to remote disk blocks for swapping and paging by diskless workstations. Separating the block service from the file service offers two advantages: 1) it separates the implementation of the file service from disk-specific optimizations and other hardware concerns, which allows the file service to use a variety of disks and other storage media; and 2) it supports several different file systems that can be implemented using the same underlying block service.
6.4 Naming and Transparency
In a DFS, a user refers to a file by a textual name. Naming is the means of mapping
textual names (logical names) to physical devices. There is a multilevel mapping from
this name to the actual blocks in a disk at some location, which hides from the user the
details of how and where the file is located in the network. Furthermore, to improve file
system availability and fault-tolerance, the files can be stored on multiple file servers. In
this case, the mapping between a logical file name and the actual physical name returns
multiple physical locations that contain a replica of the logical file name. The mapping
task in a distributed file system is more complicated than a centralized file system
because of the geographic dispersion of file servers and storage devices.
The most naive approach to name files is to append the local file name to the host name
at which the file is stored as is done in the VMS operating system. This scheme
guarantees that all files have unique names even without consulting other file servers.
However, the main disadvantage of this approach is that files cannot be migrated from one system to another; if a file must be migrated, its name needs to be changed and, furthermore, all the file users must be notified. In this subsection, we will discuss
transparency issues related to naming files, naming techniques and implementation
issues.
6.4.1 Transparency Support
Ideally, a distributed file system (DFS) should look to its clients like a conventional,
centralized file system. That is, the multiplicity and dispersion of servers and storage
devices should be transparent to the clients. Transparency measures the system ability to
hide the geographic separation and heterogeneity of resources from the user and the
application programmer, so that the system is perceived as a whole rather than as a collection
of independent resources. The cost of implementing full transparency is prohibitive and challenging. Instead, several weaker types of transparency have been introduced. We discuss transparency issues that attempt to hide
location, network, and mobility. In addition, we address the transparency issues related to
file names and how to interpret them in a distributed computing environment.
• Naming transparency: Naming is a mapping between logical and physical objects.
Usually, a user refers to a file by a textual name. The latter is mapped to a lower-level
numerical identifier, which in turn is mapped to disk blocks. This multilevel mapping
provides users with an abstraction of file that hides the detail of how and where the
file is actually stored on the disk. In a transparent DFS, a new dimension is added to the abstraction: that of hiding where in the network the file is located. The naming transparency can be interpreted in terms of two notions:
1. Location Transparency. The name of a file does not reveal its physical storage location.
2. Location Independence. The name of a file need not be changed when the file's physical storage location changes.
A location-independent naming scheme is a dynamic mapping, since it can map the same file name to different locations at different instances of time. Therefore, location independence is a stronger property than location transparency. When referring to location independence, one implicitly assumes that the movement of files is totally transparent to users. That is, files are migrated by the system without the users being aware of it.
In practice, most of the current file systems (e.g., Locus, Sprite) provide a static,
location transparent mapping for user-level names. Only Andrew and some
experimental file systems support location independence.
• Network transparency: Clients need to access remote files using the same set of
commands used to access local files; that means there is no difference in the
commands used to access local and remote files. However, the time it takes to access
remote files will be longer because of the network delay. Network transparency hides
the differences between accessing remote and local files so they become
indistinguishable by users and applications.
• Mobile Transparency: This transparency defines the ability of the distributed file system to allow users to log in to any machine available in the system, regardless of the users' locations; that means the system does not force users to log in to specific
machines. This transparency facilitates user mobility by bringing the users'
environment to wherever they log in.
The file names should not reveal any information about the location of the files and
furthermore their names should not be changed when the files are moved from one
storage location to another. Consequently, we can define two types of naming
transparencies: Location Transparency and Location Independence.
• Location Transparency. The name of a file does not reveal any information about its
physical storage location. In location transparency, the file name is statically mapped
into a set of physical disk blocks, though hidden from the users. It provides users with
the ability to share remote files as if they were local. However, sharing the storage is
complicated because the file name is statically mapped to certain physical storage
devices. Most of the current file systems (e.g., NFS, Locus, Sprite) provide location
transparent mapping for file names [levy, 1990].
• Location Independence. The name of a file need not be changed when it is required to change the file's physical location. A location-independent naming scheme can be viewed as a dynamic mapping, since it can map the same file name to different locations at different instances of time. Therefore, location independence is a stronger property than location transparency. When referring to location independence, one implicitly assumes that the movement of files is totally transparent to users. That is, files are migrated by the system, to improve the overall performance of the distributed file system by balancing the loads on its file servers, without the users being aware of it. Only a few distributed file systems support location independence (e.g., Andrew and some experimental file systems).
6.4.2 Implementation Mechanisms
We will review the main techniques used to implement naming schemes in distributed file systems, such as pathname translation, mount mechanisms, unique file identifiers, and hints.
• Pathname Translation: In this method, any logical file is defined by a path name (e.g., /level1/level2/filename), which is translated by recursively looking up the low-level identifier of each directory in the path, starting from the root (/). If the identifier indicates that the sub-directory (or the file) is located on another machine, the lookup procedure is forwarded to that machine. This continues until the machine that stores the requested file has been identified. This machine then returns to the requesting client the low-level identifier of that file in its local file system. In some DFSs, such as NFS and Sprite, the request for the file name lookup is passed on from one server to another until the server that stores the requested file is found. In the Andrew file system, each step of the lookup procedure is performed by the client. This option is more scalable because the servers are relieved from performing the lookup procedure needed to translate client file access requests.
• Mount Mechanisms: This scheme provides the means to attach remote file systems (or directories) to a local name space via a mount mechanism, as in Sun's NFS. Once a remote directory is mounted, its files can be named independently of the files' location. This approach enables all the machines on the network to specify the part of the file name space (such as executables and home directories of users) that can be shared with other machines while at the same time keeping machine-specific directories local. Consequently, each user can access local and remote files according to his or her naming tree. However, this tree might be different from one computer to another, and thus accessing any file is not independent of the location from which the file request is issued.
• Unique File Identifier: In this approach, there is a single global name space that is visible to all machines and spans all the files in the system. Consequently, all files are accessed using a single global name space. This approach assigns each file a Unique File Identifier (UFID) that is used to address any file in the system regardless of its location. In this method, every file is associated with a component unit; all files in a component unit are located on the same storage site. Any file name is translated into a UFID that has two fields. The first field identifies the component unit to which the file belongs. The second field is a low-level identifier into the file system of that component unit. At run time, a table indicating the physical location of each component unit is maintained. Note that this method is truly location independent, since files are associated with component units whose actual location is unspecified, except at bind-time.
There are a number of ways to ensure the uniqueness of the UFIDs associated with different files. [Needham and Herbert, 1982; Mullender, 1985; Leach, 1983] all emphasized using a relatively large, sparsely populated identifier space to generate UFIDs. To achieve uniqueness, we can concatenate a number of identifying values and/or a random number for further security. This can be done by concatenating the host address of the server creating the file with a number representing the position of the UFID in the chronological sequence of UFIDs created by that server. An extra field containing a random number is embedded into each UFID in order to combat any possible attempt at counterfeiting. This ensures that the distribution of valid UFIDs is sparse; also, the UFID is long enough to make unauthorized access practically impossible. Figure 6.2 shows the format of a UFID that is represented as a 12-bit record.
[Figure 6.2 depicts a table that maps UFIDs (1, 2, 3, …) to file control blocks (FCBs) recording the location on disk, the size, and other attributes of each file.]
Figure 6.2. This figure shows the UFID with a long identification number to prevent unauthorized access.
In the above format, the server identifier is considered to be an internet address, ensuring uniqueness across all registered internet-based systems. Access control to a file is based upon the fact that a UFID constitutes a 'key' or capability to access that file. In effect, access control is a matter of denying UFIDs to unauthorized clients. When a file is shared within a group, the owner of the file holds all the rights on the file, i.e., he/she can perform all types of operations, whether read, write, delete, truncate, etc. The other members of the group hold lesser rights on the file; e.g., they can only read the file but are not authorized to perform the other operations. A more refined form of access control can be obtained by embedding a permission field in the UFID, which encodes the access rights that the UFID confers upon its possessor. The permission field must be combined with the random part of the UFID before giving the UFID to users. The permission field should be used carefully so that it is not easily accessible; otherwise the access rights could easily be changed, e.g., from read to write. Whenever a file is created, the UFID is returned to the owner (creator) of the file. When the owner gives access rights to other users, some rights must be taken away to restrict the capabilities of the other users, by a function in the file server meant for restricting capabilities. The two different ways suggested in [Coulouris and Dollimore, 1988] by which a file server may hide its permission field are as follows:
1. The permission field and the random part are encrypted with a secret key before the UFID is issued to clients. When clients present UFIDs for file access, the file server uses the secret key to decrypt them.
2. The file server may encrypt the two fields by using a one-way function to produce
UFIDs issued to clients. When clients present the UFIDs for file access, the file
server applies the one-way function to its copy of the UFID and compares the result
with the client's UFID.
• Hints: A hint is a technique often used for quick translation of file names. A hint is a piece of information that directly gives the location of the requested file; it speeds up performance if it is correct, yet it does not cause any semantically negative effect if it is incorrect. Since looking up path names is a time-consuming procedure, especially when multiple directory servers are involved, some systems attempt to improve their performance by maintaining a cache of hints. When a file is opened, the cache is checked to see if the path name is there. If so, the directory-by-directory lookup is skipped and the binary address is taken from the cache.
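A minimal sketch of such a hint cache is given below; the lookup_by_traversal fallback, the verify check, and the cache layout are assumptions made for illustration.

class HintCache:
    """Cache of pathname -> (server, low-level id) hints used to skip the
    directory-by-directory lookup; a wrong hint is detected and replaced."""

    def __init__(self, lookup_by_traversal):
        self.hints = {}                             # path -> (server, file_id)
        self.lookup = lookup_by_traversal           # slow, authoritative path translation

    def open(self, path, verify):
        hint = self.hints.get(path)
        if hint is not None and verify(*hint):      # hint present and still correct
            return hint
        # Missing or incorrect hint: fall back to the full traversal, then remember it.
        hint = self.lookup(path)
        self.hints[path] = hint
        return hint

# Hypothetical use: cache = HintCache(lookup_by_traversal=resolve_path)
#                   server, fid = cache.open("/level1/level2/filename", verify=server_has_file)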
6.5 File Sharing Semantics
The semantics of file sharing are an important criterion for evaluating any file system that allows multiple clients to share files. When two or more users share the same file, it is
necessary to define the semantics of reading and writing to avoid problems such as data
inconsistency or deadlock. The most common types of sharing semantics are 1) Unix
Semantics, 2) Session Semantics, 3) Immutable Shared Semantics, and 4) Transaction
Semantics.
6.5.1 UNIX Semantics
Under these semantics, writes to an open file by a client are visible immediately to other (possibly remote) clients who have this file open at the same time. When a READ operation follows a WRITE operation, the READ returns the value just written.
Similarly, when two WRITEs happen in quick succession, followed by a READ, the
value read is the value stored by the last write. It's possible for clients to share the pointer
to the current file location. Thus advancing the pointer by one client affects all sharing
clients. The system enforces an absolute time ordering on all operations and always
returns the most recent value. In a distributed system, UNIX semantics can be easily
achieved as long as there is only one file server and clients do not cache files; all READs
and WRITEs go directly to the file server, which processes them strictly sequentially.
This approach gives UNIX semantics. The sharing of the location pointer is needed
primarily for compatibility of the distributed UNIX system with conventional UNIX
software. In practice, the performance of a distributed system in which all file requests
must be processed by a single server is frequently poor. This problem is often solved by
allowing clients to maintain local copies of heavily used files in their private caches.
6.5.2 Session Semantics
Under these semantics, writes to an opened file are visible immediately to local clients but invisible to remote clients who have opened the same file simultaneously. Once a file is closed, the changes made to it are visible only in sessions starting later; these changes are not reflected in instances of the file that are already open. Using session semantics raises the question of what happens if two or more clients are simultaneously caching and modifying the same file. When each file is closed, its value is sent back to the server; the client that closes last overwrites the previous write operations, and thus the updates of the previous clients are lost.
A good example of this is the yellow pages. Every year, the phone company produces one telephone book that lists business and customer numbers. It is a database that is updated once a year; the granularity is on an annual basis. The yellow pages are not accurate during the year because they are updated only at the end of the session. Accuracy in such examples is not a big issue because most customers search for some business in the area rather than for a specific business. The application, in this case, trades some accuracy for simplicity.
6.5.3 Immutable Shared File Semantics
These semantics state that a shared file can only be opened for reading. That is, once a file is created and declared as a shared file by its creator, it cannot be modified any more; clients cannot open such a file for writing. If a client has to modify a file, it creates an entirely new file and enters it into the directory under the name of the previously existing file, which then becomes inaccessible. Just as with session semantics, when two processes try to replace the same file at the same time, either the latest one or, non-deterministically, one of them is chosen as the new file. This approach makes the file implementation quite simple, since sharing is in read-only mode.
6.5.4 Transactions Semantics
These semantics indicate that the operations on a file or a group of files are performed indivisibly. This is done by having the process declare the beginning of the transaction using some type of BEGIN TRANSACTION primitive; this signals that what follows must be executed indivisibly. When the work has been completed, an END TRANSACTION primitive is executed. The key property of these semantics is that the system guarantees that all the calls contained within a transaction are carried out in order, without any interference from other concurrent transactions. If two or more transactions start at the same time, the system ensures that the final result is the same as if they were all run in some (undefined) sequential order.
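As a sketch of how such primitives might appear to a client, the snippet below wraps the BEGIN TRANSACTION and END TRANSACTION calls in a Python context manager; the names of the file service operations are hypothetical.

from contextlib import contextmanager

@contextmanager
def transaction(fs):
    """Group file operations so the file service executes them indivisibly.
    `fs` is assumed to expose begin_transaction / end_transaction / abort_transaction."""
    tid = fs.begin_transaction()        # BEGIN TRANSACTION primitive
    try:
        yield tid
        fs.end_transaction(tid)         # END TRANSACTION: commit all calls as one unit
    except Exception:
        fs.abort_transaction(tid)       # on failure, no partial effects remain visible
        raise

# Hypothetical use: move a record between two files atomically.
# with transaction(fs) as t:
#     fs.write(t, "accounts", offset=0, data=b"...")
#     fs.write(t, "audit-log", offset=1024, data=b"...")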
6.6 Fault Tolerance And Recovery
Fault tolerance is an important attribute of a distributed system that can be supported because of the inherent multiplicity of resources. There are many methods to improve the fault tolerance of a DFS. Improving availability and using redundant resources are two common techniques for improving the fault tolerance of a DFS.
6.6.1 Improving Availability
A file is called available if it can be accessed whenever needed, despite machine and storage device crashes and communication faults. Two related properties should be distinguished: a file is recoverable if it is possible to revert it to an earlier, consistent state when an operation on the file fails or is aborted by the client, and a file is robust if it is guaranteed to survive crashes of the storage device and decays of the storage medium. Availability is often confused with robustness, probably because both can be implemented by redundancy techniques; a robust file is guaranteed to survive failures, but it may not be available until the faulty component has recovered. Availability is a fragile and unstable property. First, it is temporal: availability varies as the system's state changes. Also, it is relative to a client: for one client a file may be available, whereas for another client on a different machine, the same file may be unavailable.
Replicating files can enhance availability [Thompson, 1931]; however, merely replicating files is not sufficient. Some principles intended to ensure increased availability of files are described below.
• The number of machines involved in a file operation should be minimal, since the probability of failure grows with the number of involved parties.
• Once a file has been located, there is no reason to involve machines other than the client and the server machines. Identifying the server that stores the file and establishing the client-server connection is more problematic. A file location mechanism is an important factor in determining the availability of files. Traditionally, locating a file is done by pathname traversal, which in a DFS may cross machine boundaries several times and hence involve more than two machines [Thompson, 1931]. In principle, most systems (e.g., Locus, Andrew) approach the problem by requiring that each component (i.e., directory) in the pathname be looked up directly by the client. Therefore, when machine boundaries are crossed, the server in the client-server pair changes, but the client remains the same.
• If a file is located by pathname traversal, the availability of a file depends on the
availability of all the directories in its pathname. A situation can arise whereby a file
might be available to reading and writing clients, but it cannot be located by new
clients since a directory in its pathname is unavailable. Replicating top-level
directories can partially rectify the problem and is indeed used in Locus to increase
the availability of files.
• Caching directory information can both speed up the pathname traversal and avoid the problem of unavailable directories in the pathname (i.e., if caching occurs before the directory in the pathname becomes unavailable). Andrew uses this technique. A better mechanism is used by Sprite. In Sprite, machines maintain prefix tables that map prefixes of pathnames to the servers that store the corresponding component units. Once a file in some component unit is open, all subsequent Opens of files within that same unit address the right server directly, without intermediate lookups at other servers. This mechanism is faster and guarantees better availability.
6.6.2 File Replication
Replication of files is a useful scheme for improving availability, reducing
communication traffic in a distributed system and improving response time. The
replication schemes can be classified into three categories: the primary-stand-by, the
modular redundancy, and the weighted voting [Yap, Jalote and Tripathi, 1988] and
[Bloch, Daniels and Spector, 1987].
• Primary-stand-by: One copy from the replicas is selected and designated as the primary copy, whereas the others are standbys. All subsequent requests are sent to the primary copy only. The standby copies are not responsible for the service and are only synchronized with the primary copy periodically. In case of failure, one of the standby copies is selected as the new primary copy, and the service goes on.
• Modular Redundancy: This approach makes no distinction between the primary copy and the standby ones. Requests are sent to all the replicas simultaneously and are performed by all copies. Therefore, a file request can be processed regardless of failures in networks and servers, provided that there exists at least one accessible correct copy. This approach is costly because synchronization must be maintained among the replicas; when the number of replicas increases, availability decreases, since any update operation will lock all the replicas.
• Weighted Voting: In this scheme, all replicas of a file, called representatives, are assigned a certain number of votes. Access operations are performed on a collection of representatives, called an access quorum. Any access quorum which gathers a majority of the total votes of all the representatives is allowed to perform the access operation. Such a scheme has maximum flexibility, since the size of the access quorum can change under various conditions. On the other hand, it may be too complicated to be feasible in most practical implementations.
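To make the voting rules concrete, the following is a minimal sketch, in Python, of how a weighted-voting scheme might check that an access quorum carries enough votes. The names (Replica, quorum_allowed) and the version field are hypothetical conveniences for illustration, not taken from any particular DFS.

```python
# A sketch of weighted voting over file replicas (illustrative only).
from dataclasses import dataclass

@dataclass
class Replica:
    host: str
    votes: int       # weight assigned to this representative
    version: int     # version of the copy it holds

def quorum_allowed(quorum, all_replicas, is_write, read_quorum, write_quorum):
    """Return True if the chosen access quorum gathers enough votes.

    The classic constraints read_quorum + write_quorum > total_votes and
    2 * write_quorum > total_votes guarantee that every read quorum overlaps
    every write quorum and that two writes cannot proceed disjointly.
    """
    total_votes = sum(r.votes for r in all_replicas)
    assert read_quorum + write_quorum > total_votes
    assert 2 * write_quorum > total_votes
    gathered = sum(r.votes for r in quorum)
    return gathered >= (write_quorum if is_write else read_quorum)

def latest_copy(quorum):
    # A read returns the data of the highest-versioned representative reached.
    return max(quorum, key=lambda r: r.version)
```

Because a read quorum always intersects the most recent write quorum, at least one representative in any valid read quorum holds the latest version, which is why the read can simply take the highest-versioned copy it reaches.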
A variant model [Chung, 1991], which combines the modular redundancy and primary-stand-by approaches, provides more flexibility with respect to system configuration. This model divides all copies of a file into several partitions. Each partition functions as a modular redundancy unit. One partition is selected as primary and the other partitions are backups. In this manner, it strikes a balance in the trade-off between the modular redundancy and primary-stand-by approaches. An important issue in file replication is how to determine the file replication level and the allocation of the replicated file copies necessary to achieve satisfactory system performance. There are three strategies for solving the file allocation problem (FAP).
• Static File Allocation: Intuitively, the replicas are fixed at specified sites. Based on an assessment of file access activity levels, costs, and system parameter values, the problem involves allocating file copies over a set of geographically dispersed computers so as to optimize an optimality criterion while satisfying a variety of system constraints. Static file allocations suit systems that have a stable level of file access intensities. The optimality objectives used in the past include system operating costs, transaction response time, and system throughput. Essentially, static file allocation problems are formulated as combinatorial optimization models whose goal captures the allocation trade-offs in terms of the selected optimality criterion. Investigations of static file allocation problems were pioneered by W.W. Chu [Chu, 1969]. Since the FAP is NP-complete, much attention has been given to the development of heuristics that can generate good allocations with lower computational complexity; branch-and-bound and graph searching methods are the typical solution techniques used to avoid enumerating the entire solution space (a simple greedy allocation sketch, under hypothetical assumptions, is given after this list).
• Dynamic File Allocation: If a file system is characterized by high variability in usage patterns, the use of static file allocation will degrade performance and increase cost over the operational period. Dynamic file allocation is based on anticipated changes in the file access intensities. Of course, the file reallocation costs incurred in this scheme have to be taken into consideration in the initial design process. The dynamic file allocation problem is one of determining file reallocation policies over time. File reallocations involve simultaneous creation, relocation, and deletion of file copies. Dynamic file allocation models can be classified as non-adaptive and adaptive. Initial research focused on non-adaptive models, while more recent studies have concentrated on adaptive policies. Recent research on adaptive models of the dynamic FAP achieves lower computational complexity by restricting reallocations to single-file reallocations only. To improve the applicability of the research results on dynamic FAPs, it is necessary to study the problem structure under realistic schemes for file relocation, in conjunction with effective control mechanisms, and to develop specialized heuristics for practical implementations.
• File Migration: This is also referred to as file mobility or location independence. The main difference between dynamic FAP and file migration lies in the operations used to change a file assignment. Dynamic file reallocation models rely heavily on prior usage patterns of the system database. File migrations are not very sensitive to prior estimates of system usage patterns; they automatically react to temporary changes in access intensity by making the necessary adjustments in file locations without human management or operator intervention.
Dynamic FAP considers file reallocations that might involve reallocating multiple replicas. These major changes could result in system-wide interruptions of information services. In the file migration problem, each migration operation deals with only a single file copy. The evaluation of file migration policies has been investigated by several researchers [Gavish, 1990]. Since file migration deals with only a single file copy, an individual file migration operation might be less effective than a complete file reallocation in improving system performance. However, selecting an optimal or near-optimal single operation is less complex than determining complete file reallocations. Therefore, file migration can be invoked more frequently, thereby responding to system changes more rapidly than file reallocation.
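As a concrete illustration of the heuristic flavor of static file allocation, the sketch below greedily places each replica at the site that most reduces total access cost under a per-site storage constraint. The cost model, parameter names, and data layout are hypothetical simplifications made for illustration; they are not Chu's formulation or any published algorithm.

```python
# A greedy heuristic sketch for static file allocation (hypothetical model).
# access_rate[s][f]: access rate of file f from site s
# comm_cost[s][t]:   cost per access from site s to a copy held at site t
# storage[t]:        free space at site t; size[f]: size of file f
def greedy_allocate(files, sites, access_rate, comm_cost, storage, size, copies):
    allocation = {f: [] for f in files}
    free = dict(storage)
    for f in files:
        for _ in range(copies[f]):
            candidates = [t for t in sites
                          if t not in allocation[f] and free[t] >= size[f]]
            if not candidates:
                break                       # storage constraint: stop early

            def cost_with(t):
                # Total cost if every site reads f from its cheapest chosen copy.
                chosen = allocation[f] + [t]
                return sum(access_rate[s][f] * min(comm_cost[s][c] for c in chosen)
                           for s in sites)

            best = min(candidates, key=cost_with)
            allocation[f].append(best)
            free[best] -= size[f]
    return allocation
```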
6.6.3 Recoverability
A file server should ensure that the files it holds remain accessible after a failure of the system. The effect of a failure in a distributed system is much more pervasive than in its centralized counterpart, because clients and servers may fail independently; there is therefore a greater need to design a server that can restore data after a system failure and protect it from permanent loss. In both conventional and distributed systems, disk hardware and driver software can be designed to ensure that if the system crashes during a block write operation, or during a data transfer occurring as part of a block transfer, partially written or incorrect data are detected.
The use of stable storage in XDFS is worth mentioning here. Stable storage is redundant storage for structural information, implemented as a separate abstraction on top of the block service operations. It is basically a means to protect data from permanent loss after a system failure during a disk write operation or after damage to any single disk block. Operations on stable blocks are implemented using two disk blocks which hold the content of each stable block in duplicate. This implementation was developed by Lampson [Lampson, 1981], who defined a set of operations on stable blocks that mirror the block service operations; the block pointers indicate that stable storage blocks are to be distinguished from ordinary blocks. Generally it is expected that the duplicates of a stable block are stored on two different disk drives, to ensure that both blocks are not damaged simultaneously by a single failure, so that each block acts as a back-up for the other. The invariant maintained for each pair of blocks is:
• Not more than one block of the pair is bad.
• If both are good, they hold the most recent data, except during the execution of a stable put.
The stable get operation reads one of the blocks using get block; the other representative is read only when an error condition is detected. If a server crashes or halts during a stable put, a recovery process is invoked when the server is restarted. The recovery procedure re-establishes the invariant by inspecting each pair of blocks and doing the following:
• Both good and the same: do nothing.
• Both good but different: copy one block of the pair to the other block of the pair.
• One good and one bad: copy the good block to the bad block.
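A minimal sketch of this stable-storage scheme is given below, assuming get_block, put_block, and is_good primitives supplied by an underlying block service; it illustrates the idea rather than the actual XDFS implementation.

```python
# A sketch of Lampson-style stable storage over a pair of ordinary blocks.
# get_block, put_block and is_good are assumed primitives of a block service.
def stable_put(pair, data, put_block):
    # Write the two representatives strictly one after the other, so a crash
    # can leave at most one block of the pair bad or stale.
    put_block(pair[0], data)
    put_block(pair[1], data)

def stable_get(pair, get_block, is_good):
    # Read one representative; read the other only on an error condition.
    block = get_block(pair[0])
    return block if is_good(block) else get_block(pair[1])

def recover(pair, get_block, put_block, is_good):
    # Invoked at server restart to re-establish the pair invariant.
    b0, b1 = get_block(pair[0]), get_block(pair[1])
    if is_good(b0) and is_good(b1):
        if b0 != b1:
            # A stable_put was interrupted between its two writes; copy one
            # block of the pair (here the first, which is written first) over
            # the other.
            put_block(pair[1], b0)
    elif is_good(b0):
        put_block(pair[1], b0)          # copy the good block to the bad block
    else:
        put_block(pair[0], b1)
```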
6.7 File Caching
Caching is a common technique used to reduce the time it takes for a computer to retrieve information. The term cache is derived from the French word cacher, meaning "to hide." Ideally, recently accessed information is stored in a cache so that a subsequent repeat access to that same information can be handled locally without additional access time or burdens on network traffic. When a request for information is made, the system's caching software takes the request, looks in the cache to see if the information is available and, if so, retrieves it directly from the cache. If it is not present in the cache, the file is retrieved directly from its source, returned to the user, and a copy is placed in cache storage. Caching has been applied to the retrieval of data from numerous secondary devices such as hard and floppy disks, computer RAM, and network servers.
Caching techniques are used to improve the performance of file access. The performance
gain that can be achieved depends heavily on the locality of references and on the
frequency of read and write operations. In a client-server system, where each machine has a main memory and a disk, there are four potential places to store (cache) files: the server's disk, the server's main memory, the client's disk, and the client's main memory.
The server's disk is the most straightforward place to store all files. Files on the server's
disk are accessible to all clients. Since there is only one copy of each file, no consistency
problems arise. However, the main drawback is performance. Before a client can read a
file, the file must first be transferred from the server's disk to the server's main memory,
and then transferred over the network to the client's main memory.
Figure 2. Four file storage structures: Case A, storage on the server's disk; Case B, storage in the server's main memory; Case C, storage in the client's main memory; Case D, storage on the client's disk.
Caches have been used in many operating systems to improve file system performance.
Repeated accesses to a block in the cache can be handled without involving the disk. This
feature has two advantages. First, caching reduces delays; a block in the cache can
usually be returned to a waiting process five to ten times more quickly than one that must
be fetched from the disk. Second, caching reduces contention for the disk arm, which
may be advantageous if several processes are attempting to simultaneously access files on the same disk. However, since main memory is invariably smaller than the disk, when the cache fills up some of the currently cached blocks must be replaced. If an up-to-date copy exists on the disk, the cache copy of the replaced block is simply discarded; otherwise, the disk is first updated before the cached copy is discarded.
A caching scheme in a distributed file system should address the following design
decisions:
• The granularity of cached data.
• The location of the client's cache (main memory or local disk).
• How to propagate modifications of cached copies.
• How to determine whether a client's cached data is consistent.
The choices for these decisions are intertwined and related to the selected sharing
semantics.
6.7.1 Cache Unit Size:
The unit of caching can be either pages (blocks) of a file or the entire file itself. For access patterns that have strong locality of reference, caching a large part of the file results in a high hit ratio, but at the same time the potential for consistency problems also increases. Furthermore, if the entire file is cached, it can be stored contiguously on the disk (or at least in several large chunks), allowing high-speed transfers between the disk and memory and thus improving performance. Caching entire files also offers other advantages, such as fault tolerance: remote failures are visible only at the time of open and close operations, which supports disconnected operation of clients that already have the file cached. Whole-file caching also simplifies cache management, since clients only have to keep track of files and not individual pages. However, caching entire files has two drawbacks. First, files larger than the local storage space (disk or main memory) cannot be cached. Second, the latency of open requests is proportional to the size of the file and can be intolerable for large files.
If parts (blocks) of a file are stored in the cache, the cache and disk space are used more efficiently. This scheme uses a read-ahead technique to read blocks from the server disk and buffer them on both the server and client sides before they are actually needed, in order to speed up reading. Increasing the caching unit size increases the likelihood that the data for the next access will be found locally (i.e., the hit ratio is increased); on the other hand, the time required for the data transfer and the potential for consistency problems are increased. Selecting the unit size of caching depends on the network transfer unit and the communication protocol being used.
Earlier versions of the Andrew file system (AFS-1 and AFS-2), Coda, and Amoeba cached entire files. AFS-3 uses partial-file caching, but its use has not demonstrated substantial advantages in usability or performance over the earlier versions. When caching is done at large granularity, considerable performance improvement can be obtained by the use of specialized bulk transfer protocols, which reduce the latency associated with transferring the entire file.
6.7.2 Cache Location:
The cache location can be either at the server side, the client side, or both. Furthermore, the cache can reside either in main memory or on disk. Server-side caching eliminates a disk access on each request, but it still requires using the network to reach the server. Caching at the client side can avoid using the network.
Disk caches have one clear advantage in reliability and scalability. Modifications to the cached data are not lost when the system crashes, and there is no need to fetch the data again during recovery. Disk caches also contribute to scalability by reducing network traffic and server loads after client crashes. In Andrew and Coda, the cache is on the local disk, with a further level of caching provided by the Unix kernel in main memory. On the other hand, caching in main memory has four advantages. First, main-memory caches permit workstations to be diskless, which makes them cheaper and quieter. Second, data can be accessed more quickly from a cache in main memory than from a cache on disk. Third, physical memories on client workstations are now large enough to provide high hit ratios. Fourth, the server caches will be in main memory regardless of where the client caches are located. Thus main-memory caches emphasize reduced access time, while disk caches emphasize increased reliability and autonomy of single machines. If the designers decide to put the cache in the client's main memory, three options are possible, as shown in Figure 3.
1. Caching within each process. The simplest way is to cache files directly inside the
address space of each user process. Typically, the cache is managed by the system
call library. As files are opened, closed, read, and written, the library simply keeps the
most heavily used ones around so that when a file is reused, it may already be
available. When the process exits, all modified files are written back to the server.
Although this scheme has an extremely low overhead, it is only effective if individual
processes open and close files repeatedly.
2. Caching in the kernel. The kernel can dynamically decide how much memory to reserve for programs and how much for the cache. The disadvantage here is that a kernel call is needed on all cache accesses, even on cache hits.
3. The cache manager as a user process. The advantage of a user-level cache manager is
that it keeps the microkernel operating system free of the file system code. In
addition, the cache manager is easier to program because it is completely isolated,
and is more flexible.
• Write Policy: The write policy determines the way modified cache blocks (dirty blocks) are written back to their files on the server. The write policy has a critical effect on the system's performance and reliability. There are three write policies: write-through, delayed-write, and write-on-close.
1. Write-through: the write-through policy is to write data through to the disk as soon
as it is placed in any cache. A write-through policy is equivalent to using remote
service for writes and exploiting caches for reads only. This policy has the advantage
of reliability, since little data is lost if the client crashes. However, this policy requires
each write access to wait until the information is written to the disk, which results in
poor write performance. Write-through and variations of delayed-write policies are used to implement UNIX-like sharing semantics.
2. Delayed-write: blocks are initially written only to the cache and then written through to the disk or server some time later. This policy has two advantages over write-through. First, since writes go to the cache, write accesses complete much more quickly. Second, data may be deleted before it is written back, in which case it need not be written at all. Thus a policy that delays writes by several minutes can substantially reduce the traffic to the server or disk. Unfortunately, delayed-write schemes introduce reliability problems, since unwritten data is lost whenever a server or client crashes. The Sprite file system uses this policy with a 30-second delay interval.
3. Write-on-close: data is written back to the server when the file is closed. The write-on-close policy is suitable for implementing session semantics, but it fails to give a considerable performance improvement for files that are open only for a short while, and it increases the latency of close operations. This approach is used in the Andrew file system and the Network File System (NFS).
There is a tight relation between the write policy and the semantics of file sharing. Write-on-close is suitable for session semantics. When concurrent file updates occur frequently in conjunction with UNIX semantics, the use of delayed-write will result in long delays; a write-through policy is more suitable for UNIX semantics under such circumstances.
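The following is a minimal sketch of the delayed-write policy described above, with a periodic flush in the spirit of Sprite's 30-second delay. The server object, its write() call, and the class name are assumptions made for illustration, not any system's actual interface.

```python
# A sketch of a delayed-write (write-back) client cache with a periodic flush.
import time

class DelayedWriteCache:
    def __init__(self, server, delay=30.0):
        self.server = server
        self.delay = delay
        self.blocks = {}    # (file_id, block_no) -> data
        self.dirty = {}     # (file_id, block_no) -> time the block was dirtied

    def write(self, file_id, block_no, data):
        # Writes complete as soon as the block is in the cache.
        self.blocks[(file_id, block_no)] = data
        self.dirty.setdefault((file_id, block_no), time.time())

    def delete(self, file_id):
        # Data deleted before the delay expires never reaches the server.
        for key in [k for k in self.blocks if k[0] == file_id]:
            self.blocks.pop(key, None)
            self.dirty.pop(key, None)

    def flush_expired(self):
        # Called periodically; pushes blocks dirty for longer than the delay.
        now = time.time()
        for key, since in list(self.dirty.items()):
            if now - since >= self.delay:
                self.server.write(key[0], key[1], self.blocks[key])
                del self.dirty[key]
```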
6.7.3 Client Cache Coherence in DFS
Cache coherence means that a client reading a file always sees its latest copy. In centralized Unix systems, a user always accesses the latest update. In a distributed computing environment the problem arises especially when caching is done at the client's machine. One way to relax the problem is to rely on sessions: open the file, update it, and then close it, so that coherence is based on session semantics rather than on individual read and write operations; propagating every individual user update would generate high traffic.
The following methods are used to maintain coherence (according to a model, e.g. UNIX semantics or session semantics) of copies of the same file at various clients:
• Write-through: writes are sent to the server as soon as they are performed at the client. This generates high traffic and requires cache managers to check (e.g., the modification time) with the server before they can provide cached content to any client.
• Delayed write: coalesces multiple writes; better performance but ambiguous semantics.
• Write-on-close: implements session semantics.
• Central control: the file server keeps a directory of open/cached files at clients, which yields Unix semantics but has problems with robustness and scalability; invalidation messages are also problematic because clients did not solicit them.
6.7.4 Cache Validation and Consistency:
Cache validation is required to find out if the data in the cache is a stale copy of the
master copy. If the client determines that its cached data is out of date, then future
accesses can no longer be served by that cached data. An up-to-date copy of the data
must be brought over from the file server. There are basically two approaches to
verifying the validity of the cached data:
1. Client-initiated approach. A client-initiated approach to validation involves contacting the server to check whether both have the same version of the file. Checking is usually done by comparing header information such as a time-stamp of updates or a version number (e.g., the time stamp of the last update, which is maintained in the i-node information in UNIX). The frequency of the validity check is the crux of this approach; it can vary from a check performed on each access to a check initiated at fixed time intervals. When performed with every access, a file access experiences more delay than one served immediately by the cache. Depending on its frequency, this kind of validity check can cause severe network traffic, as well as consume precious server CPU time. This phenomenon led the Andrew designers to abandon this approach.
2. Server-initiated approach. In the server-initiated approach, whenever a client
caches an object, the server hands out a promise (called a callback or a token) that it
will inform the client before allowing any other client to modify that object. This
approach enhances performance by reducing network traffic, but it also increases the
responsibility of the server in direct proportion to the number of clients being served,
not a good feature for scalability. The server records for each client the (parts of) files
the client caches. Maintaining information on clients has significant fault tolerance
implications. A potential for inconsistency occurs when a file is cached in conflicting
modes by two different clients (i.e., at least one of the clients specified a write mode).
If session semantics is implemented, whenever a server receives a request to close a
file that has been modified, it should react by notifying the clients to discard their
cached data and consider it invalid. Clients having this file open at that time discard
their copy when the current session is over. Other clients discard their copy at once.
Under session semantics, the server need not be informed about Opens of already cached files; it is informed, however, about the Close of a writing session.
On the other hand, if a more restrictive sharing semantics is implemented, like UNIX
semantics, the server must be more involved. The server must be notified whenever a
file is opened, and the intended mode (Read or Write) must be indicated. Assuming
such notification, the server can act when it detects a file that is opened
simultaneously in conflicting modes by disabling caching for that particular file (as
done in Sprite). Disabling caching results in switching to a remote access mode of
operation. The problem with the server-initiated approach is that it violates the traditional client-server model, where clients initiate activities by requesting the desired services. Such a violation can result in irregular and complex code for both clients and servers.
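A minimal sketch of the server-initiated (callback) approach is shown below. The class names, the in-memory dictionaries standing in for storage, and the break_callback notification hook are hypothetical stand-ins for the RPC machinery a real system such as AFS would use.

```python
# A sketch of server-initiated validation with callback promises.
class Client:
    def __init__(self, name):
        self.name = name
        self.cache = {}                 # file_id -> locally cached data

    def break_callback(self, file_id):
        # The server is about to let another client modify the file, so the
        # cached copy is discarded; the next access will refetch it.
        self.cache.pop(file_id, None)

class CallbackServer:
    def __init__(self):
        self.files = {}                 # file_id -> data (stand-in for storage)
        self.callbacks = {}             # file_id -> clients holding a promise

    def fetch(self, client, file_id):
        # Hand out the data together with a callback promise for this client.
        self.callbacks.setdefault(file_id, set()).add(client)
        data = self.files.get(file_id)
        client.cache[file_id] = data
        return data

    def store(self, writer, file_id, data):
        # Before accepting the update, break the promise held by every other
        # caching client so that stale copies are invalidated.
        for client in self.callbacks.get(file_id, set()) - {writer}:
            client.break_callback(file_id)
        self.callbacks[file_id] = {writer}
        self.files[file_id] = data
```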
The implementation techniques for cache consistency check depend on the semantics
used for sharing files. Caching entire files is a perfect match for session semantics.
Read and Write accesses within a session can be handled by the cached copy, since
the file can be associated with different images according to the semantics. The cache
consistency problem is reduced to propagating the modifications performed in a
session to the master copy at the end of a session. This model is quite attractive since
it has a simple implementation. Observe that coupling this semantics with caching parts
of files may complicate matters, since a session is supposed to read the image of the
entire file that corresponds to the time it was opened.
A distributed implementation of the UNIX semantics using caching has serious
consequences. The implementation must guarantee that at all times only one client is
allowed to write to any of the cached copies of the same file. A distributed conflict
resolution scheme must be used in order to arbitrate among clients wishing to access
the same file in conflicting modes. In addition, once a cached copy is modified, the
changes need to be propagated immediately to the rest of the cached copies. Frequent
writes can generate tremendous network traffic and cause long delays before requests
are satisfied. This is why implementations (e.g., Sprite) disable caching altogether
and resort to remote service once a file is concurrently open in conflicting modes.
Observe that such an approach implies some form of a server-initiated validation
scheme, where the server makes a note of all Open calls. As was stated, UNIX
semantics lend itself to an implementation where all requests are directed and served
by a single server.
The immutable shared files semantics eliminates the cache consistency problem entirely, since such files cannot be written. The transaction-like semantics can be implemented in a straightforward manner using locking, where all the requests for the same file are served by the same server on the same machine, as is done in remote service.
For session semantics, cache consistency is easy to implement by propagating changes to the master copy after the file is closed. For implementing UNIX semantics, writes to a cache have to be propagated not only to the server but also to other clients holding a stale copy in their caches. This may lead to poor performance, which is why many DFSs (such as Sprite) switch to remote service when a client opens a file in a conflicting mode. Write-back caching is used in Sprite and Echo. Andrew and Coda use a write-through policy, for implementation simplicity and to reduce the chances of server data being stale due to client crashes. Both systems use deferred write-back while operating in disconnected mode, during server or network failures.
Maintaining cache coherence is unnecessary if the data in the cache is treated as a
hint and is validated upon use. File data in a cache cannot be used as a hint since the
use of a cached copy will not reveal whether it is current or stale. Hints are most often
used for file location information in DFS. Andrew for instance caches individual
mappings of volumes to servers. Sprite caches mappings of pathnames prefixes to
servers.
Caching can thus handle a substantial amount of remote accesses in an efficient
manner. This leads to performance transparency. It is also believed that client caching
is one of the main contributing factors towards fault-tolerance and scalability. The
effective use of caching can be done by studying the usage properties of files. For
instance we could have write-through if we know that the sequential write-sharing of
user files is uncommon. Also executables are frequently read, but rarely written and
are very good candidates for caching. In a distributed system, it may be very costly to
enforce transaction like semantics, as required by databases, which exhibit poor
locality, fine granularity of update and query and frequent concurrent and sequential
write sharing. In such cases, it is best to provide explicit means, outside the scope of
the DFS. This is the approach followed in the Andrew and Coda DFS.
6.7.5 Comparison of Caching and Remote Service
The choice between caching and remote service is a choice between potential for
improved performance and simplicity. Following are the advantages and disadvantages of
the two methods.
• When using a caching scheme, a substantial amount of the remote accesses can be handled efficiently by the local cache; in a DFS, the scheme's goal is to reduce network traffic. With remote access, there is excessive overhead in network traffic and an increase in the server load.
• The total network overhead of transmitting big chunks of data, as done in caching, is lower than when a series of short responses to specific requests is transmitted.
• The cache consistency problem is the major drawback of caching. When writes are frequent, the consistency problems incur substantial overhead in terms of performance, network traffic, and server load.
• To use caching and benefit from it, clients must have either local disks or large main memories. Clients without disks can use the remote-service method without any problems.
• In caching, data is transferred in bulk between the server and client rather than in response to the specific needs of a file operation. Therefore, the interface of the server is quite different from that of the client. In remote service, on the other hand, the interface of the server is just an extension of the local file system interface across the network.
• It is hard to emulate the sharing semantics of a centralized system (Unix sharing semantics) in a system using caching, whereas with remote service it is easier to implement and maintain the Unix sharing semantics.
6.8 Concurrency Control
6.8.1 Transactions in a Distributed File System
The term atomic transaction describes a single client carrying out a sequence of operations on a shared file without interference from another client. The net result of every transaction must be the same as if each transaction were performed at a completely separate instant of time. An atomic transaction (in a file service) enables a client program to define a sequence of operations on a file without interference from any other client program, so as to ensure a consistent result. The file server that supports transactions must synchronize the operations to meet this criterion. Also, if a file undergoing modification by the file service faces an unexpected server or client process halt, due to a hardware error or a software fault, before the transaction is completed, the server ensures the subsequent restoration of the file to the state it was in before the transaction started. Though it is a sequence of operations, an atomic transaction can be viewed as a single-step operation from the client's point of view, moving the file from one stable state to another: either the transaction completes successfully or the file is restored to its original state.
An atomic transaction must satisfy two criteria to prevent conflicts between two client processes requesting operations on the same data item concurrently. First, it should be recoverable. Second, the concurrent execution of several atomic transactions must be serially equivalent, i.e., the effect of several concurrent transactions must be the same as if they were executed one at a time. To ensure the atomicity of transactions, concurrency control is performed via locking, time stamping, optimistic concurrency control, etc., the details of which are explained later in the concurrency control chapter.
6.9 Security and Protection
To encourage sharing of files between users, the protection mechanism should allow a wide range of policies to be specified. As we discussed in the directory service, there are two important techniques for controlling access to files: 1) capability-based access, and 2) access control lists.
A client process that has a valid Unique File Identifier (UFID) can use the file service to access the file, with the help of the directory service, which stores mappings from users' names for files to UFIDs. When a service or a group passes the authentication check, such as a name or password check, it is given a UFID, which generally contains a large sparse number to reduce the risk of counterfeiting. UFIDs are issued after the user or the service is registered in the system. Authentication is done by the authentication service, which maintains a table of user names, service names, passwords, and the corresponding user identifiers (IDs). Each file has an owner (initially the creator) whose identity is stored in the attributes of the created file and is subsequently used by the identity-based file access control scheme. An access control list contains the user IDs of all the users who are entitled to access the file directly or indirectly.
Generally, the owner of the file can perform all file operations using the file service. Other users have lesser access rights to the same file (e.g., read-only). The users of any file can be classified, based on their requirements and their need to access a given file, as follows:
• The file's owner.
• The directory service, which is responsible for controlling access and mapping the file by its text names.
• A client who is given special permission to access the file on behalf of the owner to manage the file contents, and who is therefore recognized by the system manager.
• All other clients.
In large distributed systems, simple extensions of the mechanisms used by time-sharing
operating systems are not sufficient. For example, some systems implement authentication by sending a password to the server, which then validates it. Besides being risky, this approach leaves the client uncertain of the identity of the server. Security can be built on the integrity of a
relatively small number of servers, rather than a large number of clients, as is done in the
Andrew file system.
The authentication function is integrated with the RPC mechanism. When a user logs on,
the user’s password is used as a key to establish a connection with an authentication
server. This server hands the user a pair of authentication tokens, which are used by the
user to establish secure RPC connections with any other server. Tokens expire
periodically, typically in 24 hours. When making an RPC call to a server, the client
supplies a variable-length identifier and the encryption key to the server. The server looks
up the key to verify the identity of the client. At the same time, the client is assured that
the server has the capability of looking up its key and hence can be trusted. Randomized
information in the handshake guards against replays by suspicious clients. It is important
that the authentication servers and file servers run on physically secured hardware and
safe software. Furthermore, there may be multiple redundant instances of the
authentication server, for greater availability.
Access Rights
In a DFS, there is more data to protect from a large number of users. The access
privileges provided by the native operating systems are either inadequate or absent. Some DFSs, such as Andrew and Coda, maintain their own schemes for deciding access rights.
Andrew implements a hierarchical access-list mechanism, in which a protection domain
consists of users and groups. Membership privileges in a group are inherited and the
user's privileges are the sum of the privileges of all the groups that he or she belongs to.
Also privileges are specified for a unit of a file system such as directories, rather than
individual files. Both these factors simplify the state information to be maintained.
Negative access rights can also be specified, for quick removal of a user from critical
groups. In case of conflicts, negative rights overrule positive rights.
6.10 Case Studies
6.10.1 Sun Network File System (NFS)
The Network File System (NFS) has been designed, specified, and implemented by Sun Microsystems Inc. since 1985.
Overview
NFS views a set of interconnected workstations as a set of independent machines with
independent file systems. It allows some degree of sharing based on a client-server
relationship among the file systems in a transparent manner. A machine may be both a
client and server. Sharing is allowed between any pair of machines, not only with
dedicated server machines. Consistent with the independence of machines is the fact that
sharing of a remote file system affects only the client and no other machine. Hence there
is no notion of a globally shared file system as in Locus, Sprite and Andrew.
Advantages of Sun's NFS
• It supports diskless Sun workstations entirely by way of the NFS protocol.
• It provides the facility for sharing files in a heterogeneous environment of machines,
operating systems, and networks. Sharing is accomplished by mounting a remote file
system, then reading or writing files in place.
• It is open-ended. Users are encouraged to interface it with other systems.
It was not designed by extending SunOS into the network; instead, operating system independence was taken as an NFS design goal, along with machine independence, simple
crash recovery, transparent access, maintenance of UNIX file system semantics, and
reasonable performance. These advantages have made NFS a standard in the UNIX
industry today.
NFS Description
NFS provides transparent file access among computers of different architectures over one or more networks and keeps the different file structures and operating systems transparent to users. A brief description of the salient points is given below.
The NFS protocol is a set of primitives that defines the operations that can be performed on a distributed file system. The protocol is defined in terms of a set of Remote Procedure Calls (RPCs), their arguments and results, and their effects.
NFS protocol
1. RPC and XDR: The RPC mechanism is implemented as a library of procedures plus a specification for portable data transmission, known as the External Data Representation (XDR). Together with RPC, XDR provides a standard I/O library for interprocess communication. The RPCs used to define the NFS protocol are null(), lookup(), create(), remove(), getattr(), setattr(), read(), write(), rename(), link(), symlink(), readlink(), mkdir(), rmdir(), readdir(), and statfs(). The most common NFS procedure parameter is a structure called a file handle, which is provided by the server and used by the client to reference the file in subsequent calls.
2. Stateless protocol: The NFS protocol is stateless because each transaction stands on its own; the server does not keep track of any past client requests, so a client can simply retry a call until the packet gets through.
3. Transport independence: A new transport protocol can be plugged into the RPC implementation without affecting the higher-level protocol code. In the current implementation, NFS uses UDP/IP as the transport protocol.
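To illustrate the flavor of a stateless, handle-based interface like the one described above, the sketch below implements lookup, read, and getattr over a local directory tree. It is not the actual NFS wire protocol: handles here are plain strings, and there is no XDR encoding or UDP transport.

```python
# A sketch of a stateless, handle-based file service over a local directory tree.
import os

class StatelessFileServer:
    def __init__(self, export_root):
        self.root = export_root

    def root_handle(self):
        return "."                               # opaque handle for the root

    def lookup(self, dir_handle, name):
        # Each call carries everything needed to resolve one component; the
        # server keeps no record of past requests.
        path = os.path.join(self.root, dir_handle, name)
        if not os.path.exists(path):
            raise FileNotFoundError(name)
        return os.path.join(dir_handle, name)    # handle for subsequent calls

    def getattr(self, handle):
        st = os.stat(os.path.join(self.root, handle))
        return {"size": st.st_size, "mtime": st.st_mtime}

    def read(self, handle, offset, count):
        with open(os.path.join(self.root, handle), "rb") as f:
            f.seek(offset)
            return f.read(count)
```

Because every call is self-contained, a crashed server can simply be restarted and clients can retry their requests with no recovery protocol.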
The UNIX operating system does not guarantee that internal file identification is unique within a local area network; in a distributed system, the identification of a file or file system can collide with that of another file on a remote system. To solve this problem, Sun added a new file system interface to the UNIX operating system kernel. This improvement can uniquely locate and identify both local and remote files. The file system interface consists of the virtual file system (VFS) interface and the virtual node (vnode) interface. Instead of the inode, the operating system deals with the vnode.
Figure 6.5 Schematic view of the NFS architecture: on the client, the system call interface sits above a VFS interface that routes requests either to local file systems (e.g., Unix 4.2) or to the NFS client; the NFS client communicates via RPC/XDR across the network with the NFS server, which goes through the server's VFS interface to its local Unix 4.2 file system and disk.
When a client makes a request to access a file, the request goes through the VFS, which uses the vnode to determine whether the file is local or remote. If it is local, the vnode refers to the i-node and the file is accessed like any other Unix file. If it is remote, the file handle is used, and the RPC protocol contacts the remote server to obtain the required file. VFS defines the procedures and data structures that operate on the file system as a whole, and the vnode interface defines the procedures that operate on an individual file within that file system type.
Pathname Translation
Pathname translation is done by breaking the path into component names and performing a separate NFS lookup call for every pair of component name and directory vnode; thus, lookups are performed remotely by the server. Once a mount point is crossed, every component lookup causes a separate RPC to the server. This expensive pathname traversal is needed because each client has a unique layout of its logical name space, dictated by the mounts it performed. A directory name lookup cache at the client, which holds the vnodes for remote directory names, speeds up references to files with the same initial pathname. The cache is discarded when attributes returned from the server do not match the attributes of the cached vnode.
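The sketch below illustrates per-component translation with a client-side directory name lookup cache. The server object follows the hypothetical handle-based interface sketched earlier; it is not the SunOS implementation.

```python
# A sketch of per-component pathname translation with a directory name lookup
# cache (dnlc: a plain dict mapping a pathname prefix to a handle).
def translate(server, pathname, dnlc):
    handle = server.root_handle()
    prefix = ""
    for component in (c for c in pathname.split("/") if c):
        prefix = prefix + "/" + component
        cached = dnlc.get(prefix)
        if cached is not None:
            handle = cached                              # cache hit: no RPC
        else:
            handle = server.lookup(handle, component)    # one call per component
            dnlc[prefix] = handle
    return handle
```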
Caching
There is a one-to-one correspondence between the regular UNIX system calls for
file operations and NFS protocol RPCs with the exception of opening and closing files.
Hence a remote file operation can be translated directly to the corresponding RPC.
Conceptually, NFS adheres to the remote service paradigm but in practice buffering and
caching techniques are used for the sake of performance, i.e. no correspondence exists
between a remote operation and a RPC. File blocks and file attributes are fetched by the
RPCs and cached locally.
Caches are of two types: file block cache and file attribute (i-node information)
cache. On a file open, the kernel checks with the remote server about whether to fetch or
revalidate the cached attributes by comparing the time-stamps of the last modification.
The cached file blocks are used only if the corresponding cached attributes are up to date.
Both read-ahead and delayed-write techniques are used between the client and server. Clients do not free delayed-write blocks until the server confirms that the data has been written to disk.
Performance tuning of the system makes it difficult to characterize the sharing semantics of NFS. New files created on a machine may not be visible elsewhere for a short duration of time, and it is indeterminate whether writes to a file at one site are visible to other sites that have the file open for reading. New opens of that file observe only the changes that have already been flushed to the server. Thus NFS fails to provide strict emulation of UNIX semantics.
Summary
• Logical name structure: A global name hierarchy does not exist; every machine establishes its own view of the name structure. Each machine has its own root serving as a private and absolute point of reference for its own view of the name structure. Hence users enjoy some degree of independence, flexibility, and privacy, sometimes at the expense of administrative complexity.
• Remote service: When a file is accessed transparently, I/O operations are performed according to the remote service method, i.e., the data in the file is not fetched all at once; instead, the remote site potentially participates in each read and write operation.
• Fault tolerance: The stateless approach to the design of the servers results in resiliency to client, server, or network failures.
• Sharing semantics: NFS does not provide UNIX semantics for concurrently open files.
Figure 6.8 Local and remote file systems accessible on an NFS client. The file system mounted at /usr/students in the client is actually the sub-tree located at /export/people in Server 1; the file system mounted at /usr/staff in the client is actually the sub-tree located at /nfs/users in Server 2.
6.10.2 Sprite
Sprite is distinguished by its performance, and it uses main-memory caching as the principal tool to achieve it. Sprite is an experimental distributed system developed at the University of California at Berkeley. It is part of the SPUR project, whose goal is the design and construction of a high-performance multiprocessor workstation.
Overview
The designers of Sprite envisioned the next generation of workstations as powerful machines with vast main memories of 100 to 500 MB. By caching files from dedicated servers, these physical memories compensate for the lack of local disks in diskless workstations.
Features
Sprite uses ordinary files to store the data and stacks of running processes, instead of the special disk partitions used by many versions of UNIX. This simplifies process migration and enables flexibility and sharing of the space allocated for swapping. In Sprite, clients can read random pages from a server's (physical) cache faster than from a local disk, which shows that a server with a large cache may provide better performance than a local disk. The interface provided by Sprite is very similar to the one provided by UNIX: a single tree encompasses all the files and devices in the network, making them equally and transparently accessible from every workstation. Location transparency is complete, i.e., a file's network location cannot be discerned from its name.
Description
Caching
An important aspect of the Sprite file system is its capitalizing on large main memories and its advocacy of diskless workstations, storing file caches in-core. The same caching scheme is used to avoid local disk accesses as well as to speed up remote accesses. In Sprite, file information is cached in the main memories of both servers (workstations with disks) and clients (workstations wishing to access files on non-local disks). The caches are organized on a block basis, each block being 4 KB.
Sprite does not use read-ahead to speed up sequential reads; instead it uses a delayed-write approach to handle file modifications. Exact emulation of UNIX semantics is one of Sprite's goals, and a hybrid cache validation method is used to this end. Files are associated with version numbers. When a client opens a file, it obtains the file's current version number from the server and compares this number to the version number associated with the cached blocks for that file. If the version numbers differ, the client discards all cached blocks for the file and reloads its cache from the server when the blocks are needed.
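A minimal sketch of this version-number validation on open is given below; the class name and the server's current_version call are assumptions made for illustration, not Sprite's actual kernel interface.

```python
# A sketch of open-time cache validation with per-file version numbers.
class VersionedClientCache:
    def __init__(self, server):
        self.server = server
        self.cached_version = {}     # file_id -> version of the cached blocks
        self.blocks = {}             # (file_id, block_no) -> cached 4 KB block

    def open(self, file_id):
        current = self.server.current_version(file_id)
        if self.cached_version.get(file_id) != current:
            # Stale cache: discard every block of this file; blocks are
            # reloaded from the server only when they are actually needed.
            for key in [k for k in self.blocks if k[0] == file_id]:
                del self.blocks[key]
            self.cached_version[file_id] = current
        return current
```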
Looking up files with prefix tables
Sprite presents its users with a single file system hierarchy that is composed of several subtrees called domains. Each server stores one or more domains. Each machine maintains a server map called a prefix table, which maps domains to servers. This mapping is built and updated dynamically by a broadcast protocol. Every entry in a prefix table corresponds to one of the domains. It contains the pathname of the topmost directory in the domain, the network address of the server storing the domain, and a numeric designator identifying the domain's root directory for the storing server. This designator is an index into the server's table of open files; it saves repeating an expensive name translation.
Every lookup operation for an absolute pathname starts with the client searching its prefix table for the longest prefix matching the given file name. The client strips the matching prefix from the file name and sends the remainder of the name to the selected server, along with the designator from the prefix table entry. The server uses this designator to locate the root directory of the domain, then proceeds by the usual UNIX pathname translation for the remainder of the file name. When the server succeeds in completing the translation, it replies with a designator for the open file.
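A minimal sketch of the client's longest-prefix lookup step is shown below; the table contents and return values are hypothetical, and the broadcast protocol that fills in missing entries is not modeled.

```python
# A sketch of longest-prefix lookup against a prefix table that maps a domain
# prefix to (server_address, root_designator).
def locate(prefix_table, pathname):
    matches = [p for p in prefix_table
               if pathname == p or pathname.startswith(p.rstrip("/") + "/")]
    if not matches:
        return None                       # would trigger the broadcast protocol
    best = max(matches, key=len)          # longest matching prefix wins
    server, designator = prefix_table[best]
    remainder = pathname[len(best):].lstrip("/")
    return server, designator, remainder

# Example: with {"/": ("srvA", 1), "/a/b": ("srvB", 7)}, the path "/a/b/c/d"
# matches the longest prefix "/a/b", so ("srvB", 7, "c/d") is returned and
# only "c/d" is sent to srvB together with designator 7.
```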
Location Transparency
Like almost all modern network systems, Sprite achieves location transparency. This means that users can manipulate files in the same ways they did under time-sharing on a single machine; the distributed nature of the file system and the techniques used to access remote files are invisible to users under normal conditions. Most network file systems fail to meet the transparency goal in one or more ways. The earliest systems allowed remote file access only through a few special programs. Second-generation systems allow any application to access files on any machine in the network, but special names must be used for remote files. Third-generation network file systems, such as Sprite and Andrew, provide transparency. Sprite provides complete transparency, so applications running on different workstations see the same behavior they would see if all applications were executing on a single time-shared machine. Sprite also provides transparent access to remote I/O devices. Like UNIX, Sprite represents devices as special files; unlike most versions of UNIX, it allows any process to access any device, regardless of the device's location.
Summary
The Sprite file system can be summarized by the following points.
• Semantics: Sprite sacrifices some performance in order to emulate UNIX semantics, thus eliminating the possibility, and the benefits, of caching in big chunks.
• Extensive use of caching: Sprite is inspired by the vision of diskless workstations with huge main memories and accordingly relies heavily on caching.
• Prefix tables: For LAN-based systems, prefix tables are a highly efficient, dynamic, versatile, and robust mechanism for file lookup, the advantages being the built-in facility for processing whole prefixes of pathnames and the supporting broadcast protocol that allows dynamic changes in the tables.
6.10.3 Andrew File System
Andrew, which is distinguished by its scalability, is a distributed computing environment that has been under development since 1983 at CMU. The Andrew file system constitutes the underlying information-sharing mechanism among users of the environment. The most formidable requirement of Andrew is its scale.
Overview
Andrew distinguishes between client machines and dedicated server machines. Clients are presented with a partitioned space of file names: a local name space and a shared name space. A collection of dedicated servers, collectively called Vice, presents the shared name space to the clients as an identical and location-transparent file hierarchy. The local name space is the root file system of a workstation, from which the shared name space descends (Figure 6.8). Workstations are required to have local disks where they store their local name space, whereas servers collectively are responsible for the storage and management of the shared name space. The local name space is small, distinct for each workstation, and contains system programs essential for autonomous operation and better performance, temporary files, and files the workstation owner explicitly wants to store locally for privacy reasons.
Figure 6.8 Andrew's name space: the local name space under the root (/) holds tmp, bin, vmunix, etc., while the shared name space descends under cmu; symbolic links in the local space point into the shared space.
The key mechanism selected for remote file operations is whole file caching. Opening a
file causes it to be cached, in its entirety, in the local disk. Reads and writes are directed
to the cached copy without involving the servers. Entire file caching has many merits, but
cannot efficiently accommodate remote access to very large files. Thus, a separate design
will have to address the issue of usage of large databases in the Andrew environment.
Features
• User mobility: Users are able to access any file in the shared name space from any workstation. The only noticeable effect of a user accessing files from other than the usual workstation would be some initial degraded performance due to the caching of files.
• Heterogeneity: Defining a clear interface to Vice is key to the integration of diverse workstation hardware and operating systems. To facilitate heterogeneity, some files in the local /bin directory are symbolic links pointing to machine-specific executable files residing in Vice.
• Protection: Andrew provides access lists for protecting directories and the regular UNIX bits for file protection. The access list mechanism is based on a recursive group structure.
Figure 6.9 Distribution of processes in the Andrew File System: each workstation runs user programs and a Venus process on top of the UNIX kernel, while each server runs a Vice process on top of the UNIX kernel; workstations and servers communicate over the network.
Description
Scalability of the Andrew File System
There are no magic guidelines to ensure the scalability of a system, but the Andrew file system illustrates several methods that make a system scalable.
Location Transparency
Andrew offers true location transparency: the name of a file contains no location information. Rather, this information is obtained dynamically by clients during normal operation. Consequently, administrative operations such as the addition or removal of servers and the redistribution of files across them are transparent to users. In contrast, some file systems require users to explicitly identify the site at which a file is located. Location transparency can be viewed as a binding issue: the binding of location to name is static and permanent when pathnames with embedded machine names are used, whereas the binding is dynamic and flexible in Andrew. Usage experience has confirmed the benefits of a fully dynamic location mechanism in a large distributed environment.
Client Caching
The caching of data at clients is undoubtedly the architectural feature that contributes most to scalability in a distributed file system. Caching has been an integral part of the Andrew design from the beginning. In implementing caching, one has to make three decisions: where to locate the cache, how to maintain cache coherence, and when to propagate modifications.
Andrew caches on the local disk, with a further level of file caching by the Unix kernel in main memory. Disk caches contribute to scalability by reducing network traffic and server load on client reboots.
Cache coherence can be maintained in two ways. One approach is for the client to validate a cached object upon use. A more scalable approach is used in Andrew: when a client caches an object, the server hands out a promise (called a callback or token) that it will notify the client before allowing any other client to modify that object. Although more complex to implement, this approach minimizes server load and network traffic, thus enhancing scalability. Callbacks further improve scalability by making it viable for clients to translate pathnames entirely locally.
Existing systems use one of two approaches to propagate modifications from client to server. Write-back caching, used in Sprite, is the more scalable approach. Andrew uses a write-through caching scheme; this is a notable exception to scalability being the dominant design consideration in Andrew.
Bulk Data transfer
An important issue related to caching is the granularity of data transfers between
client and server. Andrew uses whole-file caching. This enhances scalability by reducing
server load, because clients need only contact servers on file open and close requests. The
far more numerous read and write operations are invisible to servers and cause no
network traffic. Whole-file caching also simplifies cache management because clients
only have to keep track of files, not individual pages, in their cache. When caching is
done at large granularity, considerable performance improvement can be obtained by the
use of a specialized bulk data-transfer protocol. Network communication overhead
caused by protocol processing typically accounts for a major portion of the latency in a
distributed file system. Transferring data in bulk reduces this overhead.
Token-Based Mutual Authentication
The approach used in Andrew to implement authentication is to provide a level of
indirection using authentication tokens. When a user logs in to a client, the password
typed in is used as the key to establish a secure RPC connection to an authentication
server. A pair of authentication tokens are then obtained for the user on this secure
connection. These tokens are saved by the client and are used by it to establish secure
RPC connections on behalf of the user to file servers. Like a file server, an authentication
server runs on physically secure hardware. To improve scalability and to balance load,
there are multiple instances of the authentication server. Only one instance accepts
updates; the others are slaves and respond only to queries.
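The sketch below illustrates the indirection in a simplified way: the password is used only once, against the authentication server, which hands back a token pair that file servers can verify later. The classes, the HMAC-based token format, and the detail that the verifying server shares the authentication server's key are all simplifying assumptions, not the actual Andrew (Kerberos-style) protocol.

    # Hedged sketch of token-based indirection: file servers see tokens, never
    # passwords. Names and the token format are illustrative assumptions.
    import hmac, hashlib, os, time

    class AuthServer:
        def __init__(self, user_keys):
            self.user_keys = user_keys       # user -> password-derived key
            self.server_key = os.urandom(32) # in reality shared with file servers

        def login(self, user, password):
            # Stand-in for the secure RPC handshake keyed by the password.
            if self.user_keys.get(user) != password:
                raise PermissionError("bad password")
            secret = os.urandom(16)
            proof = hmac.new(self.server_key, user.encode() + secret,
                             hashlib.sha256).hexdigest()
            # The "pair of tokens": a shared secret plus a server-signed proof.
            return {"user": user, "secret": secret, "proof": proof,
                    "expires": time.time() + 8 * 3600}

    class SecureFileServer:
        def __init__(self, auth: AuthServer):
            self.auth = auth  # simplification: direct access to the shared key

        def open_connection(self, token):
            expected = hmac.new(self.auth.server_key,
                                token["user"].encode() + token["secret"],
                                hashlib.sha256).hexdigest()
            if token["proof"] != expected or time.time() > token["expires"]:
                raise PermissionError("invalid or expired token")
            return f"secure RPC connection for {token['user']}"

    auth = AuthServer({"alice": "secret-pw"})
    token = auth.login("alice", "secret-pw")          # password used only here
    print(SecureFileServer(auth).open_connection(token))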
Hierarchical Groups and Access Lists
Controlling access to data is substantially more complex in large-scale systems
than it is in smaller systems. There is more data to protect and more users to make access
control decisions about. To enhance scalability, Andrew organizes its protection
domain hierarchically and supports a full-fledged access-list mechanism.
The protection domain is composed of users and groups. Membership in a group is
inherited, and a user’s privileges are the cumulative privileges of all groups he or she
belongs to, either directly or indirectly.
Andrew uses an access-list mechanism for file protection. The total rights
specified for a user are the union of the rights specified for the user and for the groups he
or she belongs to. Access lists are associated with directories rather than individual files. The
reduction in state obtained by this design decision provides conceptual simplicity that is
valuable at large scale. Although the real enforcement of protection is done on the basis
of access lists, Venus superimposes an emulation of Unix protection semantics by
honoring the owner component of the Unix mode bits on a file. The combination of
access lists on directories and mode bits on files has proved to be an excellent
compromise between protection at fine granularity, scalability, and Unix compatibility.
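A minimal sketch of the rights computation described above: a user's effective rights on a directory are the union of the rights granted to the user and to every group reached through direct or inherited membership. The group names and the rights letters are made up for the example.

    # Effective rights as the union over the user and all transitively
    # reachable groups. Names and rights are illustrative assumptions.

    def expand_memberships(principal, parent_groups):
        """Return the principal plus all groups reachable through membership."""
        seen, stack = set(), [principal]
        while stack:
            p = stack.pop()
            if p not in seen:
                seen.add(p)
                stack.extend(parent_groups.get(p, []))
        return seen

    def effective_rights(user, directory_acl, parent_groups):
        rights = set()
        for principal in expand_memberships(user, parent_groups):
            rights |= directory_acl.get(principal, set())
        return rights

    if __name__ == "__main__":
        groups = {"alice": ["staff"], "staff": ["campus"]}      # membership edges
        acl = {"alice": {"a"}, "staff": {"r", "l"}, "campus": {"l"}}
        # Union of rights from alice, staff, and campus.
        print(effective_rights("alice", acl, groups))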
Data Aggregation
In a large system, considerations of interoperability and system administration
assume major significance. To facilitate these functions, Andrew organizes file system
data into volumes. A volume is a collection of files located on one server and forming a
partial subtree of the Vice name space. Volumes are invisible to application programs and
are only manipulated by system administrators. The aggregation of data provided by
volumes reduces the apparent size of the system as perceived by operators and system
administrators. Operational experience with Andrew confirms the value of the volume
abstraction in a large distributed file system.
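A hedged sketch of the volume idea follows: pathname prefixes map to volumes, and a separate table (an assumption standing in for whatever location database an implementation keeps) maps each volume to the server currently holding it, so an administrator can move a volume without changing any user-visible pathname. The table contents are invented for the example.

    # Illustrative volume lookup: users see pathnames, administrators move volumes.

    VOLUME_OF_PREFIX = {          # partial subtrees of the shared name space
        "/afs/cs/users": "user.vol",
        "/afs/cs/project": "proj.vol",
    }
    VOLUME_LOCATION_DB = {        # maintained by administrators, not users
        "user.vol": "server-1",
        "proj.vol": "server-2",
    }

    def server_for(path: str) -> str:
        for prefix, volume in VOLUME_OF_PREFIX.items():
            if path.startswith(prefix):
                return VOLUME_LOCATION_DB[volume]
        raise LookupError("path is not in the shared name space")

    print(server_for("/afs/cs/project/report.txt"))   # server-2
    # Moving a volume is a purely administrative update; pathnames do not change.
    VOLUME_LOCATION_DB["proj.vol"] = "server-3"
    print(server_for("/afs/cs/project/report.txt"))   # server-3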
Decentralized Administration
A large distributed system is unwieldy to manage as a monolithic entity. For
smooth and efficient operation, it is essential to delegate administrative responsibility
along lines that parallel institutional boundaries. Such a system decomposition has to
balance site autonomy with the desirable but conflicting goal of system-wide uniformity
in human and programming interfaces. The cell mechanism of AFS-3 is an example of a
mechanism that provides this balance.
A cell corresponds to a completely autonomous Andrew system, with its own
protection domain, authentication and file servers, and system administrators. A
federation of cells can cooperate in presenting users with a uniform, seamless file name
space.
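The following sketch illustrates, under assumed path conventions and invented cell names, how a federation of cells can still look like one name space: the cell name embedded in a path selects which cell's servers handle the remainder of the path.

    # Illustrative sketch (not actual AFS-3 code) of cell-qualified name resolution.
    CELL_SERVERS = {
        "cs.univ-a.edu": "afs-db.cs.univ-a.edu",
        "ee.univ-b.edu": "afs-db.ee.univ-b.edu",
    }

    def resolve_cell(path: str):
        # Expect paths of the form /afs/<cell>/<rest>; each cell is administered
        # autonomously, but all cells appear under the same /afs root.
        parts = path.split("/")
        if len(parts) < 4 or parts[1] != "afs" or parts[2] not in CELL_SERVERS:
            raise LookupError("not a federated AFS path")
        cell, rest = parts[2], "/".join(parts[3:])
        return CELL_SERVERS[cell], rest

    print(resolve_cell("/afs/cs.univ-a.edu/users/alice/notes.txt"))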
Heterogeneity
As a distributed system evolves it tends to grow more diverse. One factor
contributing to diversity is the improvement in performance and decrease in cost of
hardware over time. This makes it likely that the most economical hardware
configurations will change over the period of growth of the system. Another source of
heterogeneity is the use of different computer platforms for different applications.
Andrew did not set out to be a heterogeneous computing environment. Initial plans for it
envisioned a single type of client running one operating system, with the network
constructed of a single type of physical media. Yet heterogeneity appeared early in its
history and proliferated with time. Some of this heterogeneity is attributed to the
decentralized administration typical of universities, but much of it is intrinsic to the
growth and evolution of any distributed system.
Coping with heterogeneity is inherently difficult, because of the presence of
multiple computational environments, each with its own notions of file naming and
functionality. The PC Server [Rifkin, 1987] is used to perform this bridging function in
the Andrew environment.
Summary
The highlights of the Andrew file system are:
• Name space and service model: Andrew explicitly distinguishes between local and
shared name spaces as well as between clients and servers. Clients have a small and
distinct local name space and can access the shared name space managed by the
servers.
• Scalability: Andrew is distinguished by its scalability; the strategy adopted to address
scale is whole-file caching, which reduces server load. Servers are not involved in
reading and writing operations. The callback mechanism was invented to reduce the
number of validity checks.
• Sharing semantics: Andrew's semantics, which are simple and well-defined, ensure
that a file's updates are visible across the network only after the file has been closed.
6.9.4 Locus
Overview
Locus is an ambitious project aimed at building a full-scale operating system. The
features of Locus are automatic management of replicated data, atomic file updates,
remote tasking, the ability to tolerate failures to a certain extent, and a full
implementation of nested transactions. Locus has been operational at UCLA for several
years on a set of mainframes and workstations connected by an Ethernet. The main
component of Locus is its DFS, which presents a single tree-structured naming hierarchy
to users and applications. This structure covers all objects of all machines in the system.
Locus is a location-transparent system, i.e., from the name of an object one cannot
determine its location in the network.
Features
Fault tolerance issues receive special emphasis in Locus. Network failures may split
the network into two or more disconnected subnetworks (partitions). As long as at least
one copy of a file is available in a subnetwork, read requests are served and it is still
guaranteed that the version read is the most recent one available in that disconnected
network. Upon reconnection of these subnetworks, automatic mechanisms take care of
updating stale copies of files.
Seeking high performance in the design of Locus has led to incorporating network
functions such as formatting, queuing, transmitting, and retransmitting messages into the
operating system. Specialized remote procedure call protocols were devised for
kernel-to-kernel communication. Avoiding the multilayering suggested in the ISO
standard enabled high performance for remote operations. A file in Locus may
correspond to a set of copies (replicas) distributed in the system. It is the
responsibility of Locus to maintain consistency and coherency among the versions of a
certain file. Users have the option to choose the number and locations of the replicated
files.
Description
Locus uses a logical name structure to hide both location and replication details
from users and applications. A removable file system in Locus is called a filegroup; a
filegroup is the component unit in Locus. Logically, filegroups are joined together to
form a unified structure. Physically, a logical filegroup is mapped to multiple physical
containers (packs) residing at various sites and storing replicas of the files of that
filegroup. These containers correspond to disk partitions. One of the packs is designated
as the primary copy. A file must be stored at the site of the primary copy and can be
stored at any subset of the other sites where there exists a pack corresponding to its
filegroup. Hence, the primary copy pack stores the filegroup completely, while the other
packs may be partial. The various copies of a file are assigned the same i-node number
within the filegroup, so a pack keeps an empty i-node slot for every file it does not store.
Data page numbers may be different on different packs; hence, references over the
network to data pages use logical page numbers rather than physical ones. Each pack has
a mapping of these logical numbers to physical numbers. To facilitate automatic
replication management, each i-node of a file copy contains a version number, which
determines which copy dominates the other copies.
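The sketch below restates the pack description in code form: each file copy carries a version number taken from its i-node, and each pack keeps its own mapping from logical to physical page numbers. The field names are illustrative rather than actual Locus data structures.

    # Sketch of a per-pack file copy: its own logical-to-physical page map plus a
    # version number used to decide which replica dominates.
    from dataclasses import dataclass, field

    @dataclass
    class FileCopy:
        version: int                              # taken from the copy's i-node
        logical_to_physical: dict = field(default_factory=dict)

        def physical_page(self, logical_page: int) -> int:
            # Network references use logical page numbers; each pack resolves
            # them against its own physical layout.
            return self.logical_to_physical[logical_page]

    def dominant_copy(copies):
        """The copy with the highest version number dominates the others."""
        return max(copies, key=lambda c: c.version)

    primary = FileCopy(version=7, logical_to_physical={0: 130, 1: 131})
    stale   = FileCopy(version=5, logical_to_physical={0: 42, 1: 57})
    assert dominant_copy([primary, stale]) is primary
    assert primary.physical_page(1) == 131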
Synchronization of Accesses to Files
Locus distinguishes among three logical roles in file accesses; each role may be
performed by a different site (a sketch of how these roles cooperate follows the list):
1. The Using Site (US) issues the requests to open and access a remote file.
2. The Storage Site (SS) is the site selected to serve those requests.
3. The Current Synchronization Site (CSS) enforces a global synchronization policy for a
filegroup and selects an SS for each open request referring to a file in that filegroup.
There is at most one CSS for each filegroup in any set of communicating sites. The
CSS maintains the version number and a list of physical containers for every file in
the filegroup.
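The following sketch shows one plausible interaction between these roles for an open request; the class names and the selection policy (serve a copy carrying the highest version number known in the partition) are assumptions consistent with the description above, not Locus code.

    # Hedged sketch: the US sends an open request, the CSS applies its policy and
    # picks an SS holding an up-to-date pack.

    class CSS:
        def __init__(self, filegroup):
            # file -> {site: version} for every physical container (pack)
            self.copies = {name: dict(packs) for name, packs in filegroup.items()}

        def select_storage_site(self, filename):
            packs = self.copies[filename]
            latest = max(packs.values())
            # Global policy: only serve a copy that carries the latest version
            # known in this partition.
            return next(site for site, ver in packs.items() if ver == latest)

    class UsingSite:
        def __init__(self, css):
            self.css = css

        def open(self, filename):
            ss = self.css.select_storage_site(filename)
            return f"open({filename}) served by storage site {ss}"

    css = CSS({"report.txt": {"siteA": 7, "siteB": 6}})
    print(UsingSite(css).open("report.txt"))    # served by siteA (version 7)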
Reconciliation of Filegroups at Partitioned Sites
The basic approach to achieving fault tolerance in Locus is to maintain, within a
single subnetwork, consistency among copies of a file. The policy is to allow updates
only in a partition that has the primary copy. It is guaranteed that the most recent version
of a file available in a partition is the one read. To achieve this, the system maintains a
commit count for each filegroup, enumerating each commit of every file in the filegroup.
The commit operation consists of moving the in-core i-node to the disk i-node. Each pack
has a lower-water mark (lwm), a commit-count value up to which the system guarantees
that all prior commits are reflected in the pack. The primary copy pack (stored at the
CSS) keeps, in secondary storage, a list enumerating the files in the filegroup and the
commit counts of the most recent commits. When a pack joins a partition, it attempts to
contact the CSS and checks whether its lwm is within the range of the recent commit list.
If so, the pack site schedules a kernel process that brings the pack to a consistent state by
copying only the files that reflect commits later than the site's lwm. If the CSS is not
available, writing is disallowed in this partition, but reading is possible after a new CSS
is chosen. The new CSS communicates with the partition members to keep itself informed
of the most recent available version of each file in the filegroup. Then other pack sites can
reconcile with it. As a result, all communicating sites see the same view of the filegroup,
and this view is as complete as possible given a particular partition. Since updates are
allowed only within the partition with the primary copy while reads are allowed in the rest
of the partitions, it is possible to read out-of-date replicas of a file. Thus Locus sacrifices
consistency for the ability to continue to both update and read files in a partitioned
environment.
When a pack is too far out of date (its lwm falls outside the recent commit list), the
system invokes an application-level process to bring the filegroup up to date. At this
point the system lacks sufficient knowledge of the most recent commits to identify the
missing updates, so the site inspects the entire i-node space to determine which files in
its pack are out of date.
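A small sketch of the reconciliation rule just described, under the assumption that the recent-commit list is kept as (commit count, filename) pairs: a rejoining pack whose lwm still falls inside the list copies only the later commits; a pack that has fallen behind the list triggers the full i-node scan.

    # Illustrative reconciliation check based on the lwm and the recent-commit list.

    def files_to_reconcile(pack_lwm, recent_commits):
        """
        recent_commits: list of (commit_count, filename) kept by the primary pack,
        ordered by commit count. Returns the files to copy, or None if the pack is
        too far behind and a full i-node scan is required.
        """
        oldest_remembered = recent_commits[0][0]
        if pack_lwm < oldest_remembered - 1:
            return None                                   # fall back to full scan
        return sorted({name for count, name in recent_commits if count > pack_lwm})

    recent = [(101, "a.dat"), (102, "b.dat"), (103, "a.dat")]
    print(files_to_reconcile(102, recent))   # ['a.dat']: only commits after the lwm
    print(files_to_reconcile(10, recent))    # None: pack too stale, full scan needed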
Summary
An overall profile of Locus can be summarized by the following issues.
• Distributed operating system: Due to the multiple dimensions of transparency in
Locus, it comes close to the definition of a distributed operating system, in contrast to
a collection of network services.
• Implementation Strategy: Kernel-level operation is the implementation strategy in
Locus, the common pattern being kernel-to-kernel communication via specialized
high-performance protocols.
• Replication: Locus uses a primary copy replication scheme, whose main merit is the
increased availability of directories, which exhibit a high read-to-write ratio.
• Access synchronization: UNIX semantics are emulated to the last detail in spite of
caching at multiple USs. Alternatively, locking facilities are provided.
• Fault tolerance: Among the fault tolerance mechanisms in Locus are an atomic update
facility, the merging of replicated packs after recovery, and a degree of independent
operation of partitions.
Conclusion:
In this chapter, we discussed the characteristics of the file system, which is a set of
services that can be provided to the client (user). We introduced a group of terms such as
NFS, the Network File System, which is a case study in this chapter, as is the Andrew
file system. The chapter started by studying file system characteristics and requirements.
In this section, we defined the file system role, file access granularity, file access type,
transparency, network transparency, mobile transparency, performance, fault tolerance,
and scalability. In the file model and organization section, we compared the directory
service, file service, and block service and gave examples showing the differences
between each service.
In the naming and transparency section, we discussed transparency issues related to
naming files, naming techniques, and implementation issues. Under naming files, we
presented naming transparency, network transparency, mobile transparency, and location
independence.
Later in the chapter, we defined the sharing semantics. The most common types of
sharing semantics are: Unix Semantics, Session Semantics, Immutable Shared Semantics
and Transaction Semantics.
Fault tolerance is an important attribute of a distributed system, so we devoted a section
of this chapter to methods for improving the fault tolerance of a DFS, which can be
summarized as improving availability and using redundant resources.
Due to its importance in distributed file systems, caching occupied a large part of the
discussion in this chapter. Caching is a common technique used to reduce the time it
takes for a computer to retrieve information. Ideally, recently accessed information is
stored in a cache so that subsequent repeat accesses to that same information can be
handled locally without additional access time or burden on network traffic. Designing a
cache system is not easy; many factors should be taken into consideration, such as the
cache unit size, cache location, and validation and consistency.
The chapter briefly goes over concurrency control and security issues, which will be
discussed in depth in the coming chapters, and ends by discussing some case studies such
as Sun NFS and the Andrew file system.
Questions:
1) What is a Distributed File system?
2) Briefly summarize the file system requirements.
3) What are the most common types of file sharing semantics?
4) Why is fault tolerance an important attribute of file systems?
5) Why do we need caching? What factors should be taken into consideration
when designing cache systems?
6) Briefly summarize the case studies provided in this chapter.
7) Present a file system that is not discussed in this chapter and compare it to
NFS and AFS.
References:
[Chu, 1969] W. W. Chu, "Optimal File Allocation in a Minicomputer Information
System," IEEE Transactions on Computers, C-18, No. 10, Oct. 1969.
[Gavish, 1990] Bezalel Gavish and Olivia R. Liu Sheng, "Dynamic File Migration in
Distributed Computer Systems," Communications of the ACM, Vol. 33, No. 2, Feb. 1990.
[Chung, 1991] Hsiao-Chung Cheng and Jang-Ping Sheu, "Design and Implementation of
a Distributed File System," Software-Practice and Experience, Vol. 21, No. 7, pp.
657-675, July 1991.
[Walker, 1983] B. Walker et al., "The LOCUS Distributed Operating System," Proc. of
the 1983 SIGOPS Conference, pp. 49-70, Oct. 1983.
[Coulouris and Dolimore, 1988] George F. Coulouris and Jean Dollimore, Distributed
Systems: Concepts and Design, Addison-Wesley, 1988, pp. 18-20.
[Nakano] X. Jia, H. Nakano, K. Shimizu, and M. Maekawa, "Highly Concurrent
Directory Management in Galaxy Distributed Systems," Proceedings of the International
Symposium on Databases in Parallel and Distributed Systems, pp. 416-426.
[Thompson, 1978] K. Thompson, "UNIX Implementation," Bell System Technical
Journal, Vol. 57, No. 6, Part 2, pp. 1931-1946, 1978.
[Needham, 1982] R.M. Needham and A.J. Herbert, The Cambridge Distributed
Computing System, Addison-Wesley, 1982.
[Lampson, 1981] B.W. Lampson, "Atomic Transactions," in Distributed Systems:
Architecture and Implementation, Springer-Verlag, 1981.
[Gifford] D.K. Gifford, "Violet: An Experimental Decentralized System," ACM
Operating Systems Review, Vol. 3, No. 5.
[Sun, 1988] "Network Programming," Sun Microsystems Inc., May 1988.
[Rifkin, 1987] A. Rifkin, R.L. Hamilton, M.P. Sabrio, S. Shah, and K. Yueh, "RFS
Architectural Overview," USENIX Conference Proceedings, 1987.
[Gould, 1987] Gould, "The Network File System Implemented on 4.3 BSD," USENIX
Conference Proceedings, 1987.
[Howard, 1988] J. Howard et al., "Scale and Performance in a Distributed File System,"
ACM Transactions on Computer Systems, Vol. 6, No. 1, pp. 51-81, Feb. 1988.
“The design of a capability based operating system”, Computer Journal, Vol.
29, No 4, P 289-300.
[Levy, 1990A] E. Levy and A. Silberschatz, "Distributed File Systems: Concepts and
Examples," ACM Computing Surveys, Vol. 22, pp. 321-374, Dec. 1990.
[Mullender, 1990] Sape J. Mullender and Guido van Rossum, "Amoeba: A Distributed
Operating System for the 1990s," CWI, Center for Mathematics and Computer Science.
[Needham, 1988] Gifford, D.K., Needham, R.M., and Schroeder, M.D., "The Cedar File
System," Communications of the ACM, Vol. 31, pp. 288-298, March 1988.
[Levy, 1990B] Levy, E., and Silberschatz, A., "Distributed File Systems: Concepts and
Examples," ACM Computing Surveys, Vol. 22, pp. 321-374, Dec. 1990.
[Tanenbaum, 1992] Andrew S. Tanenbaum, Modern Operating Systems, Prentice-Hall
Inc., Chapter 13, pp. 549-587, 1992.
[Nelson, 1988] Nelson, M.N., Welch, B.B., and Ousterhout, J.K., "Caching in the Sprite
Network File System," ACM Transactions on Computer Systems, Vol. 6, pp. 134-154,
Feb. 1988.
[Richard, 1997] Richard S. Vermut, "File Caching on the Internet: Technical
Infringement or Safeguard for Efficient Network Operation?" Journal of Intellectual
Property Law, Vol. 4, p. 273, 1997.