Microlink Information Technology College
Mekelle Branch
Department of Computer Science
Operating Systems
Prepared By: Tewdros Sisay (M.Sc. in Computer Science)
May 2008
Acronyms
ACL   Access Control List
CPU   Central Processing Unit
FCFS  First-Come-First-Served
FIFO  First-In-First-Out
I/O   Input/Output
LRU   Least Recently Used
LSI   Large Scale Integration
LWP   Lightweight Process
MRU   Most Recently Used
OS    Operating System
PC    Program Counter
PCB   Process Control Block
PFF   Page-Fault Frequency
RAM   Random Access Memory
SJF   Shortest-Job-First
SPN   Shortest-Process-Next
SRT   Shortest-Remaining-Time
TLB   Translation Lookaside Buffer
List of Tables
Table 4.1 Initial state of Banker’s algorithm
Table 4.2 Safe state of Banker’s algorithm
Table 4.3 Deadlock in Banker’s algorithm
Table 4.4 Unsafe state may not lead to deadlock
Table 4.5 Process scheduling exercise
Table 6.1 Wasteful allocation of disk space
List of Figures
Figure 3.1 Creation of thread vs process
Figure 3.2 Implementation of threads in Solaris 2
Figure 4.1 Critical section
Figure 5.1 Memory allocation
Figure 5.2 Paging in memory management
Figure 5.3 Allocation of pages in page frames
Figure 5.4 Page table management (a)
Figure 5.5 Page table management (b)
Figure 5.6 Page frame allocation vs page fault rate
Figure 5.7 Number of processes vs CPU utilization (a)
Figure 5.8 Number of processes vs CPU utilization (b)
Figure 5.9 CLOCK algorithm flowchart
Figure 5.10 Implementation of memory allocation in Multics
Figure 6.1 Structure of I/O system
Figure 6.2 Device I/O addressing
Figure 6.3 I/O and CPU processing
Figure 6.4 Configuration of Hard-disk
Preface
This teaching material is prepared to support the Operating System course offered in Computer
Science programs. It has been organized from textbooks, reference books, handouts prepared for the
course, Internet sources and other relevant materials.
It covers the basic design principles like the process concept, process management, inter-process
communication & synchronization, memory management, I/O management, file management, and
security. It also includes practical implementation examples and exercises at the relevant parts of the text.
Table of Contents
1. Introduction
1.1 Terminology
1.2 Computer Systems Operation
1.3 Evolution of Operating Systems
1.4 Operating System Structure
2. Overview of Operating Systems
2.1 Components of Operating Systems
2.2 Operating Systems Services
2.3 Characteristics of Operating Systems
3. Process Description
3.1 The Process Concept
3.2 Process States
3.3 Threads
3.4 Implementation
3.5 Exercises
4. Process Management
4.1 CPU/Process Scheduling
4.2 Interprocess Communication
4.3 Process Synchronization
4.4 Deadlock
4.5 Implementation
4.6 Exercises
5. Memory Management
5.1 Memory Allocation
5.2 Swapping
5.3 Paging
5.4 Virtual Memory
5.5 Segmentation
5.6 Implementation
5.7 Exercise
6. Device Management
6.1 I/O Devices
6.2 Device Addressing
6.3 Device Accesses
6.4 Overlapped I/O and CPU Processing
6.5 Disk as an Example Device
6.6 Disk Controller and Disk Device Driver
6.7 Exercises
7. File Management
7.1 General Concepts
7.2 File System Structure
7.3 Access Methods and Protection
7.4 Implementing File Systems
7.5 Implementation
7.6 Exercises
8. Protection and Security
8.1 User Security
8.2 Access Control Lists
8.3 Cryptography
8.4 Exercises
Bibliography
1. Introduction
1.1 Terminology
The 1960s definition of an operating system is "the software that controls the hardware". Today, however, due to microcode we need a better definition. We see an operating system as the programs that make the hardware usable. In brief, an operating system is the set of programs that controls a computer. Some examples of operating systems are UNIX, Mach, MS-DOS, MS-Windows, Windows NT, Chicago, OS/2, MacOS, VMS, MVS, and VM.
Controlling the computer involves software at several levels. We will differentiate kernel
services, library services, and application-level services, all of which are part of the operating
system. Processes run Applications, which are linked together with libraries that perform
standard services. The kernel supports the processes by providing a path to the peripheral
devices. The kernel responds to service calls from the processes and interrupts from the devices.
The core of the operating system is the kernel, a control program that functions in privileged
state, an execution context that allows all hardware instructions to be executed, reacting to
interrupts from external devices and to service requests and traps from processes. Generally, the
kernel is a permanent resident of the computer. It creates and terminates processes and responds to their requests for service.
1.2 Computer Systems Operation
Operating Systems are resource managers. The main resource is computer hardware in the form
of processors, storage, input/output devices, communication devices, and data. Some of the
operating system functions are: implementing the user interface, sharing hardware among users,
allowing users to share data among themselves, preventing users from interfering with one
another, scheduling resources among users, facilitating input/output, recovering from errors,
accounting for resource usage, facilitating parallel operations, organizing data for secure and
rapid access, and handling network communications.
1.3 Evolution of Operating Systems
Historically, operating systems have been tightly related to computer architecture, so it is a good idea to study the history of operating systems from the architecture of the computers on which they run.
Operating systems have evolved through a number of distinct phases or generations, which correspond roughly to decades.
The 1940's - First Generations
The earliest electronic digital computers had no operating systems. Machines of the time were so primitive that programs were often entered one bit at a time on rows of mechanical switches (plug boards). Programming languages were unknown (not even assembly languages). Operating systems were unheard of.
The 1950's - Second Generation
By the early 1950s, the routine had improved somewhat with the introduction of punched cards. The General Motors Research Laboratories implemented the first operating system in the early 1950s for their IBM 701. The systems of the 1950s generally ran one job at a time. These were called single-stream batch processing systems because programs and data were submitted in groups or batches.
The 1960's - Third Generation
The systems of the 1960's were also batch processing systems, but they were able to take better
advantage of the computer's resources by running several jobs at once. So operating systems
designers developed the concept of multiprogramming in which several jobs are in main memory
at once; a processor is switched from job to job as needed to keep several jobs advancing while
keeping the peripheral devices in use.
For example, on a system with no multiprogramming, when the current job paused to wait for an I/O operation to complete, the CPU simply sat idle until the I/O finished. The solution for
this problem that evolved was to partition memory into several pieces, with a different job in
each partition. While one job was waiting for I/O to complete, another job could be using the
CPU.
Another major feature of third-generation operating systems was the technique called spooling (simultaneous peripheral operations on line). In spooling, a high-speed device like a disk is interposed between a running program and a low-speed device involved with the program in input/output. Instead of writing directly to a printer, for example, outputs are written to the disk. Programs can run to completion faster, and other programs can be initiated sooner; when the printer becomes available, the outputs can be printed.
Note that spooling is much like thread being spun onto a spool so that it may later be unwound as needed.
Another feature present in this generation was the time-sharing technique, a variant of multiprogramming, in which each user has an on-line (i.e., directly connected) terminal. Because the user is present and interacting with the computer, the computer system must respond quickly to user requests; otherwise user productivity could suffer. Time-sharing systems were developed to multiprogram large numbers of simultaneous interactive users.
Fourth Generation
With the development of LSI (Large Scale Integration) circuits and chips, operating systems entered the personal computer and workstation age. Microprocessor technology evolved to the point that it became possible to build desktop computers as powerful as the mainframes of the 1970s. Two operating systems have dominated the personal computer scene: MS-DOS, written by Microsoft, Inc. for the IBM PC and other machines using the Intel 8088 CPU and its successors, and UNIX, which is dominant on the larger personal computers and workstations using the Motorola 68000 CPU family.
1.4 Operating System Structure
System Calls and System Programs
System calls provide an interface between a process and the operating system. System calls allow user-level processes to request services from the operating system that the process itself is not allowed to perform. In handling the trap, the operating system enters kernel mode, where it has access to privileged instructions, and can perform the desired service on behalf of the user-level process. It is because of the critical nature of these operations that the operating system itself performs them every time they are needed. For example, to perform I/O a process issues a system call telling the operating system to read or write a particular area, and this request is satisfied by the operating system.
System programs provide basic functionality to users so that they do not need to write their own environment for program development (editors, compilers) and program execution (shells). In some sense, they are bundles of useful system calls.
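As an illustration, here is a minimal sketch (not taken from the text above) of a user program performing I/O by trapping into the kernel through standard POSIX system calls; the file name is made up for the example.

/* Sketch: a user program requesting I/O services from the kernel
 * through the POSIX open/read/write/close system calls. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[256];
    int fd = open("input.txt", O_RDONLY);     /* trap into the kernel          */
    if (fd < 0)
        return 1;
    ssize_t n = read(fd, buf, sizeof buf);    /* kernel performs the device I/O */
    if (n > 0)
        write(STDOUT_FILENO, buf, (size_t)n); /* another service request        */
    close(fd);
    return 0;
}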
Layered Approach Design
In this case the system is easier to debug and modify, because changes affect only limited portions of the code, and a programmer does not have to know the details of the other layers.
Information is also kept only where it is needed and is accessible only in certain ways, so bugs
affecting that data are limited to a specific module or layer.
Mechanisms and Policies
The policy specifies what is to be done, while the mechanism specifies how it is to be done. For instance, the timer construct for ensuring CPU protection is a mechanism; on the other hand, the decision of how long the timer is set for a particular user is a policy decision.
The separation of mechanism and policy is important to provide flexibility to a system. If the
interface between mechanism and policy is well defined, the change of policy may affect only a
few parameters. On the other hand, if interface between these two is vague or not well defined, it
might involve much deeper change to the system.
Once the policy has been decided it gives the programmer the choice of using his/her own
implementation. Also, the underlying implementation may be changed for a more efficient one
without much trouble if the mechanism and policy are well defined. Specifically, separating
these two provides flexibility in a variety of ways. First, the same mechanism can be used to
implement a variety of policies, so changing the policy might not require the development of a
new mechanism, but just a change in parameters for that mechanism from a library of
mechanisms. Second, the mechanism can be changed, for example to increase its efficiency or to move to a new platform, without changing the overall policy.
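To make the separation concrete, here is an illustrative sketch (the type and function names are invented for this example, not part of any real kernel): the timer mechanism only knows how to arm a preemption interrupt, while the policy layer chooses the quantum.

/* Mechanism/policy separation, sketched with made-up names. */
typedef struct { int ticks; } timer_sketch;

/* Mechanism: HOW preemption is done -- arm the timer for q ticks. */
static void arm_timer(timer_sketch *t, int quantum_ticks)
{
    t->ticks = quantum_ticks;          /* a real kernel would program hardware here */
}

/* Policy: WHAT should be done -- pick the quantum for a class of user. */
static int quantum_for(int interactive)
{
    return interactive ? 10 : 100;     /* changing policy = changing a parameter */
}

void schedule_next(timer_sketch *t, int interactive)
{
    arm_timer(t, quantum_for(interactive));  /* same mechanism serves any policy */
}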
2. Overview of Operating Systems
2.1 Components of Operating Systems
Even though not all systems have the same structure, many modern operating systems share the goal of supporting the following types of system components.
Process Management. The operating system manages many kinds of activities ranging from user programs to system programs like the printer spooler, name servers, file servers, etc. Each of these activities is encapsulated in a process. A process includes the complete execution context, that is, the code, data, program counter, registers, OS resources in use, etc.
It is important to note that a process is not a program. A process is only ONE instance of a program in execution, and many processes can be running the same program. The five major activities of an operating system in regard to process management are:
- Creation and deletion of user and system processes.
- Suspension and resumption of processes.
- A mechanism for process synchronization.
- A mechanism for process communication.
- A mechanism for deadlock handling.
Main-Memory Management. Primary memory or main memory is a large array of words or bytes. Each word or byte has its own address. Main memory provides storage that can be accessed directly by the CPU; that is to say, for a program to be executed, it must be in main memory. The major activities of an operating system in regard to memory management are:
- Keeping track of which parts of memory are currently being used and by whom.
- Deciding which processes are loaded into memory when memory space becomes available.
- Allocating and deallocating memory space as needed.
File Management. A file is a collection of related information defined by its creator. Computers can store files on disk (secondary storage), which provides long-term storage. Some examples of storage media are magnetic tape, magnetic disk and optical disk. Each of these media has its own properties such as speed, capacity, data transfer rate and access method.
A file system is normally organized into directories to ease its use. These directories may contain files and other directories.
The five major activities of an operating system in regard to file management are (a minimal sketch of such primitives follows the list):
1. The creation and deletion of files.
2. The creation and deletion of directories.
3. The support of primitives for manipulating files and directories.
4. The mapping of files onto secondary storage.
5. The backup of files on stable storage media.
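The following sketch shows those primitives as seen by a user program, using standard POSIX calls; the file and directory names are made up for the example.

/* Sketch: creating, manipulating and deleting a file and a directory. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
    int fd = open("notes.txt", O_CREAT | O_WRONLY, 0644); /* create a file     */
    if (fd >= 0) {
        write(fd, "hello\n", 6);                          /* manipulate it     */
        close(fd);
    }
    mkdir("archive", 0755);                               /* create a directory */
    unlink("notes.txt");                                  /* delete the file    */
    rmdir("archive");                                     /* delete the directory */
    return 0;
}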
I/O System Management. The I/O subsystem hides the peculiarities of specific hardware devices from the user. Only the device driver knows the peculiarities of the specific device to which it is assigned.
Secondary-Storage Management. Generally speaking, systems have several levels of storage,
including primary storage, secondary storage and cache storage. Instructions and data must be
placed in primary storage or cache to be referenced by a running program. Because main
memory is too small to accommodate all data and programs, and its data are lost when power is
lost, the computer system must provide secondary storage to back up main memory. Secondary storage consists of tapes, disks, and other media designed to hold information that will eventually be accessed in primary storage. Storage (primary, secondary, cache) is ordinarily divided into bytes or words consisting of a fixed number of bytes. Each location in storage has an address; the set of all addresses available to a program is called an address space.
The three major activities of an operating system in regard to secondary storage management are:
1. Managing the free space available on the secondary-storage device.
2. Allocation of storage space when new files have to be written.
3. Scheduling the requests for memory access.
Networking. A distributed system is a collection of processors that do not share memory,
peripheral devices, or a clock. The processors communicate with one another through communication lines, called a network. The communication-network design must consider routing and connection strategies, and the problems of contention and security.
Protection System. If a computer system has multiple users and allows the concurrent execution
of multiple processes, then the various processes must be protected from one another's activities.
Protection refers to mechanism for controlling the access of programs, processes, or users to the
resources defined by a computer system.
Command Interpreter System. A command interpreter is an interface between the operating system and the user. The user gives commands, which are executed by the operating system (usually by turning them into system calls). The main function of a command interpreter is to get and execute the next user-specified command. The command interpreter is usually not part of the kernel, since multiple command interpreters (shells, in UNIX terminology) may be supported by an operating system, and they do not really need to run in kernel mode. There are two main advantages to separating the command interpreter from the kernel (a toy interpreter is sketched after this list):
1. If we want to change the way the command interpreter looks, i.e., change its interface, we can do so only if the command interpreter is separate from the kernel; if it were part of the kernel, changing its interface would mean changing kernel code.
2. If the command interpreter were part of the kernel, it would be possible for a malicious process to gain access to parts of the kernel that it should not have. To avoid this scenario, it is advantageous to keep the command interpreter separate from the kernel.
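A toy command interpreter, sketched with standard POSIX calls (an illustrative user-level program, not the shell of any particular system): it reads a command name, forks a child, execs the program and waits for it. For simplicity it handles a single command name with no arguments.

/* Sketch of a user-level command interpreter built from system calls. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    char line[256];
    for (;;) {
        printf("> ");
        fflush(stdout);
        if (fgets(line, sizeof line, stdin) == NULL)   /* end of input: quit  */
            break;
        line[strcspn(line, "\n")] = '\0';
        if (line[0] == '\0')
            continue;
        pid_t pid = fork();                 /* system call: create a process   */
        if (pid == 0) {
            execlp(line, line, (char *)NULL); /* system call: run the program  */
            perror("exec");                 /* reached only if exec fails      */
            _exit(1);
        }
        waitpid(pid, NULL, 0);              /* system call: wait for the child */
    }
    return 0;
}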
2.2 Operating Systems Services
Following are five services provided by operating systems for the convenience of users.
Program Execution. The purpose of a computer system is to allow the user to execute programs.
So the operating system provides an environment where the user can conveniently run programs.
The user does not have to worry about the memory allocation or multitasking or anything. These
things are taken care of by the operating systems.
Running a program involves allocating and deallocating memory, and CPU scheduling in the case of multiprocessing. These functions cannot be given to user-level programs, so user-level programs cannot help the user to run programs independently without help from the operating system.
I/O Operations. Each program requires input and produces output. This involves the use of I/O. The operating system hides from the user the details of the underlying hardware needed for the I/O. All the user sees is that the I/O has been performed, without the details. So, by providing I/O, the operating system makes it convenient for users to run programs.
For efficiency and protection, users cannot control I/O directly, so this service cannot be provided by user-level programs.
File System Manipulation. The output of a program may need to be written into new files or
input taken from some files. The operating systems provide this service. The user does not have
to worry about secondary storage management. The user gives a command for reading from or writing to a file and sees his or her task accomplished. Thus operating systems make it easier for user programs to accomplish their task.
This service involves secondary storage management. The speed of I/O that depends on secondary storage management is critical to the speed of many programs, and hence it is better to let the operating system manage it than to give individual users control of it. It is not difficult for user-level programs to provide these services, but for the above-mentioned reasons it is best if this service is left to the operating system.
Communications. There are instances where processes need to communicate with each other to
exchange information. It may be between processes running on the same computer or running on
the different computers. By providing this service the operating system relieves the user of the
worry of passing messages between processes. In cases where messages need to be passed to processes on other computers through a network, this can be done by user programs. The user program may be customized to the specifics of the hardware through which the message transits and provide the service interface to the operating system.
Error Detection. An error in one part of the system may cause malfunctioning of the complete system. To avoid such a situation the operating system constantly monitors the system to detect errors. This relieves the user of the worry of errors propagating to various parts of the system and causing malfunctions.
This service cannot be left to user programs, because it involves monitoring, and in some cases altering, areas of memory, deallocating memory of a faulty process, or perhaps relinquishing the CPU from a process that goes into an infinite loop. These tasks are too critical to be handed over to user programs. A user program, if given these privileges, could interfere with the correct (normal) operation of the operating system.
2.3 Characteristics of Operating Systems
Modern operating systems generally have the following three major goals. Operating systems generally accomplish these goals by running processes in low privilege and providing service calls that invoke the operating system kernel in a high-privilege state.
To hide the details of the hardware by creating abstractions. An abstraction is software that hides lower-level details and provides a set of higher-level functions. An operating system transforms the physical world of devices, instructions, memory, and time into a virtual world that is the result of abstractions built by the operating system. There are several reasons for abstraction. First, the code needed to control peripheral devices is not standardized. Operating systems provide subroutines called device drivers that perform operations on behalf of programs, for example input/output operations. Second, the operating system introduces new functions as it abstracts the hardware. For instance, the operating system introduces the file abstraction so that programs do not
have to deal with disks. Third, the operating system transforms the computer hardware into
multiple virtual computers, each belonging to a different program. Each program that is running
is called a process. Each process views the hardware through the lens of abstraction.
Fourth, the operating system can enforce security through abstraction.
To allocate resources to processes. An operating system controls how processes (the active agents) may access resources (the passive entities).
To provide a pleasant and effective user interface. The user interacts with the operating system through the user interface and is usually interested in the "look and feel" of the operating system.
The most important components of the user interface are the command interpreter, the file
system, on-line help, and application integration. The recent trend has been toward increasingly
integrated graphical user interfaces that encompass the activities of multiple processes on
networks of computers.
One can view operating systems from two points of view: resource manager and extended machine. From the resource-manager point of view, operating systems manage the different parts of the system efficiently; from the extended-machine point of view, operating systems provide a virtual machine to users that is more convenient to use. Structurally, operating systems can be designed as a monolithic system, a hierarchy of layers, a virtual machine system, an exokernel, or using the client-server model. The basic concepts of operating systems are processes, memory management, I/O management, the file system, and security.
3. Process Description
The notion of process is central to the understanding of operating systems. There are quite a few
definitions presented in the literature, but no "perfect" definition has yet appeared.
3.1 The Process Concept
The term "process" was first used by the designers of MULTICS in the 1960s. Since then, the term process has been used somewhat interchangeably with 'task' or 'job'. The process has been given many definitions, for instance:
- A program in execution.
- An asynchronous activity.
- The 'animated spirit' of a procedure in execution.
- The entity to which processors are assigned.
- The 'dispatchable' unit.
Many more definitions have been given. As we can see from the above, there is no universally agreed-upon definition, but "a program in execution" seems to be the one most frequently used, and it is the concept we will use in the present study of operating systems.
Now that we have agreed upon a definition of process, the question is: what is the relation between a process and a program? Is it the same beast with a different name, so that when the beast is sleeping (not executing) it is called a program and when it is executing it becomes a process? To be precise, a process is not the same as a program. In the following discussion we point out some of the differences between the two.
A process is more than the program code. A process is an 'active' entity, as opposed to a program, which is considered a 'passive' entity. A program is an algorithm expressed in some suitable notation (e.g., a programming language). Being passive, a program is only a part of a process. A process, on the other hand, includes:
- The current value of the program counter (PC).
- The contents of the processor's registers.
- The values of the variables.
- The process stack (SP), which typically contains temporary data such as subroutine parameters, return addresses, and temporary variables.
- A data section that contains global variables.
A process is the unit of work in a system.
In the process model, all software on the computer is organized into a number of sequential processes. A process includes the PC, registers, and variables. Conceptually, each process has its own virtual CPU. In reality, the CPU switches back and forth among processes (this rapid switching back and forth is called multiprogramming).
3.2 Process States
The process state consists of everything necessary to resume the process's execution if it is somehow put aside temporarily. The process state consists of at least the following:
- Code for the program.
- The program's static data.
- The program's dynamic data.
- The program's procedure call stack.
- Contents of the general-purpose registers.
- Contents of the program counter (PC).
- Contents of the program status word (PSW).
- Operating system resources in use.
A process goes through a series of discrete process states:
- New state: the process is being created.
- Running state: a process is said to be running if it has the CPU, that is, it is actually using the CPU at that particular instant.
- Blocked (or waiting) state: a process is said to be blocked if it is waiting for some event to happen, such as an I/O completion, before it can proceed. Note that a blocked process is unable to run until some external event happens.
- Ready state: a process is said to be ready if it could use a CPU if one were available. A ready process is runnable but temporarily stopped to let another process run.
- Terminated state: the process has finished execution.
The basic Process Operations are process creation and process termination. The details of these
operations are described below.
Process Creation. In general-purpose systems, some way is needed to create processes as needed during operation. There are four principal events that lead to process creation:
- System initialization.
- Execution of a process-creation system call by a running process.
- A user request to create a new process.
- Initiation of a batch job.
Foreground processes interact with users. Background processes stay in the background, sleeping, but spring to life to handle activity such as email, web pages, printing, and so on. Background processes are called daemons.
A process may create a new process by means of a process-creation call such as fork, which creates an exact clone of the calling process. When it does so, the creating process is called the parent process and the created one is called the child process. Only one parent is needed to create a child process; note that, unlike plants and animals that use sexual reproduction, a process has only one parent. This creation of processes yields a hierarchical structure (a process tree): each child has only one parent, but each parent may have many children. After the fork, the two processes, the parent and the child, have the same memory image, the same environment strings and the same open files. After a process is created, both the parent and child have their own distinct address space: if either process changes a word in its address space, the change is not visible to the other process (a minimal fork sketch appears after the list of reasons below).
Following are some reasons for the creation of a process:
- A user logs on.
- A user starts a program.
- The operating system creates a process to provide a service, e.g., to manage a printer.
- Some program starts another process, e.g., Netscape calls xv to display a picture.
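A minimal fork sketch (fork is the standard UNIX call; the variable and the printed output are made up for illustration), showing that parent and child receive distinct copies of the address space:

/* Sketch: the child's change to 'x' is not visible to the parent. */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int x = 0;
    pid_t pid = fork();              /* clone the calling process        */
    if (pid == 0) {                  /* child */
        x = 42;                      /* modifies only the child's copy   */
        printf("child:  x = %d\n", x);
        return 0;
    }
    waitpid(pid, NULL, 0);           /* parent waits for its child       */
    printf("parent: x = %d\n", x);   /* still 0 in the parent            */
    return 0;
}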
Process Termination. A process terminates when it finishes executing its last statement. Its resources are returned to the system, it is purged from any system lists or tables, and its process control block (PCB) is erased, i.e., the PCB's memory space is returned to a free memory pool. A process may also be terminated by another process or by the operating system, usually for one of the following reasons:
- Normal exit. Most processes terminate because they have done their work. This call is exit in UNIX.
- Error exit. The process discovers a fatal error; for example, a user tries to compile a program file that does not exist.
- Fatal error. An error is caused by the process due to a bug in the program, for example executing an illegal instruction, referencing non-existent memory, or dividing by zero.
- Killed by another process. A process executes a system call telling the operating system to terminate some other process. In UNIX, this call is kill. In some systems, when a process is killed, all the processes it created are killed as well (UNIX does not work this way).
Process States. A process goes through a series of discrete process states.
- New state. The process is being created.
- Terminated state. The process has finished execution.
- Blocked (waiting) state. When a process blocks, it does so because logically it cannot continue, typically because it is waiting for input that is not yet available. Formally, a process is said to be blocked if it is waiting for some event to happen (such as an I/O completion) before it can proceed. In this state a process is unable to run until some external event happens.
- Running state. A process is said to be running if it currently has the CPU, that is, it is actually using the CPU at that particular instant.
- Ready state. A process is said to be ready if it could use a CPU if one were available. It is runnable but temporarily stopped to let another process run.
Logically, the 'Running' and 'Ready' states are similar. In both cases the process is willing to run, only in the case of the 'Ready' state there is temporarily no CPU available for it. The 'Blocked' state is different from the 'Running' and 'Ready' states in that the process cannot run, even if a CPU is available.
Process State Transitions. The following are the six possible transitions among the above-mentioned five states (a small sketch encoding them follows the list).
- Transition 1 occurs when a process discovers that it cannot continue. If the running process initiates an I/O operation before its allotted time expires, it voluntarily relinquishes the CPU. This state transition is: Block (process-name): Running → Blocked.
- Transition 2 occurs when the scheduler decides that the running process has run long enough and it is time to let another process have CPU time. This state transition is: Time-Run-Out (process-name): Running → Ready.
- Transition 3 occurs when all other processes have had their share and it is time for the first process to run again. This state transition is: Dispatch (process-name): Ready → Running.
- Transition 4 occurs when the external event for which a process was waiting (such as arrival of input) happens. This state transition is: Wakeup (process-name): Blocked → Ready.
- Transition 5 occurs when the process is created. This state transition is: Admitted (process-name): New → Ready.
- Transition 6 occurs when the process has finished execution. This state transition is: Exit (process-name): Running → Terminated.
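A small illustrative sketch (the type and function names are invented for this example) encoding the five states and the six legal transitions listed above:

/* Sketch: the six legal state transitions as a simple check function. */
typedef enum { NEW, READY, RUNNING, BLOCKED, TERMINATED } pstate;

/* Returns 1 if moving from 'from' to 'to' is one of the six transitions. */
int legal_transition(pstate from, pstate to)
{
    return (from == NEW     && to == READY)      ||  /* admitted     */
           (from == READY   && to == RUNNING)    ||  /* dispatch     */
           (from == RUNNING && to == READY)      ||  /* time-run-out */
           (from == RUNNING && to == BLOCKED)    ||  /* block (I/O)  */
           (from == BLOCKED && to == READY)      ||  /* wakeup       */
           (from == RUNNING && to == TERMINATED);    /* exit         */
}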
Process Control Block. A process in an operating system is represented by a data structure known as a process control block (PCB) or process descriptor. The PCB contains important information about the specific process, including:
- The current state of the process, i.e., whether it is ready, running, waiting, or whatever.
- A unique identification of the process, in order to track "which is which".
- A pointer to the parent process.
- Similarly, a pointer to the child process (if it exists).
- The priority of the process (a part of the CPU scheduling information).
- Pointers to locate the memory of the process.
- A register save area.
- The processor it is running on.
The PCB is a central store of information that allows the operating system to locate key information about a process. Thus, the PCB is the data structure that defines a process to the operating system.
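A hypothetical, much-simplified PCB layout mirroring the fields listed above (real kernels keep considerably more state; Linux's task_struct, for instance, is far larger):

/* Sketch of a simplified process control block. */
typedef struct pcb {
    int           pid;            /* unique identification                */
    int           state;          /* ready, running, waiting, ...         */
    struct pcb   *parent;         /* pointer to the parent process        */
    struct pcb   *first_child;    /* pointer to a child process           */
    int           priority;       /* CPU-scheduling information           */
    void         *page_table;     /* pointers to the process's memory     */
    unsigned long registers[32];  /* register save area (incl. PC, SP)    */
    int           cpu;            /* processor it last ran on             */
} pcb;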
3.3 Threads
Despite the fact that a thread must execute within a process, the process and its associated threads are different concepts. Processes are used to group resources together; threads are the entities scheduled for execution on the CPU.
A thread is a single sequential stream of execution within a process. Because threads have some of the properties of processes, they are sometimes called lightweight processes. Within a process, threads allow multiple streams of execution. In many respects, threads are a popular way to improve applications through parallelism. The CPU switches rapidly back and forth among the threads, giving the illusion that the threads are running in parallel. Like a traditional process (i.e., a process with one thread), a thread can be in any of several states (Running, Blocked, Ready or Terminated).
Each thread has its own stack. Since a thread will generally call different procedures and thus have a different execution history, each thread needs its own stack. In an operating system that has a thread facility, the basic unit of CPU utilization is a thread. A thread has, or consists of, a program counter (PC), a register set, and a stack space. Threads are not independent of one another the way processes are; as a result, a thread shares with the other threads of its process (also known as a task) the code section, data section, and OS resources such as open files and signals.
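A minimal sketch with POSIX threads (pthread_create/pthread_join; the worker function and the counter are made up for the example): two threads in one process share the global data section, while each has its own stack and program counter.

/* Sketch: two threads sharing the process's data section.
 * Compile with: cc -pthread ... */
#include <pthread.h>
#include <stdio.h>

static int shared_counter = 0;                      /* data section: shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    int local = *(int *)arg;                        /* lives on this thread's own stack */
    pthread_mutex_lock(&lock);
    shared_counter += local;                        /* shared global, so access is synchronized */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    int a = 1, b = 2;
    pthread_create(&t1, NULL, worker, &a);
    pthread_create(&t2, NULL, worker, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_counter = %d\n", shared_counter); /* both updates are visible: prints 3 */
    return 0;
}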
Processes vs. Threads. As mentioned earlier, in many respects threads operate in the same way as processes. Some of the similarities and differences are:
Similarities
- Like processes, threads share the CPU and only one thread is active (running) at a time.
- Like processes, threads within a process execute sequentially.
- Like processes, a thread can create children.
- Like processes, if one thread is blocked, another thread can run.
Differences
- Unlike processes, threads are not independent of one another.
- Unlike processes, all threads can access every address in the task.
- Unlike processes, threads are designed to assist one another. Note that processes may or may not assist one another, because processes may originate from different users.
Following are some reasons why threads are used in designing operating systems.
1. A process with multiple threads makes a great server, for example a print server.
2. Because threads can share common data, they do not need to use interprocess communication.
3. By their very nature, threads can take advantage of multiprocessors.
Threads are cheap in the sense that:
1. They only need a stack and storage for registers; therefore, threads are cheap to create.
2. Threads use very few resources of the operating system in which they are working. That is, threads do not need a new address space, global data, program code or operating system resources.
3. Context switching is fast when working with threads, because only the PC, SP and registers have to be saved and/or restored.
But this cheapness does not come for free: the biggest drawback is that there is no protection between threads.
User-Level Threads. User-level threads are implemented in user-level libraries, rather than via system calls, so thread switching does not need to call the operating system or cause an interrupt to the kernel. In fact, the kernel knows nothing about user-level threads and manages them as if they were single-threaded processes.
Advantages. The most obvious advantage of this technique is that a user-level threads package can be implemented on an operating system that does not support threads. Some other advantages are:
- User-level threads do not require modification of the operating system.
- Simple representation: each thread is represented simply by a PC, registers, a stack and a small control block, all stored in the user process's address space.
- Simple management: creating a thread, switching between threads and synchronizing between threads can all be done without intervention of the kernel.
- Fast and efficient: thread switching is not much more expensive than a procedure call.
Disadvantages. There is a lack of coordination between the threads and the operating system kernel. Therefore, the process as a whole gets one time slice irrespective of whether it has one thread or 1000 threads within it. It is up to each thread to relinquish control to the other threads.
User-level threads also require non-blocking system calls, i.e., a multithreaded kernel. Otherwise, the entire process will be blocked in the kernel, even if there are runnable threads left in the process. For example, if one thread causes a page fault, the whole process blocks.
Kernel-Level Threads. In this method, the kernel knows about and manages the threads. No runtime system is needed in this case. Instead of a thread table in each process, the kernel has a thread table that keeps track of all the threads in the system. In addition, the kernel also maintains the traditional process table to keep track of processes. The operating system kernel provides system calls to create and manage threads.
Advantages. Because the kernel has full knowledge of all threads, the scheduler may decide to give more time to a process having a large number of threads than to a process having a small number of threads. Kernel-level threads are especially good for applications that frequently block.
Disadvantages. Kernel-level threads are slow and inefficient. For instance, thread operations are hundreds of times slower than those of user-level threads. Since the kernel must manage and schedule threads as well as processes, it requires a full thread control block (TCB) for each thread to maintain information about it. As a result, there is significant overhead and increased kernel complexity.
Advantages of Threads over Multiple Processes.
- Context switching. Threads are very inexpensive to create and destroy, and they are inexpensive to represent. For example, they require space to store the PC, the SP, and the general-purpose registers, but they do not require space for memory-management information, information about open files or I/O devices in use, etc. With so little context, it is much faster to switch between threads; in other words, a context switch using threads is relatively easy.
- Sharing. Threads allow the sharing of many resources that cannot be shared between processes, for example the code section, the data section, and operating system resources such as open files.
Disadvantages of Threads over Multiple Processes.
- Blocking. The major disadvantage is that if the kernel is single-threaded, a system call by one thread will block the whole process, and the CPU may be idle during the blocking period.
- Security. Since there is extensive sharing among threads, there is a potential security problem. It is quite possible that one thread overwrites the stack of another thread (or damages shared data), although it is very unlikely since threads are meant to cooperate on a single task.
Applications that Benefit from Threads
A proxy server satisfying the requests of a number of computers on a LAN would benefit from a multi-threaded process. In general, any program that has to do more than one task at a time can benefit from multithreading. For example, a program that reads input, processes it, and writes output could have three threads, one for each task.
Applications that Cannot Benefit from Threads
Any sequential process that cannot be divided into parallel tasks will not benefit from threads, as the tasks would block until the previous one completes. For example, a program that displays the time of day would not benefit from multiple threads.
Resources used in Thread Creation and Process Creation
When a new thread is created it shares its code section, data section and operating system
resources like open files with other threads. But it is allocated its own stack, register set and a
program counter.
The creation of a new process differs from that of a thread mainly in the fact that all the resources a thread shares must be allocated explicitly for each new process.
Figure 3.1 Creation of thread vs process
So even though two processes may be running the same piece of code, they need to have their own copy of the code in main memory to be able to run. Two processes also do not share other
resources with each other. This makes the creation of a new process very costly compared to that
of a new thread.
Context Switch
To give each process on a multi-programmed machine a fair share of the CPU, a hardware clock
generates interrupts periodically. This allows the operating system to schedule all processes in
main memory (using scheduling algorithm) to run on the CPU at equal intervals. Each time a
clock interrupt occurs, the interrupt handler checks how much time the current running process
has used. If it has used up its entire time slice, then the CPU scheduling algorithm (in kernel)
picks a different process to run. Each switch of the CPU from one process to another is called a
context switch.
Major Steps of Context Switching
- The values of the CPU registers are saved in the process table of the process that was running just before the clock interrupt occurred.
- The registers are loaded from the process picked by the CPU scheduler to run next.
In a multi-programmed uni-processor computing system, context switches occur frequently
enough that all processes appear to be running concurrently. If a process has more than one
thread, the Operating System can use the context switching technique to schedule the threads so
they appear to execute in parallel. This is the case if threads are implemented at the kernel level.
Threads can also be implemented entirely at the user level in run-time libraries. Since in this case
no thread scheduling is provided by the Operating System, it is the responsibility of the
programmer to yield the CPU frequently enough in each thread so all threads in the process can
make progress.
Action of Kernel to Context Switch Among Threads
The threads share a lot of resources with other peer threads belonging to the same process. So a
context switch among threads for the same process is easy. It involves switch of register set, the
program counter and the stack. It is relatively easy for the kernel to accomplish this task.
Action of kernel to Context Switch Among Processes
Context switches among processes are expensive. Before a process can be switched its process
control block (PCB) must be saved by the operating system. The PCB consists of the following
information:
- The process state.
- The program counter, PC.
- The values of the different registers.
- The CPU scheduling information for the process.
- Memory management information regarding the process.
- Possible accounting information for this process.
- I/O status information of the process.
When the PCB of the currently executing process has been saved, the operating system loads the PCB of the next process that is to run on the CPU. This is a heavy task and it takes a lot of time.
3.4 Implementation
The Solaris 2 operating system is a multithreaded operating environment with threads at the user level, intermediate level and kernel level. It also supports symmetric multiprocessing and real-time scheduling. The entire thread system in Solaris is depicted in the following figure.
Figure 3.2 Implementation of threads in Solaris 2
At user level
- The user-level threads are supported by a library for creation and scheduling, and the kernel knows nothing of these threads.
- These user-level threads are supported by lightweight processes (LWPs). Each LWP is connected to exactly one kernel-level thread, whereas a user-level thread is independent of the kernel.
- Many user-level threads may cooperate on one task. These threads may be scheduled and switched among LWPs without intervention of the kernel.
- User-level threads are extremely efficient because no kernel involvement is needed to block one thread and start another running.
Resource needs of user-level threads
- A user-level thread needs only a stack and a program counter; absolutely no kernel resources are required.
- Since the kernel is not involved in scheduling these user-level threads, switching among user-level threads is fast and efficient.
At intermediate level
The lightweight processes (LWPs) are located between the user-level threads and the kernel-level threads. These LWPs serve as "virtual CPUs" on which user-level threads can run. Each task contains at least one LWP. The user-level threads are multiplexed on the LWPs of the process.
Resource needs of LWPs
An LWP contains a process control block (PCB) with register data, accounting information and memory information. Therefore, switching between LWPs requires quite a bit of work, and LWPs are relatively slow compared to user-level threads.
At kernel level
The standard kernel-level threads execute all operations within the kernel. There is a kernel-level thread for each LWP, and there are some threads that run only on the kernel's behalf and have no associated LWP, for example a thread to service disk requests. By request, a kernel-level thread can be pinned to a processor (CPU); see the rightmost thread in the figure. The kernel-level threads are scheduled by the kernel's scheduler. In modern Solaris 2, a task no longer must block just because a kernel-level thread blocks; the processor (CPU) is free to run another thread.
Resource needs of kernel-level threads
A kernel thread has only a small data structure and a stack. Switching between kernel threads does not require changing memory-access information, and therefore kernel-level threads are relatively fast and efficient.
3.5 Exercises
1. Palm OS provides no means of concurrent processing. Discuss three major complications
that concurrent processing adds to an operating system.
Answer:
(a) A method of time sharing must be implemented to allow each of several processes to
have access to the system. This method involves the preemption of processes that do not
voluntarily give up the CPU and the kernel being reentrant.
(b) Processes and system resources must have protections and must be protected from each
other. Any given process must be limited in the amount of memory it can use and the
operations it can perform on devices like disks.
(c) Care must be taken in the kernel to prevent deadlocks between processes, so processes
aren’t waiting for each other’s allocated resources.
2. When a process in the Linux OS creates a new process using the fork() operation, which of the following are shared between the parent process and the child process: the stack, the heap, or shared memory segments?
Answer:
Only the shared memory segments are shared between the parent process and the newly
forked child process. Copies of the stack and the heap are made for the newly created
process.
3. The Sun UltraSPARC processor has multiple register sets. Describe the actions of a context
switch if the new context is already loaded into one of the register sets. What else must
happen if the new context is in memory rather than in a register set and all the register sets
are in use?
Answer:
The CPU current-register-set pointer is changed to point to the set containing the new
context, which takes very little time. If the context is in memory, one of the contexts in a
register set must be chosen and be moved to memory, and the new context must be loaded
from memory into the set. This process takes a little more time than on systems with one set
of registers, depending on how a replacement victim is selected.
4. Provide two programming examples in which multithreading provides better performance
than a single-threaded solution.
Answer:
(a) A Web server that services each request in a separate thread.
(b) A parallelized application such as matrix multiplication where different parts of the
matrix may be worked on in parallel.
5. What are the two differences between user-level threads and kernel-level threads? Under
what circumstances is one type better than the other?
Answer:
(a) User-level threads are unknown by the kernel, whereas the kernel is aware of kernel
threads.
(b) On systems using either M:1 or M:N mapping, user threads are scheduled by the thread
library and the kernel schedules kernel threads.
(c) Kernel threads need not be associated with a process whereas every user thread belongs
to a process. Kernel threads are generally more expensive to maintain than user threads as
they must be represented with a kernel data structure.
4. Process Management
4.1 CPU/Process Scheduling
The assignment of physical processors to processes allows processors to accomplish work. The problem of determining when processors should be assigned, and to which processes, is called processor scheduling or CPU scheduling.
When more than one process is runnable, the operating system must decide which one to run first. The part of the operating system concerned with this decision is called the scheduler, and the algorithm it uses is called the scheduling algorithm.
Goals of scheduling (objectives). Many objectives must be considered in the design of a scheduling discipline. In particular, a scheduler should consider fairness, efficiency, response time, turnaround time, throughput, etc. Some of these goals depend on the system one is using, for example a batch system, an interactive system or a real-time system, but there are also goals that are desirable in all systems. These goals are described below.
- Fairness. Fairness is important under all circumstances. A scheduler makes sure that each process gets its fair share of the CPU and no process suffers indefinite postponement. Note that giving equivalent or equal time is not necessarily fair; think of safety control and payroll at a nuclear plant.
- Policy enforcement. The scheduler has to make sure that the system's policy is enforced. For example, if the local policy is safety, then the safety-control processes must be able to run whenever they want to, even if it means a delay in the payroll processes.
- Efficiency. The scheduler should keep the system (or in particular the CPU) busy one hundred percent of the time when possible. If the CPU and all the input/output devices can be kept running all the time, more work gets done per second than if some components are idle.
- Response time. A scheduler should minimize the response time for interactive users.
- Turnaround. A scheduler should minimize the time batch users must wait for output.
- Throughput. A scheduler should maximize the number of jobs processed per unit time.
A little thought will show that some of these goals are contradictory. It can be shown that any scheduling algorithm that favors some class of jobs hurts another class of jobs. The amount of CPU time available is finite, after all.
Preemptive Vs Non-preemptive Scheduling
The Scheduling algorithms can be divided into two categories with respect to how they deal with
clock interrupts.
Non-preemptive Scheduling. A scheduling discipline is non-preemptive if, once a process has
been given the CPU, the CPU cannot be taken away from that process.
Following are some characteristics of non-preemptive scheduling:
 In a non-preemptive system, short jobs are made to wait by longer jobs, but the overall
treatment of all processes is fair.
 In a non-preemptive system, response times are more predictable because incoming high
priority jobs cannot displace waiting jobs.
 In non-preemptive scheduling, the scheduler dispatches a job when a process switches from
the running state to the waiting state, or when a process terminates.
Preemptive Scheduling. A scheduling discipline is preemptive if the CPU can be taken away from
a process to which it has been given. The strategy of allowing processes that are logically runnable
to be temporarily suspended is called preemptive scheduling, and it is in contrast to the "run to
completion" method.
Scheduling Algorithms
There are many process scheduling algorithms. Some of them are described below.
 First-Come-First-Served (FCFS) Scheduling. Other names of this algorithm are First-In-First-Out
(FIFO), Run-to-Completion, and Run-Until-Done. First-Come-First-Served is perhaps the
simplest scheduling algorithm. Processes are dispatched according to their
arrival time on the ready queue. Being a non-preemptive discipline, once a process has the
CPU, it runs to completion. FCFS scheduling is fair in the formal or human sense
of fairness, but it is unfair in the sense that long jobs make short jobs wait and unimportant
jobs make important jobs wait. FCFS is more predictable than most other schemes since the
order of service is known in advance. The FCFS scheme is not useful in scheduling interactive users
because it cannot guarantee good response time. The code for FCFS scheduling is simple to write and
understand. One of the major drawbacks of this scheme is that the average waiting time is often quite
long.
The First-Come-First-Served algorithm is rarely used as a master scheme in modern
operating systems but it is often embedded within other schemes.
 Round Robin (RR) Scheduling. One of the oldest, simplest, fairest and most widely used
algorithms is round robin (RR). In round robin scheduling, processes are dispatched in a FIFO
manner but are given a limited amount of CPU time called a time-slice or a quantum.
If a process does not complete before its CPU-time expires, the CPU is preempted and given
to the next process waiting in a queue. The preempted process is then placed at the back of
the ready list.
Round Robin Scheduling is preemptive (at the end of time-slice) therefore it is effective in
time-sharing environments in which the system needs to guarantee reasonable response times
for interactive users.
The only interesting issue with the round robin scheme is the length of the quantum. Setting the
quantum too short causes too many context switches and lowers the CPU efficiency. On the
other hand, setting the quantum too long may cause poor response time and approximates
FCFS. In any event, the average waiting time under round robin scheduling is often quite
long. (A small round-robin simulation sketch is given at the end of this list of algorithms.)
 Shortest-Job-First (SJF) Scheduling. Another name for this algorithm is Shortest-Process-Next
(SPN). Shortest-Job-First (SJF) is a non-preemptive discipline in which the waiting job (or
process) with the smallest estimated run-time-to-completion is run next. In other words,
when the CPU is available, it is assigned to the process that has the smallest next CPU burst.
SJF scheduling is especially appropriate for batch jobs for which the run times are known in
advance. Since the SJF scheduling algorithm gives the minimum average waiting time for a given set of
processes, it is provably optimal.
The SJF algorithm favors short jobs (or processes) at the expense of longer ones. The obvious
problem with the SJF scheme is that it requires precise knowledge of how long a job or process will
run, and this information is not usually available. The best the SJF algorithm can do is to rely on user
estimates of run times.
In the production environment where the same jobs run regularly, it may be possible to provide
reasonable estimate of run time, based on the past performance of the process. But in the
development environment users rarely know how their program will execute.
Like FCFS, SJF is non-preemptive; therefore, it is not useful in a timesharing environment in which
reasonable response time must be guaranteed.
 Shortest-Remaining-Time (SRT) Scheduling. SRT is the preemptive counterpart of
SJF and is useful in time-sharing environments.
In SRT scheduling, the process with the smallest estimated run-time to completion is run next,
including new arrivals. In the SJF scheme, once a job begins executing, it runs to completion. In the
SRT scheme, a running process may be preempted by a newly arriving process with a shorter
estimated run-time.
The SRT algorithm has higher overhead than its counterpart SJF. SRT must keep track of
the elapsed time of the running process and must handle occasional preemptions.
In this scheme, newly arriving small processes will run almost immediately. However, longer jobs
have an even longer mean waiting time.
 Priority Scheduling. The basic idea is straightforward: each process is assigned a priority,
and the runnable process with the highest priority is allowed to run. Equal-priority processes are
scheduled in FCFS order. The Shortest-Job-First (SJF) algorithm is a special case of the general
priority scheduling algorithm. An SJF algorithm is simply a priority algorithm where the priority
is the inverse of the (predicted) next CPU burst. That is, the longer the CPU burst, the lower the
priority, and vice versa.
Priority can be defined either internally or externally. Internally defined priorities use some
measurable quantity or quality to compute the priority of a process.
Examples of internal priorities are time limits, memory requirements, file requirements (for
example, the number of open files), and CPU versus I/O requirements.
Externally defined priorities are set by criteria that are external to the operating system, such as the
importance of the process, the type or amount of funds being paid for computer use, the department
sponsoring the work, and politics.
Priority scheduling can be either preemptive or non-preemptive. A preemptive priority algorithm
will preempt the CPU if the priority of the newly arrived process is higher than the priority of
the currently running process.
A non-preemptive priority algorithm will simply put the new process at the head of the ready
queue. A major problem with priority scheduling is indefinite blocking or starvation. A solution
to the problem of indefinite blockage of the low-priority process is aging. Aging is a technique
of gradually increasing the priority of processes that wait in the system for a long period of time.
 Multilevel Queue Scheduling. A multilevel queue scheduling algorithm partitions the ready
queue into several separate queues. Processes are permanently assigned to one queue,
based on some property of the process, such as memory size, process priority or process
type.
The algorithm chooses the process from the occupied queue that has the highest priority, and runs
that process either preemptively or non-preemptively. Each queue has its own scheduling algorithm
or policy.
Possibility I. If each queue has absolute priority over lower-priority queues, then no process in a
lower-priority queue can run unless the higher-priority queues are all empty. For example,
no process in the batch queue could run unless the queues for system
processes, interactive processes, and interactive editing processes were all empty.
Possibility II. If there is a time slice between the queues, then each queue gets a certain amount of
CPU time, which it can then schedule among the processes in its queue. For instance, 80% of
the CPU time might go to the foreground queue using RR and 20% of the CPU time to the
background queue using FCFS.
Since processes do not move between queues, this policy has the advantage of low scheduling
overhead, but it is inflexible.
 Multilevel Feedback Queue Scheduling. The multilevel feedback queue scheduling algorithm
allows a process to move between queues. It uses many ready queues and associates a
different priority with each queue.
The algorithm chooses the process with the highest priority from the occupied queues and runs that
process either preemptively or non-preemptively. If the process uses too much CPU time, it will be
moved to a lower-priority queue. Similarly, a process that waits too long in a lower-priority
queue may be moved to a higher-priority queue. Note
that this form of aging prevents starvation.
For example, a process entering the ready queue is placed in queue 0. If it does not finish within
an 8-millisecond quantum, it is moved to the tail of queue 1. If it does not complete there, it is
preempted and placed in queue 2. Processes in queue 2 run on an FCFS basis, but only when
queue 0 and queue 1 are empty.
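The time-slice behavior of round robin, referred to above, can be illustrated with a short
simulation. The sketch below is illustrative only: the process names, burst times and quantum are
made up, and the code is not taken from any particular operating system.

import java.util.ArrayDeque;
import java.util.Queue;

// A small round-robin simulation sketch: run each ready process for at most one
// quantum, then move it to the back of the ready list if it is not finished.
public class RoundRobinDemo {
    public static void main(String[] args) {
        String[] names = {"P1", "P2", "P3"};
        int[] remaining = {8, 4, 1};          // remaining CPU time of each process
        int quantum = 2;                      // time-slice length

        Queue<Integer> ready = new ArrayDeque<>();
        for (int i = 0; i < names.length; i++) ready.add(i);

        int time = 0;
        while (!ready.isEmpty()) {
            int p = ready.remove();
            int run = Math.min(quantum, remaining[p]);
            time += run;
            remaining[p] -= run;
            if (remaining[p] > 0) {
                ready.add(p);                 // preempted: back of the ready list
            } else {
                System.out.println(names[p] + " finishes at time " + time);
            }
        }
    }
}

With a quantum of 2, P3 finishes at time 5, P2 at time 9 and P1 at time 13; a shorter quantum
would interleave the processes more finely at the cost of more context switches.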
4.2 Interprocess Communication
Since processes frequently need to communicate with other processes, there is a need
for well-structured communication among processes that does not rely on interrupts.
Race Conditions
In operating systems, processes that are working together share some common storage (main
memory, files, etc.) that each process can read and write. Situations in which two or more processes
are reading or writing some shared data and the final result depends on who runs precisely when are
called race conditions. Concurrently executing threads that share data need to synchronize their
operations and processing in order to avoid race conditions on shared data. Only one
thread at a time should be allowed to examine and update a shared variable.
Race conditions are also possible in Operating Systems. If the ready queue is implemented as a
linked list and if the ready queue is being manipulated during the handling of an interrupt, then
interrupts must be disabled to prevent another interrupt from occurring before the first one completes. If
interrupts are not disabled, then the linked list could become corrupted.
Critical Section
Figure 4.1 Critical section
The key to preventing trouble involving shared storage is to find some way to prohibit more than
one process from reading and writing the shared data simultaneously. That part of the program
where the shared memory is accessed is called the critical section. To avoid race conditions and
flawed results, one must identify the code that forms a critical section in each thread. The characteristic
properties of the code that forms a critical section are:
 Code that references one or more variables in a “read-update-write” fashion while any of
those variables is possibly being altered by another thread.
 Code that alters one or more variables that are possibly being referenced in a “read-update-write”
fashion by another thread.
 Code that uses a data structure while any part of it is possibly being altered by another thread.
 Code that alters any part of a data structure while it is possibly in use by another thread.
Here, the important point is that when one process is executing shared modifiable data in its
critical section, no other process is to be allowed to execute in its critical section. Thus, the
execution of critical sections by the processes is mutually exclusive in time.
Mutual Exclusion
A way of making sure that if one process is using a shared modifiable data, the other processes
will be excluded from doing the same thing.
Formally, while one process is executing the shared variable, all other processes desiring to do so at
the same moment should be kept waiting; when that process has finished executing the
shared variable, one of the processes waiting to do so should be allowed to proceed. In this
fashion, each process executing the shared data (variables) excludes all others from doing so
simultaneously. This is called mutual exclusion.
Note that mutual exclusion needs to be enforced only when processes access shared modifiable
data - when processes are performing operations that do not conflict with one another they
should be allowed to proceed concurrently.
Mutual Exclusion Conditions
If we could arrange matters such that no two processes were ever in their critical sections
simultaneously, we could avoid race conditions. We need four conditions to hold to have a good
solution for the critical section problem (mutual exclusion).
 No two processes may be inside their critical sections at the same moment.
 No assumptions are made about relative speeds of processes or the number of CPUs.
 No process outside its critical section should block other processes.
 No process should have to wait arbitrarily long to enter its critical section.
4.3 Process Synchronization
The mutual exclusion problem is to devise a pre-protocol (or entry protocol) and a post-protocol
(or exit protocol) to keep two or more threads from being in their critical sections at the same
time. Tanenbaum examines several proposals for the critical-section (mutual exclusion) problem.
Problem. When one process is updating shared modifiable data in its critical section, no other
process should be allowed to enter its critical section.
Proposal 1 -Disabling Interrupts (Hardware Solution)
Each process disables all interrupts just after entering its critical section and re-enables all
interrupts just before leaving the critical section. With interrupts turned off, the CPU cannot be
switched to another process. Hence, no other process will enter its critical section, and mutual
exclusion is achieved.
Disabling interrupts is sometimes a useful technique within the
kernel of an operating system, but it is not appropriate as a general mutual exclusion mechanism
for user processes. The reason is that it is unwise to give user processes the power to turn off
interrupts.
Proposal 2 - Lock Variable (Software Solution)
In this solution, we consider a single, shared (lock) variable, initially 0. When a process wants to
enter its critical section, it first tests the lock. If the lock is 0, the process sets it to 1 and then
enters the critical section. If the lock is already 1, the process just waits until the (lock) variable
becomes 0. Thus, a 0 means that no process is in its critical section, and a 1 means hold your
horses - some process is in its critical section.
The flaw in this proposal can be best explained by example. Suppose process A sees that the lock
is 0. Before it can set the lock to 1 another process B is scheduled, runs, and sets the lock to 1.
When the process A runs again, it will also set the lock to 1, and two processes will be in their
critical section simultaneously.
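The flaw above disappears if testing and setting the lock happen as one indivisible step. As a
sketch (not one of the classical proposals discussed in this text), Java's
AtomicBoolean.compareAndSet provides such an atomic test-and-set; the class name SpinLock is
illustrative.

import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: the lock-variable idea made safe by an atomic test-and-set.
// compareAndSet(false, true) checks and sets the lock in one indivisible step,
// so two processes can no longer both see 0 and both enter the critical section.
public class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void enterCriticalSection() {
        while (!locked.compareAndSet(false, true)) {
            // busy-wait (spin) until the lock becomes free
        }
    }

    public void leaveCriticalSection() {
        locked.set(false);
    }
}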
Proposal 3 - Strict Alternation
In this proposed solution, the integer variable 'turn' keeps track of whose turn it is to enter the
critical section. Initially, process 0 inspects turn, finds it to be 0, and enters its critical section.
Process 1 also finds it to be 0 and sits in a loop continually testing 'turn' to see when it becomes
1. Continuously testing a variable while waiting for some value to appear is called busy-waiting.
Taking turns is not a good idea when one of the processes is much slower than the other.
Suppose process 0 finishes its critical section quickly, so both processes are now in their
noncritical sections. This situation violates condition 3 mentioned above.
Using System calls 'sleep' and 'wakeup'
Basically, what the above-mentioned solutions do is this: when a process wants to enter its
critical section, it checks to see if entry is allowed. If it is not, the process goes into a tight loop
and waits (i.e., starts busy waiting) until it is allowed to enter. This approach wastes CPU time.
Now we look at an interprocess communication primitive: the pair of system calls sleep and wakeup.
 Sleep. It is a system call that causes the caller to block, that is, be suspended until some
other process wakes it up.
 Wakeup. It is a system call that wakes up the process.
Both 'sleep' and 'wakeup' system calls have one parameter that represents a memory address used
to match up 'sleeps' and 'wakeups'.
The Bounded Buffer Producers and Consumers
The bounded-buffer producers and consumers assume that there is a fixed buffer size, i.e., a finite
number of slots is available.
Statement. To suspend the producers when the buffer is full, to suspend the consumers when the
buffer is empty, and to make sure that only one process at a time manipulates a buffer so there
are no race conditions or lost updates.
As an example of how sleep-wakeup system calls are used, consider the producer-consumer
problem, also known as the bounded buffer problem.
Two processes share a common, fixed-size (bounded) buffer. The producer puts information into
the buffer and the consumer takes information out.
Trouble arises when
 The producer wants to put new data in the buffer, but the buffer is already full.
Solution: the producer goes to sleep, to be awakened when the consumer has removed
data.
 The consumer wants to remove data from the buffer, but the buffer is already empty.
Solution: the consumer goes to sleep until the producer puts some data in the buffer and wakes
the consumer up.
This approach also leads to the same race conditions we have seen in earlier approaches. A race
condition can occur because access to the shared count of items in the buffer is unconstrained. The
essence of the problem is that a wakeup call sent to a process that is not (yet) sleeping is lost.
Semaphores
E.W. Dijkstra (1965) abstracted the key notion of mutual exclusion in his concepts of
semaphores.
A semaphore is a protected variable whose value can be accessed and altered only by the
operations P and V and an initialization operation ('semaphore initialize').
Binary semaphores can assume only the value 0 or the value 1; counting semaphores (also called
general semaphores) can assume any nonnegative value.
The P (or wait or sleep or down) operation on semaphores S, written as P(S) or wait (S), operates
as follows:
P(S): IF S > 0
THEN S := S – 1
ELSE (wait on S)
The V (or signal or wakeup or up) operation on semaphore S, written as V(S) or signal (S),
operates as follows:
V(S): IF (one or more process are waiting on S)
THEN (let one of these processes proceed)
ELSE S := S +1
Operations P and V are done as single, indivisible, atomic actions. It is guaranteed that once a
semaphore operation has started, no other process can access the semaphore until the operation has
completed. Mutual exclusion on the semaphore, S, is enforced within P(S) and V(S).
If several processes attempt a P(S) simultaneously, only one process will be allowed to proceed.
The other processes will be kept waiting, but the implementation of P and V guarantees that
processes will not suffer indefinite postponement. Semaphores solve the lost-wakeup problem.
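In a modern thread library, P and V correspond to acquire and release operations on a counting
semaphore. The following sketch shows the idea using java.util.concurrent.Semaphore; it is one
possible realization of the definitions above, and the class and method names other than the
Semaphore API itself are illustrative.

import java.util.concurrent.Semaphore;

// Sketch: P(S) and V(S) expressed with java.util.concurrent.Semaphore.
public class MutexDemo {
    static final Semaphore S = new Semaphore(1);   // binary semaphore, initially 1

    static void criticalSection(String who) throws InterruptedException {
        S.acquire();                               // P(S): wait / down
        try {
            System.out.println(who + " is in its critical section");
        } finally {
            S.release();                           // V(S): signal / up
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(() -> { try { criticalSection("A"); } catch (InterruptedException e) { } });
        Thread b = new Thread(() -> { try { criticalSection("B"); } catch (InterruptedException e) { } });
        a.start(); b.start();
        a.join(); b.join();
    }
}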
Producer-Consumer Problem Using Semaphores
The solution to the producer-consumer problem uses three semaphores, namely full, empty and
mutex.
The semaphore 'full' is used for counting the number of slots in the buffer that are full. The
'empty' for counting the number of slots that are empty and semaphore 'mutex' to make sure that
the producer and consumer do not access modifiable shared section of the buffer simultaneously.
Initialization
Set full buffer slots to 0, i.e., semaphore full = 0.
Set empty buffer slots to N, i.e., semaphore empty = N.
To control access to the critical section, set mutex to 1, i.e., semaphore mutex = 1.
Producer ( )
  WHILE (true)
    produce-Item ( );
    P (empty);
    P (mutex);
    enter-Item ( );
    V (mutex);
    V (full);
Consumer ( )
  WHILE (true)
    P (full);
    P (mutex);
    remove-Item ( );
    V (mutex);
    V (empty);
    consume-Item (Item);
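A runnable sketch of the same solution is given below, using java.util.concurrent.Semaphore. The
buffer size N, the item values and the class name are illustrative assumptions; the semaphore
discipline (P on empty/full, mutex around the buffer) follows the pseudocode above.

import java.util.concurrent.Semaphore;

// A minimal bounded-buffer sketch with one producer and one consumer.
public class BoundedBuffer {
    static final int N = 5;
    static final int[] buffer = new int[N];
    static int in = 0, out = 0;

    static final Semaphore empty = new Semaphore(N); // counts empty slots
    static final Semaphore full = new Semaphore(0);  // counts full slots
    static final Semaphore mutex = new Semaphore(1); // protects the buffer

    public static void main(String[] args) {
        Thread producer = new Thread(() -> {
            for (int item = 0; item < 20; item++) {
                try {
                    empty.acquire();          // P(empty)
                    mutex.acquire();          // P(mutex)
                    buffer[in] = item;        // enter-Item
                    in = (in + 1) % N;
                    mutex.release();          // V(mutex)
                    full.release();           // V(full)
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        Thread consumer = new Thread(() -> {
            for (int i = 0; i < 20; i++) {
                try {
                    full.acquire();           // P(full)
                    mutex.acquire();          // P(mutex)
                    int item = buffer[out];   // remove-Item
                    out = (out + 1) % N;
                    mutex.release();          // V(mutex)
                    empty.release();          // V(empty)
                    System.out.println("consumed " + item);  // consume-Item
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        producer.start();
        consumer.start();
    }
}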
4.4 Deadlock
A set of processes is in a deadlock state if each process in the set is waiting for an event that can be
caused only by another process in the set. In other words, each member of the set of deadlocked
processes is waiting for a resource that can be released only by another deadlocked process. None of
the processes can run, none of them can release any resources, and none of them can be awakened. It
is important to note that the number of processes and the number and kind of resources possessed
and requested are unimportant.
The resources may be either physical or logical. Examples of physical resources are printers,
tape drives, memory space, and CPU cycles. Examples of logical resources are files,
semaphores, and monitors.
The simplest example of deadlock is where process 1 has been allocated non-shareable resource
A, say, a tape drive, and process 2 has been allocated non-sharable resource B, say, a printer. Now, if
it turns out that process 1 needs resource B (the printer) to proceed and process 2 needs resource A
(the tape drive) to proceed, and these are the only two processes in the system, each is blocked by
the other and all useful work in the system stops. This situation is termed deadlock. The system
is in a deadlock state because each process holds a resource being requested by the other process,
and neither process is willing to release the resource it holds.
Preemptable and Non-preemptable Resources.
Resources come in two flavors: preemptable and non-preemptable. A preemptable resource is
one that can be taken away from the process with no ill effects. Memory is an example of a
preemptable resource. On the other hand, a non-preemptable resource is one that cannot be taken
away from a process without causing ill effects. A CD recorder, for example, is not preemptable
at an arbitrary moment.
Reallocating resources can resolve deadlocks that involve preemptable resources. Deadlocks that
involve non-preemptable resources are difficult to deal with.
Dealing with Deadlock Problem
In general, there are four strategies for dealing with the deadlock problem:
 The Ostrich Approach. Just ignore the deadlock problem altogether.
 Deadlock Detection and Recovery. Detect deadlock and, when it occurs, take steps to
recover.
 Deadlock Avoidance. Avoid deadlock by careful resource scheduling.
 Deadlock Prevention. Prevent deadlock by resource scheduling so as to negate at least
one of the four necessary conditions.
Deadlock Prevention
Havender in his pioneering work showed that since all four of the conditions are necessary for
deadlock to occur, it follows that deadlock might be prevented by denying any one of the
conditions.
 Elimination of “Mutual Exclusion” Condition. The mutual exclusion condition must hold
for non-sharable resources. That is, several processes cannot simultaneously share a
single resource. This condition is difficult to eliminate because some resources, such as
the tape drive and printer, are inherently non-shareable. Note that shareable resources like
a read-only file do not require mutually exclusive access and thus cannot be involved in
deadlock.
 Elimination of “Hold and Wait” Condition. There are two possibilities for elimination of
the second condition. The first alternative is that a process be granted all of the
resources it needs at once, prior to execution. The second alternative is to disallow a
process from requesting resources whenever it has previously allocated resources. This
strategy requires that all of the resources a process will need must be requested at once.
The system must grant resources on an “all or none” basis. If the complete set of resources
needed by a process is not currently available, then the process must wait until the
complete set is available. While the process waits, however, it may not hold any
resources. Thus the “wait for” condition is denied and deadlocks simply cannot occur.
This strategy can lead to serious waste of resources. For example, a program requiring ten
tape drives must request and receive all ten drives before it begins executing. If the
program needs only one tape drive to begin execution and does not need the
remaining tape drives for several hours, then substantial computer resources (nine tape
drives) will sit idle for several hours. This strategy can also cause indefinite postponement
(starvation), since not all the required resources may become available at once.
 Elimination of “No-preemption” Condition. The no-preemption condition can be
alleviated by forcing a process waiting for a resource that cannot immediately be
allocated to relinquish all of its currently held resources, so that other processes may use
them to finish. Suppose a system does allow processes to hold resources while requesting
additional resources. Consider what happens when a request cannot be satisfied. A
process holds resources a second process may need in order to proceed, while the second
process may hold the resources needed by the first process. This is a deadlock. This
strategy requires that when a process that is holding some resources is denied a request
for additional resources, the process must release its held resources and, if necessary,
request them again together with the additional resources. Implementation of this strategy
effectively denies the “no-preemption” condition.
When a process releases its resources, it may lose all its work to that point. One
serious consequence of this strategy is the possibility of indefinite postponement
(starvation): a process might be held off indefinitely as it repeatedly requests and
releases the same resources.
 Elimination of “Circular Wait” Condition. The last condition, the circular wait, can be
denied by imposing a total ordering on all of the resource types and then forcing all
processes to request the resources in order (increasing or decreasing). This strategy
imposes a total ordering of all resource types, and requires that each process requests
resources in numerical order (increasing or decreasing) of enumeration. With this rule,
the resource allocation graph can never have a cycle.
Now the rule is this: processes can request resources whenever they want to, but all
requests must be made in numerical order. A process may request first a printer and then a
tape drive (order: 2, 4), but it may not request first a plotter and then a printer (order: 3,
2). The problem with this strategy is that it may be impossible to find an ordering that
satisfies everyone.
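As a small sketch of the resource-ordering rule, the fragment below always acquires the
lower-numbered resource first, so a circular wait between the two threads cannot arise. The
resource objects and their numbering are illustrative.

// Sketch: imposing a total order on resources (circular-wait elimination).
// Both threads always lock `printer` (order 2) before `tapeDrive` (order 4).
public class LockOrdering {
    static final Object printer = new Object();    // resource number 2
    static final Object tapeDrive = new Object();  // resource number 4

    static void job(String name) {
        synchronized (printer) {          // always request the lower-numbered resource first
            synchronized (tapeDrive) {
                System.out.println(name + " holds both resources");
            }
        }
    }

    public static void main(String[] args) {
        new Thread(() -> job("process 1")).start();
        new Thread(() -> job("process 2")).start();
    }
}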
Deadlock Avoidance
This approach to the deadlock problem anticipates deadlock before it actually occurs. This
approach employs an algorithm to assess the possibility that deadlock could occur and acts
accordingly. This method differs from deadlock prevention, which guarantees that deadlock
cannot occur by denying one of the necessary conditions of deadlock.
If the necessary conditions for a deadlock are in place, it is still possible to avoid deadlock by
being careful when resources are allocated. Perhaps the most famous deadlock avoidance
algorithm, due to Dijkstra [1965], is the Banker’s algorithm.
Banker’s Algorithm
The Banker's algorithm is usually explained with a banking analogy: customers are equivalent to
processes, units of credit are equivalent to resources like disk space, and the banker is equivalent
to the operating system.
Customer   Used   Max
A          0      6
B          0      5
C          0      4
D          0      7
Available Units: 10
Table 4.1 Initial state of Banker’s algorithm
In Table 4.1, we see four customers, each of whom has been granted a maximum number of credit
units. The banker has reserved only 10 units rather than 22 units to service them. At a certain
moment, the situation becomes the one shown in Table 4.2.
Customer   Used   Max
A          1      6
B          1      5
C          2      4
D          4      7
Available Units: 2
Table 4.2 Safe state of Banker’s algorithm
Safe state. The key to a state being safe is that there is at least one way for all customers to finish. In
the banker analogy, the state of Table 4.2 is safe because, with 2 units left, the banker can delay
any request except C's, thus letting C finish and release all four of its units. With four units in
hand, the banker can let either D or B have the necessary units, and so on.
Unsafe state. Consider what would happen if a request from B for one more unit were granted in
the state of Table 4.2; we would then have the situation shown in Table 4.3, which is an unsafe state.
If all the customers, namely A, B, C, and D, asked for their maximum loans, then the banker could
not satisfy any of them, and we would have a deadlock.
Customer   Used   Max
A          1      6
B          2      5
C          2      4
D          4      7
Available Units: 1
Table 4.3 Deadlock in Banker’s algorithm
Important Note. It is important to note that an unsafe state does not imply the existence, or even
the eventual existence, of a deadlock. What an unsafe state does imply is simply that some
unfortunate sequence of events might lead to a deadlock.
The Banker's algorithm is thus to consider each request as it occurs and see if granting it leads to
a safe state. If it does, the request is granted; otherwise, it is postponed until later. Habermann [1969]
has shown that executing the algorithm has complexity proportional to N^2, where N is the
number of processes, and since the algorithm is executed each time a resource request occurs, the
overhead is significant.
Deadlock Detection
Deadlock detection is the process of actually determining that a deadlock exists and identifying
the processes and resources involved in the deadlock.
The basic idea is to check allocations against resource availability for all possible allocation
sequences to determine whether the system is in a deadlocked state. Of course, the deadlock
detection algorithm is only half of this strategy. Once a deadlock is detected, there needs to be a
way to recover. Several alternatives exist:
 Temporarily preempt resources from deadlocked processes.
 Back off a process to some checkpoint, allowing preemption of a needed resource and
restarting the process at the checkpoint later.
 Successively kill processes until the system is deadlock free.
These methods are expensive in the sense that each iteration calls the detection algorithm until
the system proves to be deadlock free. The complexity of the algorithm is O(N^2), where N is the
number of processes. Another potential problem is starvation: the same process may be killed
repeatedly.
4.5 Implementation
Case Study 1. Solaris, Windows XP, and Linux implement multiple locking mechanisms because
these operating systems provide different locking mechanisms depending on the application
developers’ needs. Spinlocks are useful for multiprocessor systems where a thread can run in a
busy-loop (for a short period of time) rather than incurring the overhead of being put in a sleep
queue. Mutexes are useful for locking resources. Solaris 2 uses adaptive mutexes, meaning that
the mutex is implemented with a spin lock on multiprocessor machines. Semaphores and
condition variables are more appropriate tools for synchronization when a resource must be held
for a long period of time, since spinning is inefficient for a long duration.
Case Study 2. Suppose that a system is in an unsafe state. An algorithm that checks whether it is
possible for the processes to complete their execution without entering a deadlock state is
depicted below.
An unsafe state may not necessarily lead to deadlock; it just means that we cannot guarantee that
deadlock will not occur. Thus, it is possible that a system in an unsafe state may still allow all
processes to complete without deadlock occurring.
Consider the situation where a system has 12 resources allocated among processes P0, P1, and
P2. The resources are allocated according to the following policy:
Process   Max   Current   Need
P0        10    5         5
P1        4     2         2
P2        9     3         6
Table 4.4 Unsafe state may not lead to deadlock
Implementation of the above mentioned scenario is described as below.
// Safety-check sketch: n processes, m resource types.
// work[]           - instances of each resource currently available
// need[j][k]       - remaining demand of process j for resource k
// allocation[j][k] - instances of resource k currently held by process j (the
//                    "Current" column of Table 4.4)
// finish[j]        - true once process j is known to be able to finish
for (int i = 0; i < n; i++) {
    // find a process that can finish with the currently available resources
    for (int j = 0; j < n; j++) {
        if (!finish[j]) {
            boolean canFinish = true;
            for (int k = 0; k < m; k++) {
                if (need[j][k] > work[k])
                    canFinish = false;
            }
            if (canFinish) { // this process can finish
                finish[j] = true;
                // it eventually releases its currently held resources back to the pool
                for (int x = 0; x < m; x++)
                    work[x] += allocation[j][x];
            }
        }
    }
}
// The state is safe if and only if finish[j] is true for every j.
Currently there are two resources available. This system is in an unsafe state as process P1 could
complete, thereby freeing a total of four resources. But we cannot guarantee that processes P0
and P2 can complete. However, it is possible that a process may release resources before
requesting any further. For example, process P2 could release a resource, thereby increasing the
total number of resources to five. This allows process P0 to complete, which would free a total of
nine resources, thereby allowing process P2 to complete as well.
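A minimal driver for the safety check above, filled in with the numbers from Table 4.4 (12
resources of a single type, 2 currently available), might look as follows. The class and variable
names are illustrative.

// Sketch: running the safety check on the state of Table 4.4.
public class SafetyCheckDemo {
    public static void main(String[] args) {
        int n = 3, m = 1;                       // 3 processes, 1 resource type
        int[] work = {2};                       // available instances
        int[][] allocation = {{5}, {2}, {3}};   // current holdings of P0, P1, P2
        int[][] need = {{5}, {2}, {6}};         // remaining maximum demands
        boolean[] finish = new boolean[n];

        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (!finish[j]) {
                    boolean canFinish = true;
                    for (int k = 0; k < m; k++)
                        if (need[j][k] > work[k]) canFinish = false;
                    if (canFinish) {
                        finish[j] = true;
                        for (int x = 0; x < m; x++)
                            work[x] += allocation[j][x];
                    }
                }
            }
        }

        boolean safe = true;
        for (boolean f : finish) safe = safe && f;
        System.out.println(safe ? "state is safe" : "state is unsafe");
        // With these numbers only P1 can finish, so this prints "state is unsafe".
    }
}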
4.6 Exercises
1. A CPU scheduling algorithm determines an order for the execution of its scheduled
processes. Given n processes to be scheduled on one processor, how many possible different
schedules are there? Give a formula in terms of n.
Answer:
n! (n factorial = n × (n – 1) × (n – 2) × ... × 2 × 1).
2. Define the difference between preemptive and non-preemptive scheduling.
Answer:
Preemptive scheduling allows a process to be interrupted in the midst of its execution, taking
the CPU away and allocating it to another process. Non-preemptive scheduling ensures that a
process relinquishes control of the CPU only when it finishes with its current CPU burst.
3. Suppose that the following processes arrive for execution at the times indicated. Each process
will run the listed amount of time. In answering the questions, use non-preemptive
scheduling and base all decisions on the information you have at the time the decision must
be made.
Process   Arrival Time   Burst Time
P1        0.0            8
P2        0.4            4
P3        1.0            1
Table 4.5 Process scheduling exercise
(a) What is the average turnaround time for these processes with the FCFS scheduling
algorithm?
(b) What is the average turnaround time for these processes with the SJF scheduling
algorithm?
(c) The SJF algorithm is supposed to improve performance, but notice that we chose to run
process P1 at time 0 because we did not know that two shorter processes would arrive soon.
Compute what the average turnaround time will be if the CPU is left idle for the first 1 unit
and then SJF scheduling is used. Remember that processes P1 and P2 are waiting during this
idle time, so their waiting time may increase. This algorithm could be known as future
knowledge scheduling.
Answer:
a. 10.53
b. 9.53
c. 6.86
Remember that turnaround time is finishing time minus arrival time, so you have to subtract
the arrival times to compute the turnaround times. FCFS gives 11 if you forget to subtract arrival
times. (A small arithmetic check for these numbers is sketched at the end of this exercise set.)
4. What advantage is there in having different time-quantum sizes on different levels of a
multilevel queuing system?
Answer:
Processes that need more frequent servicing, for instance, interactive processes such as
editors, can be in a queue with a small time quantum. Processes with no need for frequent
servicing can be in a queue with a larger quantum, requiring fewer context switches to
complete the processing, and thus making more efficient use of the computer.
5. Suppose that a scheduling algorithm (at the level of short-term CPU scheduling) favors those
processes that have used the least processor time in the recent past. Why will this algorithm
favor I/O-bound programs and yet not permanently starve CPU-bound programs?
Answer:
It will favor the I/O-bound programs because of the relatively short CPU burst request by
them; however, the CPU-bound programs will not starve because the I/O-bound programs
will relinquish the CPU relatively often to do their I/O.
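As promised in the answer to exercise 3, a small arithmetic check of those turnaround-time
figures is sketched below; the schedules in the comments follow the reasoning given in that answer,
and the class name is illustrative.

// Sketch: turnaround time = finishing time - arrival time, averaged over P1-P3.
public class TurnaroundCheck {
    public static void main(String[] args) {
        // (a) FCFS: P1 runs 0-8, P2 runs 8-12, P3 runs 12-13
        double fcfs = ((8 - 0.0) + (12 - 0.4) + (13 - 1.0)) / 3;
        // (b) non-preemptive SJF: P1 runs 0-8, then P3 (shortest) 8-9, then P2 9-13
        double sjf = ((8 - 0.0) + (13 - 0.4) + (9 - 1.0)) / 3;
        // (c) idle until t = 1, then SJF with full knowledge: P3 1-2, P2 2-6, P1 6-14
        double future = ((14 - 0.0) + (6 - 0.4) + (2 - 1.0)) / 3;
        System.out.printf("FCFS=%.2f SJF=%.2f future=%.2f%n", fcfs, sjf, future);
        // matches the answers above (10.53, 9.53 and about 6.87) up to rounding
    }
}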
5. Memory Management
5.1 Memory Allocation
We first consider how to manage main (``core'') memory (also called random-access memory
(RAM)). In general, a memory manager provides two operations: Address allocate (int size); and
void deallocate (Address block);
The procedure allocate receives a request for a contiguous block of size bytes of memory and
returns a pointer to such a block. The procedure deallocate releases the indicated block, returning
it to the free pool for reuse. Sometimes a third procedure is also provided, Address
reallocate(Address block, int new_size); which takes an allocated block and changes its size,
either returning part of it to the free pool or extending it to a larger block. It may not always be
possible to grow the block without copying it to a new location, so reallocate returns the new
address of the block.
Memory allocators are used in a variety of situations. In UNIX, each process has a data segment.
There is a system call to make the data segment bigger, but no system call to make it smaller.
Also, the system call is quite expensive. Therefore, there are library procedures (called malloc,
free, and realloc) to manage this space. Only when malloc or realloc runs out of space is it
necessary to make the system call. The C++ operators new and delete are just dressed-up
versions of malloc and free. The Java operator new also uses malloc, and the Java runtime
system calls free when an object is found to be inaccessible during garbage collection.
The operating system also uses a memory allocator to manage space used for OS data structures
and given to ``user'' processes for their own use. As we saw before, there are several reasons why
we might want multiple processes, such as serving multiple interactive users or controlling
multiple devices. There is also a ``selfish'' reason why the OS wants to have multiple processes
in memory at the same time: to keep the CPU busy. Suppose there are n processes in memory
(this is called the level of multiprogramming) and each process is blocked (waiting for I/O) a
fraction p of the time. In the best case, when they ``take turns'' being blocked, the CPU will be
100% busy provided n(1-p) >= 1. For example, if each process is ready 20% of the time, p = 0.8
and the CPU could be kept completely busy with five processes. Of course, real processes aren't
so cooperative. In the worst case, they could all decide to block at the same time, in which case,
the CPU utilization (fraction of the time the CPU is busy) would be only 1 - p (20% in our
example). If each process decides randomly and independently when to block, the chance that
all n processes are blocked at the same time is only p^n, so CPU utilization is 1 - p^n. Continuing
our example in which n = 5 and p = 0.8, the expected utilization would be 1 - 0.8^5 = 1 - 0.32768 =
0.67232. In other words, the CPU would be busy about 67% of the time on average.
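The estimate 1 - p^n can be tabulated with a few lines of code; the sketch below simply evaluates
the formula for the example value p = 0.8.

// Sketch: CPU utilization 1 - p^n as the multiprogramming level n grows.
public class CpuUtilization {
    public static void main(String[] args) {
        double p = 0.8;                           // fraction of time a process is blocked
        for (int n = 1; n <= 10; n++) {
            double utilization = 1 - Math.pow(p, n);
            System.out.printf("n=%2d  utilization=%.3f%n", n, utilization);
        }
        // n = 5 gives about 0.672, i.e. roughly 67%, as computed in the text.
    }
}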
Algorithms for Memory Management
Clients of the memory manager keep track of allocated blocks (for now, we will not worry about
what happens when a client ``forgets'' about a block). The memory manager needs to keep track
of the ``holes'' between them. The most common data structure is a doubly linked list of holes.
This data structure is called the free list. This free list doesn't actually consume any space (other
than the head and tail pointers), since the links between holes can be stored in the holes
themselves (provided each hole is at least as large as two pointers). To satisfy an allocate(n)
request, the memory manager finds a hole of size at least n and removes it from the list. If the
hole is bigger than n bytes, it can split off the tail of the hole, making a smaller hole, which it
returns to the list. To satisfy a deallocate request, the memory manager turns the returned block
into a ``hole'' data structure and inserts it into the free list. If the new hole is immediately
preceded or followed by a hole, the holes can be coalesced into a bigger hole, as explained
below.
How does the memory manager know how big the returned block is? The usual trick is to put a
small header in the allocated block, containing the size of the block and perhaps some other
information. The allocate routine returns a pointer to the body of the block, not the header, so the
client doesn't need to know about it. The deallocate routine subtracts the header size from its
argument to get the address of the header. The client thinks the block is a little smaller than it
really is. So long as the client ``colors inside the lines'' there is no problem, but if the client has
bugs and scribbles on the header, the memory manager can get completely confused. This is a
frequent problem with malloc in UNIX programs written in C or C++. The Java system uses a
variety of runtime checks to prevent this kind of bug.
To make it easier to coalesce adjacent holes, the memory manager also adds a flag (called a
``boundary tag'') to the beginning and end of each hole or allocated block, and it records the size
of a hole at both ends of the hole.
Figure 5.1 Memory allocation
When the block is deallocated, the memory manager adds the size of the block (which is stored
in its header) to the address of the beginning of the block to find the address of the first word
following the block. It looks at the tag there to see if the following space is a hole or another
allocated block. If it is a hole, it is removed from the free list and merged with the block being
freed, to make a bigger hole. Similarly, if the boundary tag preceding the block being freed
indicates that the preceding space is a hole, we can find the start of that hole by subtracting its
size from the address of the block being freed (that's why the size is stored at both ends), remove
it from the free list, and merge it with the block being freed. Finally, we add the new hole back to
the free list. Holes are kept in a doubly-linked list to make it easy to remove holes from the list
when they are being coalesced with blocks being freed.
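A minimal sketch of the boundary-tag idea is shown below, using an int array as simulated
memory: every block or hole stores its size in both its first and last word (negative for allocated
blocks), so a freed block can inspect its right-hand neighbor in constant time. The layout and
names are illustrative, and coalescing with the preceding hole and free-list maintenance are
omitted for brevity.

// Sketch: boundary tags in a simulated memory of "words".
public class BoundaryTags {
    static int[] mem = new int[1024];        // simulated memory, one word per cell

    // write the tags for a block of `size` words starting at `addr`
    static void setTags(int addr, int size, boolean free) {
        int tag = free ? size : -size;
        mem[addr] = tag;                     // header
        mem[addr + size - 1] = tag;          // footer
    }

    // free a block and coalesce with a free right-hand neighbor, if any
    static void free(int addr) {
        int size = -mem[addr];               // allocated blocks store -size
        int next = addr + size;
        if (next < mem.length && mem[next] > 0) {
            size += mem[next];               // absorb the following hole
        }
        setTags(addr, size, true);
        // (a full allocator would also check mem[addr - 1] to coalesce with the
        //  preceding hole, and would update the free list accordingly)
    }

    public static void main(String[] args) {
        setTags(0, 1024, true);              // one big initial hole
        setTags(0, 100, false);              // pretend the first 100 words were allocated
        setTags(100, 924, true);             // the remainder is a hole
        free(0);                             // coalesces back into one 1024-word hole
        System.out.println("hole size after free: " + mem[0]);
    }
}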
How does the memory manager choose a hole to respond to an allocate request? At first, it might
seem that it should choose the smallest hole that is big enough to satisfy the request. This
strategy is called best fit. It has two problems. First, it requires an expensive search of the entire
free list to find the best hole (although fancier data structures can be used to speed up the search).
More importantly, it leads to the creation of lots of little holes that are not big enough to satisfy
any requests. This situation is called fragmentation, and is a problem for all memory-management
strategies, although it is particularly bad for best-fit. One way to avoid making little
holes is to give the client a bigger block than it asked for. For example, we might round all
requests up to the next larger multiple of 64 bytes. That doesn't make the fragmentation go away,
it just hides it. Unusable space in the form of holes is called external fragmentation, while
unused space inside allocated blocks is called internal fragmentation.
Another strategy is first fit, which simply scans the free list until a large enough hole is found.
Despite the name, first-fit is generally better than best-fit because it leads to less fragmentation.
There is still one problem: Small holes tend to accumulate near the beginning of the free list,
making the memory allocator search farther and farther each time. This problem is solved with
next fit, which starts each search where the last one left off, wrapping around to the beginning
when the end of the list is reached.
Yet another strategy is to maintain separate lists, each containing holes of a different size. This
approach works well at the application level, when only a few different types of objects are
created (although there might be lots of instances of each type). It can also be used in a more
general setting by rounding all requests up to one of a few pre-determined choices. For example,
the memory manager may round all requests up to the next power of two bytes (with a minimum
of, say, 64) and then keep lists of holes of size 64, 128, 256, etc. Assuming the largest request
possible is 1 megabyte, this requires only 15 lists (sizes 64 through 1M). This is the approach taken by most
implementations of malloc. This approach eliminates external fragmentation entirely, but internal
fragmentation may be as bad as 50% in the worst case, which occurs when all requests are one
byte more than a power of two.
Another problem with this approach is how to coalesce neighboring holes. One possibility is not
to try. The system is initialized by splitting memory up into a fixed set of holes (either all the
same size or a variety of sizes). Each request is matched to an ``appropriate'' hole. If the request
is smaller than the hole size, the entire hole is allocated to it anyhow. When the allocated block is
released, it is simply returned to the appropriate free list. Most implementations of malloc use a
variant of this approach.
An interesting trick for coalescing holes with multiple free lists is the buddy system. Assume all
blocks and holes have sizes which are powers of two and each block or hole starts at an address
that is an exact multiple of its size. Then each block has a ``buddy'' of the same size adjacent to
it, such that combining a block of size 2n with its buddy creates a properly aligned block of size
2n+1 For example, blocks of size 4 could start at addresses 0, 4, 8, 12, 16, 20, etc. The blocks at 0
and 4 are buddies; combining them gives a block at 0 of length 8. Similarly 8 and 12 are buddies,
16 and 20 are buddies, etc. The blocks at 4 and 8 are not buddies even though they are neighbors:
Combining them would give a block of size 8 starting at address 4, which is not a multiple of 8.
The address of a block's buddy can be easily calculated by flipping, in the binary representation of
the block's address, the single bit that corresponds to the block's size. For example, the pairs of
buddies (0,4), (8,12), (16,20) in binary are (00000,00100), (01000,01100), (10000,10100). In each
case, the two addresses in the pair differ only in the third bit from the right (the bit with value 4).
In short, you can find the address of the buddy of a block by taking the exclusive or of the address
of the block with its size. To
allocate a block of a given size, first round the size up to the next power of two and look on the
list of blocks of that size. If that list is empty, split a block from the next higher list (if that list is
empty, first add two blocks to it by splitting a block from the next higher list, and so on). When
deallocating a block, first check to see whether the block's buddy is free. If so, combine the block
with its buddy and add the resulting block to the next higher free list. As with allocations,
deallocations can cascade to higher and higher lists.
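The buddy-address calculation described above amounts to a single exclusive-or, as the following
small sketch illustrates (the class name is made up for illustration).

// Sketch: the buddy of a block is its address XORed with its power-of-two size.
public class BuddyAddress {
    static int buddyOf(int address, int size) {
        return address ^ size;               // flip the bit corresponding to the size
    }

    public static void main(String[] args) {
        System.out.println(buddyOf(0, 4));   // 4  : blocks at 0 and 4 are buddies
        System.out.println(buddyOf(8, 4));   // 12 : blocks at 8 and 12 are buddies
        System.out.println(buddyOf(4, 8));   // 12, not 8: 4 and 8 are neighbors, not buddies
    }
}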
Compaction and Garbage Collection
What do you do when you run out of memory? Any of these methods can fail because all the
memory is allocated, or because there is too much fragmentation. Malloc, which is being used to
allocate the data segment of a UNIX process, just gives up and calls the (expensive) OS call to
expand the data segment. A memory manager allocating real physical memory doesn't have that
luxury. The allocation attempt simply fails. There are two ways of delaying this catastrophe,
compaction and garbage collection.
Compaction attacks the problem of fragmentation by moving all the allocated blocks to one end
of memory, thus combining all the holes. Aside from the obvious cost of all that copying, there is
an important limitation to compaction: Any pointers to a block need to be updated when the
block is moved. Unless it is possible to find all such pointers, compaction is not possible.
Pointers can be stored in the allocated blocks themselves as well as in other places in the client of the
memory manager. In some situations, pointers can point not only to the start of blocks but also
into their bodies. For example, if a block contains executable code, a branch instruction might be
a pointer to another location in the same block. Compaction is performed in three phases. First,
the new location of each block is calculated to determine the distance the block will be moved.
Then each pointer is updated by adding to it the amount that the block it is pointing (in) to will
be moved. Finally, the data is actually moved. There are various clever tricks possible to
combine these operations.
Garbage collection finds blocks of memory that are inaccessible and returns them to the free list.
As with compaction, garbage collection normally assumes we find all pointers to blocks, both
within the blocks themselves and ``from the outside.'' If that is not possible, we can still do
``conservative'' garbage collection in which every word in memory that contains a value that
appears to be a pointer is treated as a pointer. The conservative approach may fail to collect
blocks that are garbage, but it will never mistakenly collect accessible blocks. There are three
main approaches to garbage collection: reference counting, mark-and-sweep, and generational
algorithms.
Reference counting keeps in each block a count of the number of pointers to the block. When the
count drops to zero, the block may be freed. This approach is only practical in situations where
there is some ``higher level'' software to keep track of the counts (it's much too hard to do by
hand), and even then, it will not detect cyclic structures of garbage: Consider a cycle of blocks,
each of which is only pointed to by its predecessor in the cycle. Each block has a reference count
of 1, but the entire cycle is garbage.
Mark-and-sweep works in two passes: First we mark all non-garbage blocks by doing a depth-first
search starting with each pointer ``from outside'':
void mark(Address b) {
    mark block b;
    for (each pointer p in block b) {
        if (the block pointed to by p is not marked)
            mark(p);
    }
}
The second pass sweeps through all blocks and returns the unmarked ones to the free list. The
sweep pass usually also does compaction, as described above.
There are two problems with mark-and-sweep. First, the amount of work in the mark pass is
proportional to the amount of non-garbage. Thus if memory is nearly full, it will do a lot of work
with very little payoff. Second, the mark phase does a lot of jumping around in memory, which is
bad for virtual memory systems, as we will soon see.
The third approach to garbage collection is called generational collection. Memory is divided
into spaces. When a space is chosen for garbage collection, all subsequent references to objects
in that space cause the object to be copied to a new space. After a while, the old space either
becomes empty and can be returned to the free list all at once, or at least becomes so sparse
that a mark-and-sweep garbage collection on it will be cheap. As an empirical fact, objects tend
to be either short-lived or long-lived. In other words, an object that has survived for a while is
likely to live a lot longer. By carefully choosing where to move objects when they are
referenced, we can arrange to have some spaces filled only with long-lived objects, which are
very unlikely to become garbage.
5.2 Swapping
When all else fails, allocate simply fails. In the case of an application program, it may be
adequate to simply print an error message and exit. An OS must be able to recover more gracefully.
We motivated memory management by the desire to have many processes in memory at once. In
a batch system, if the OS cannot allocate memory to start a new job, it can ``recover'' by simply
delaying starting the job. If there is a queue of jobs waiting to be created, the OS might want to
go down the list, looking for a smaller job that can be created right away. This approach
maximizes utilization of memory, but can starve large jobs. The situation is analogous to short-term
CPU scheduling, in which SJF gives optimal CPU utilization but can starve long bursts.
The same trick works here: aging. As a job waits longer and longer, increase its priority, until its
priority is so high that the OS refuses to skip over it looking for a more recently arrived but
smaller job.
An alternative way of avoiding starvation is to use a memory-allocation scheme with fixed
partitions (holes are not split or combined). Assuming no job is bigger than the biggest partition,
there will be no starvation, provided that each time a partition is freed, we start the first job in
line that is smaller than that partition. However, we have another choice analogous to the
difference between first-fit and best fit. Of course we want to use the ``best'' hole for each job
(the smallest free partition that is at least as big as the job), but suppose the next job in line is
small and all the small partitions are currently in use. We might want to delay starting that job
and look through the arrival queue for a job that better uses the partitions currently available.
This policy re-introduces the possibility of starvation, which we can combat by aging, as above.
If a disk is available, we can also swap blocked jobs out to disk. When a job finishes, we first
swap back jobs from disk before allowing new jobs to start. When a job is blocked (either
because it wants to do I/O or because our short-term scheduling algorithm says to switch to
another job), we have a choice of leaving it in memory or swapping it out. One way of looking at
this scheme is that it increases the multiprogramming level (the number of jobs ``in memory'') at
the cost of making it (much) more expensive to switch jobs. A variant of the MLFQ (multi-level
feedback queues) CPU scheduling algorithm is particularly attractive for this situation. The
queues are numbered from 0 up to some maximum. When a job becomes ready, it enters queue
zero. The CPU scheduler always runs a job from the lowest-numbered non-empty queue (i.e., the
priority is the negative of the queue number). It runs a job from queue i for a maximum of 2^i
quanta. If the job does not block or complete within that time limit, it is added to the next higher
queue. This algorithm behaves like RR with short quanta in that short bursts get high priority, but
does not incur the overhead of frequent swaps between jobs with long bursts. The number of
swaps is limited to the logarithm of the burst size.
5.3 Paging
Most modern computers have special hardware called a memory management unit (MMU). This
unit sits between the CPU and the memory unit. Whenever the CPU wants to access memory
(whether it is to load an instruction or load or store data), it sends the desired memory address to
the MMU, which translates it to another address before passing it on to the memory unit. The
address generated by the CPU, after any indexing or other addressing-mode arithmetic, is called
a virtual address, and the address it gets translated to by the MMU is called a physical address.
Figure 5.2 Paging in memory management
Normally, the translation is done at the granularity of a page. Each page is a power of 2 bytes
long, usually between 1024 and 8192 bytes. If virtual address p is mapped to physical address f
(where p is a multiple of the page size), then address p+o is mapped to physical address f+o for
any offset o less than the page size. In other words, each page is mapped to a contiguous region
of physical memory called a page frame.
Figure 5.3 Allocation of pages in page frames
The MMU allows a contiguous region of virtual memory to be mapped to page frames scattered
around physical memory making life much easier for the OS when allocating memory. Much
more importantly, however, it allows infrequently-used pages to be stored on disk. Here's how it
works: The tables used by the MMU have a valid bit for each page in the virtual address space. If
this bit is set, the translation of virtual addresses on a page proceeds as normal. If it is clear, any
attempt by the CPU to access an address on the page generates an interrupt called a page fault
trap. The OS has an interrupt handler for page faults, just as it has a handler for any other kind of
interrupt. It is the job of this handler to get the requested page into memory.
In somewhat more detail, when a page fault is generated for page p1, the interrupt handler does
the following:
 Find out where the contents of page p1 are stored on disk. The OS keeps this information
in a table. It is possible that this page isn't anywhere at all, in which case the memory
reference is simply a bug. In this case, the OS takes some corrective action such as killing
the process that made the reference (this is the source of the notorious message ``memory
fault -- core dumped''). Assuming the page is on disk:
 Find another page p2 mapped to some frame f of physical memory that is not used much.
 Copy the contents of frame f out to disk.
 Clear page p2's valid bit so that any subsequent references to page p2 will cause a page
fault.
 Copy page p1's data from disk to frame f.
 Update the MMU's tables so that page p1 is mapped to frame f.
 Return from the interrupt, allowing the CPU to retry the instruction that caused the
interrupt.
Page Tables
Conceptually, the MMU contains a page table which is simply an array of entries indexed by
page number. Each entry contains some flags (such as the valid bit mentioned earlier) and a
frame number. The physical address is formed by concatenating the frame number with the
offset, which are the low-order bits of the virtual address.
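As a concrete illustration of this lookup, here is a C-like sketch for 4K pages and a single-level table; pte_t, page_table, and raise_page_fault are invented names used only for this example.

#define PAGE_SHIFT 12                          /* 4K pages: the offset is the low 12 bits */
#define OFFSET_MASK ((1UL << PAGE_SHIFT) - 1)

typedef struct {
    unsigned valid : 1;                        /* the valid bit mentioned above */
    unsigned frame : 20;                       /* page frame number */
} pte_t;

extern pte_t page_table[];                     /* indexed by virtual page number */
extern void raise_page_fault(unsigned long vaddr);

unsigned long translate(unsigned long vaddr) {
    unsigned long vpn    = vaddr >> PAGE_SHIFT;     /* virtual page number */
    unsigned long offset = vaddr &  OFFSET_MASK;
    pte_t pte = page_table[vpn];
    if (!pte.valid)
        raise_page_fault(vaddr);               /* trap to the OS handler described earlier */
    return ((unsigned long)pte.frame << PAGE_SHIFT) | offset;  /* concatenate frame and offset */
}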
Figure 5.4 Pages table management (a)
There are two problems with this conceptual view. First, the lookup in the page table has to be
fast, since it is done on every single memory reference--at least once per instruction executed (to
fetch the instruction itself) and often two or more times per instruction. Thus the lookup is
always done by special-purpose hardware. Even with special hardware, if the page table is stored
in memory, the table lookup makes each memory reference generated by the CPU cause two
references to memory. Since in modern computers, the speed of memory is often the bottleneck
(processors are getting so fast that they spend much of their time waiting for memory), virtual
memory could make programs run twice as slowly as they would without it. We will look at
ways of avoiding this problem in a minute, but first we will consider the other problem: The page
tables can get large.
Suppose the page size is 4K bytes and a virtual address is 32 bits long (these are typical values
for current machines). Then the virtual address would be divided into a 20-bit page number and a
12-bit offset (because 2^12 = 4096 = 4K), so the page table would have to have 2^20 = 1,048,576
entries. If each entry is 4 bytes long, that would use up 4 megabytes of memory. And each
process has its own page table. Newer machines being introduced now generate 64-bit addresses.
Such a machine would need a page table with 4,503,599,627,370,496 entries!
Fortunately, the vast majority of the page table entries are normally marked ``invalid.'' Although
the virtual address may be 32 bits long and thus capable of addressing a virtual address space of
4 gigabytes, a typical process is at most a few megabytes in size, and each megabyte of virtual
memory uses only 256 page-table entries (for 4K pages).
There are several different page table organizations used in actual computers. One approach is to
put the page table entries in special registers. This was the approach used by the PDP-11
minicomputer introduced in the 1970's. The virtual address was 16 bits and the page size was 8K
bytes. Thus the virtual address consisted of 3 bits of page number and 13 bits of offset, for a total
of 8 pages per process. The eight page-table entries were stored in special registers. As an aside,
16-bit virtual addresses mean that any one process could access only 64K bytes of memory. Even
in those days that was considered too small, so later versions of the PDP-11 used a trick called
``split I/D space.'' Each memory reference generated by the CPU had an extra bit indicating
whether it was an instruction fetch (I) or a data reference (D), thus allowing 64K bytes for the
program and 64K bytes for the data. Putting page table entries in registers helps make the MMU
run faster (the registers were much faster than main memory), but this approach has a downside
as well. The registers are expensive, so this approach only works when the page table is very small. Also, each time
the OS wants to switch processes, it has to reload the registers with the page-table entries of the
new process.
A second approach is to put the page table in main memory. The (physical) address of the page
table is held in a register. The page field of the virtual address is added to this register to find the
page table entry in physical memory. This approach has the advantage that switching processes
is easy (all you have to do is change the contents of one register) but it means that every memory
reference generated by the CPU requires two trips to memory. It also can use too much memory,
as we saw above.
A third approach is to put the page table itself in virtual memory. The page number extracted
from the virtual address is used as a virtual address to find the page table entry. To prevent an
infinite recursion, this virtual address is looked up using a page table stored in physical memory.
As a concrete example, consider the VAX computer, introduced in the late 70's. The virtual
address of the VAX is 30 bits long, with 512-byte pages (probably too small even at that time!)
Thus the virtual address a consists of a 21-bit page number p and a nine-bit offset o. The page
number is multiplied by 4 (the size of a page-table entry) and added to the contents of the MMU
register containing the address of the page table. This gives a virtual address that is resolved using a page table in physical memory to get a frame number f. In more detail, the high-order bits of p index into a table to find the physical frame holding the relevant piece of the page table; that frame number, concatenated with the low bits of p, gives the physical address of a word containing f. The concatenation of f with o is the desired physical address.
Figure 5.5 Pages table management (b)
As you can see, another way of looking at this algorithm is that the virtual address is split into
fields that are used to walk through a tree of page tables. The SPARC processor (which you are
using for this course) uses a similar technique, but with one more level: The 32-bit virtual
address is divided into three index fields of 8, 6, and 6 bits and a 12-bit offset. The root of the
tree is pointed to by an entry in a context table, which has one entry for each process. The
advantage of these schemes is that they save on memory. For example, consider a VAX process
that only uses the first megabyte of its address space (2048 512-byte pages). Since each second
level page table has 128 entries, there will be 16 of them used. Adding to this the 64K bytes
needed for the first-level page table, the total space used for page tables is only 72K bytes, rather
than the 8 megabytes that would be needed for a one-level page table. The downside is that each
level of page table adds one more memory lookup on each reference generated by the CPU.
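Here is a sketch of a two-level walk of this kind, for a 32-bit address split into two 10-bit indices and a 12-bit offset (the same split used later in the x86 discussion); it reuses the pte_t type and raise_page_fault from the earlier sketch and is illustrative only.

unsigned long walk_two_level(pte_t **top_table, unsigned long vaddr) {
    unsigned long top = (vaddr >> 22) & 0x3ff;    /* index into the top-level table */
    unsigned long mid = (vaddr >> 12) & 0x3ff;    /* index into the second-level table */
    unsigned long off =  vaddr        & 0xfff;

    pte_t *second = top_table[top];               /* each entry points to a second-level table */
    if (second == NULL)
        raise_page_fault(vaddr);                  /* the whole region is unmapped */

    pte_t pte = second[mid];
    if (!pte.valid)
        raise_page_fault(vaddr);

    return ((unsigned long)pte.frame << 12) | off;
}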
A fourth approach is to use what is called an inverted page table. (Actually, the very first
computer to have virtual memory, the Atlas computer built in England in the late 50's used this
approach, so in some sense all the page tables described above are ``inverted.'') An ordinary page
table has an entry for each page, containing the address of the corresponding page frame (if any).
An inverted page table has an entry for each page frame, containing the corresponding page
number. To resolve a virtual address, the table is searched to find an entry that contains the page
number. The good news is that an inverted page table only uses a fixed fraction of memory. For
example, if a page is 4K bytes and a page-table entry is 4 bytes, there will be exactly 4 bytes of page table for each 4096 bytes of physical memory. In other words, less than 0.1% of memory
will be used for page tables. The bad news is that this is by far the slowest of the methods, since
it requires a search of the page table for each reference. The original Atlas machine had special
hardware to search the table in parallel, which was reasonable since the table had only 2048
entries.
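A sketch of an inverted-table lookup; inverted_table, NFRAMES, and the linear search are only illustrative (the Atlas searched all entries in parallel in hardware, and software implementations hash on the pair of process and page number instead, as noted below).

struct ipt_entry { int valid; int pid; unsigned long vpn; };   /* one entry per page frame */
extern struct ipt_entry inverted_table[];
extern const long NFRAMES;

long lookup_inverted(int pid, unsigned long vpn) {
    for (long f = 0; f < NFRAMES; f++)
        if (inverted_table[f].valid &&
            inverted_table[f].pid == pid &&
            inverted_table[f].vpn == vpn)
            return f;                              /* the frame holding this page */
    return -1;                                     /* not resident: page fault */
}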
All of the methods considered thus far can be sped up by using a trick called caching. We will be
seeing many more examples of caching used to speed things up throughout the course. In
fact, it has been said that caching is the only technique in computer science used to improve
performance. In this case, the specific device is called a translation lookaside buffer (TLB). The
TLB contains a set of entries, each of which contains a page number, the corresponding page
frame number, and the protection bits. There is special hardware to search the TLB for an entry
matching a given page number. If the TLB contains a matching entry, it is found very quickly
and nothing more needs to be done. Otherwise we have a TLB miss and have to fall back on one
of the other techniques to find the translation. However, we can take that translation we found
the hard way and put it into the TLB so that we find it much more quickly the next time. The
TLB has a limited size, so to add a new entry, we usually have to throw out an old entry. The
usual technique is to throw out the entry that hasn't been used the longest. This strategy, called
LRU (least-recently used) replacement is also implemented in hardware. The reason this
approach works so well is that most programs spend most of their time accessing a small set of
pages over and over again. For example, a program often spends a lot of time in an ``inner loop''
in one procedure. Even if that procedure, the procedures it calls, and so on are spread over 40K
bytes, 10 TLB entries will be sufficient to describe all these pages, and there will be no TLB misses
provided the TLB has at least 10 entries. This phenomenon is called locality. In practice, the TLB
hit rate for instruction references is extremely high. The hit rate for data references is also good,
but can vary widely for different programs.
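The lookup-then-fill behaviour can be sketched as follows, reusing PAGE_SHIFT and OFFSET_MASK from the earlier sketch; the table names and the helpers walk_page_table and tlb_insert are hypothetical, and a real TLB does the search and the LRU bookkeeping in hardware rather than in a loop.

struct tlb_entry { int valid; unsigned long vpn; unsigned long frame; unsigned prot; };
extern struct tlb_entry tlb[];
extern const int NTLB;
extern unsigned long walk_page_table(unsigned long vpn);        /* any of the schemes above */
extern void tlb_insert(unsigned long vpn, unsigned long frame); /* evicts the LRU entry */

unsigned long translate_with_tlb(unsigned long vaddr) {
    unsigned long vpn = vaddr >> PAGE_SHIFT;
    unsigned long off = vaddr & OFFSET_MASK;

    for (int i = 0; i < NTLB; i++)                 /* hardware checks all entries at once */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].frame << PAGE_SHIFT) | off;   /* TLB hit */

    unsigned long frame = walk_page_table(vpn);    /* TLB miss: do it the hard way */
    tlb_insert(vpn, frame);                        /* remember it for next time */
    return (frame << PAGE_SHIFT) | off;
}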
If the TLB performs well enough, it almost doesn't matter how TLB misses are resolved. The
IBM PowerPC and the HP Spectrum use inverted page tables organized as hash tables in
conjunction with a TLB. The MIPS computers (MIPS is now a division of Silicon Graphics) get
rid of hardware page tables altogether. A TLB miss causes an interrupt, and it is up to the OS to
search the page table and load the appropriate entry into the TLB. The OS typically uses an
inverted page table implemented as a software hash table.
Two processes may map the same page number to different page frames. Since the TLB
hardware searches for an entry by page number, there would be an ambiguity if entries
corresponding to two processes were in the TLB at the same time. There are two ways around
this problem. Some systems simply flush the TLB (set a bit in all entries marking them as
unused) whenever they switch processes. This is very expensive, not because of the cost of
flushing the TLB, but because of all the TLB misses that will happen when the new process
starts running. An alternative approach is to add a process identifier to each entry. The hardware
then searches for the concatenation of the page number and the process id of the current
process.
We mentioned earlier that each page-table entry contains a ``valid'' bit as well as some other bits.
These other bits include the following.
• Protection. At a minimum, one bit to flag the page as read-only or read/write. Sometimes more bits to indicate whether the page may be executed as instructions, etc.
• Modified. This bit, usually called the dirty bit, is set whenever the page is referenced by a write (store) operation.
• Referenced. This bit is set whenever the page is referenced for any reason, whether load or store. We will see in the next section how these bits are used.
Page Replacement
All of these hardware methods for implementing paging have one thing in common: When the
CPU generates a virtual address for which the corresponding page table entry is marked invalid,
the MMU generates a page fault interrupt and the OS must handle the fault as explained above.
The OS checks its tables to see why it marked the page as invalid. There are (at least) three possible reasons:
• There is a bug in the program being run. In this case the OS simply kills the program (``memory fault -- core dumped'').
• UNIX treats a reference just beyond the end of a process' stack as a request to grow the stack. In this case, the OS allocates a page frame, clears it to zeros, and updates the MMU's page tables so that the requested page number points to the allocated frame.
• The requested page is on disk but not in memory. In this case, the OS allocates a page frame, copies the page from disk into the frame, and updates the MMU's page tables so that the requested page number points to the allocated frame.
In all but the first case, the OS is faced with the problem of choosing a frame. If there are any
unused frames, the choice is easy, but that will seldom be the case. When memory is heavily
used, the choice of frame is crucial for decent performance.
We will first consider page-replacement algorithms for a single process, and then consider
algorithms to use when there are multiple processes, all competing for the same set of frames.
Frame Allocation for a Single Process
• FIFO. (First-in, first-out) Keep the page frames in an ordinary queue, moving a frame to the tail of the queue when it is loaded with a new page, and always choose the frame at the head of the queue for replacement. In other words, use the frame whose page has been in memory the longest. While this algorithm may seem at first glance to be reasonable, it is actually about as bad as you can get. The problem is that a page that has been in memory for a long time could equally likely be ``hot'' (frequently used) or ``cold'' (unused), but FIFO treats them the same way. In fact FIFO is no better than, and may indeed be worse than
• RAND. (Random) Simply pick a random frame. This algorithm is also pretty bad.
• OPT. (Optimum) Pick the frame whose page will not be used for the longest time in the future. If there is a page in memory that will never be used again, its frame is obviously the best choice for replacement. Otherwise, if (for example) page A will be next referenced 8 million instructions in the future and page B will be referenced 6 million instructions in the future, choose page A. This algorithm is sometimes called Belady's MIN algorithm after its inventor. It can be shown that OPT is the best possible algorithm, in the sense that for any reference string (sequence of page numbers touched by a process), OPT gives the smallest number of page faults. Unfortunately, OPT, like SJF processor scheduling, is unimplementable because it requires knowledge of the future. Its only use is as a theoretical limit. If you have an algorithm you think looks promising, see how it compares to OPT on some sample reference strings.
• LRU. (Least Recently Used) Pick the frame whose page has not been referenced for the longest time. The idea behind this algorithm is that page references are not random. Processes tend to have a few hot pages that they reference over and over again. A page that has been recently referenced is likely to be referenced again in the near future. Thus LRU is likely to approximate OPT. LRU is actually quite a good algorithm. There are two ways of finding the least recently used page frame. One is to maintain a list. Every time a page is referenced, it is moved to the head of the list. When a page fault occurs, the least-recently used frame is the one at the tail of the list. Unfortunately, this approach requires a list operation on every single memory reference, and even though it is a pretty simple list operation, doing it on every reference is completely out of the question, even if it were done in hardware. An alternative approach is to maintain a counter or timer, and on every reference store the counter into a table entry associated with the referenced frame. On a page fault, search through the table for the smallest entry. This approach requires a search through the whole table on each page fault, but since page faults are expected to be tens of thousands of times less frequent than memory references, that's ok. A clever variant on this scheme is to maintain an n by n array of bits, initialized to 0, where n is the number of page frames. On a reference to page k, first set all the bits in row k to 1 and then set all bits in column k to zero. It turns out that if row k has the smallest value (when treated as a binary number), then frame k is the least recently used. Unfortunately, all of these techniques require hardware support and nobody makes hardware that supports them. Thus LRU, in its pure form, is just about as impractical as OPT. Fortunately, it is possible to get a good enough approximation to LRU (which is probably why nobody makes hardware to support true LRU).
• NRU. (Not Recently Used) There is a form of support that is almost universally provided by the hardware: Each page table entry has a referenced bit that is set to 1 by the hardware whenever the entry is used in a translation. The hardware never clears this bit to zero, but the OS software can clear it whenever it wants. With NRU, the OS arranges for periodic timer interrupts (say once every millisecond) and on each ``tick,'' it goes through the page table and clears all the referenced bits. On a page fault, the OS prefers frames whose referenced bits are still clear, since they contain pages that have not been referenced since the last timer interrupt. The problem with this technique is that the granularity is too coarse. If the last timer interrupt was recent, all the bits will be clear and there will be no information to distinguish frames from each other.
• SLRU. (Sampled LRU) This algorithm is similar to NRU, but before the referenced bit for a frame is cleared it is saved in a counter associated with the frame and maintained in software by the OS. One approach is to add the bit to the counter. The frame with the lowest counter value will be the one that was referenced in the smallest number of recent ``ticks''. This variant is called NFU (Not Frequently Used). A better approach is to shift the bit into the counter (from the left). The frame that hasn't been referenced for the largest number of ``ticks'' will be associated with the counter that has the largest number of leading zeros. Thus we can approximate the least-recently used frame by selecting the frame corresponding to the smallest value (in binary). (That will select the frame unreferenced for the largest number of ticks, and break ties in favor of the frame longest unreferenced before that.) This only approximates LRU for two reasons: It only records whether a page was referenced during a tick, not when in the tick it was referenced, and it only remembers the most recent n ticks, where n is the number of bits in the counter. We can get as close an approximation to true LRU as we like, at the cost of increasing the overhead, by making the ticks short and the counters very long.
• Second Chance. When a page fault occurs, look at the page frames one at a time, in order of their physical addresses. If the referenced bit is clear, choose the frame for replacement, and return. If the referenced bit is set, give the frame a ``second chance'' by clearing its referenced bit and going on to the next frame (wrapping around to frame zero at the end of memory). Eventually, a frame with a zero referenced bit must be found, since at worst, the search will return to where it started. Each time this algorithm is called, it starts searching where it last left off. This algorithm is usually called CLOCK because the frames can be visualized as being around the rim of an (analogue) clock, with the current location indicated by the second hand. A small sketch of this scan appears after this list.
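A minimal sketch of the CLOCK scan described in the last item; frames, NFRAMES, and the referenced field are invented names, and the handling of dirty pages discussed next is omitted.

struct frame_info { int referenced; /* plus other bookkeeping */ };
extern struct frame_info frames[];
extern const int NFRAMES;
static int hand = 0;                     /* position of the clock hand, kept across calls */

int choose_victim(void) {
    for (;;) {
        if (frames[hand].referenced == 0) {
            int victim = hand;           /* its second chance has been used up */
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        frames[hand].referenced = 0;     /* give it a second chance */
        hand = (hand + 1) % NFRAMES;
    }
}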
We have glossed over some details here. First, we said that when a frame is selected for
replacement, we have to copy its contents out to disk. Obviously, we can skip this step if the
page frame is unused. We can also skip the step if the page is ``clean,'' meaning that it has not
been modified since it was read into memory. Most MMU's have a dirty bit associated with each
page. When the MMU is setting the referenced bit for a page, it also sets the dirty bit if the
reference is a write (store) reference. Most of the algorithms above can be modified in an obvious way to prefer clean pages over dirty ones. For example, one version of NRU always prefers an unreferenced page over a referenced one, but within each category, it prefers clean over dirty pages. The CLOCK algorithm skips frames with either the referenced or the dirty bit set. However, when it encounters a dirty frame, it starts a disk-write operation to clean the frame. With this modification, we have to be careful not to get into an infinite loop. If the hand makes a complete circuit finding nothing but dirty pages, the OS simply has to wait until one of the page-cleaning requests finishes. Hopefully, this rarely if ever happens.
There is a curious phenomenon called Belady's Anomaly that comes up in some algorithms but
not others. Consider the reference string (sequence of page numbers) 0 1 2 3 0 1 4 0 1 2 3 4. If
we use FIFO with three page frames, we get 9 page faults, including the three faults to bring in
the first three pages, but with more memory (four frames), we actually get more faults (10).
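The anomaly is easy to reproduce with a few lines of C; this little simulation (all names are invented for the example) counts FIFO faults for the reference string above and prints 9 faults for three frames and 10 for four.

#include <stdio.h>

/* Count page faults under FIFO replacement with nframes frames. */
static int fifo_faults(const int *refs, int nrefs, int nframes) {
    int frames[16], next = 0, faults = 0;
    for (int i = 0; i < nframes; i++)
        frames[i] = -1;                            /* all frames start empty */
    for (int r = 0; r < nrefs; r++) {
        int hit = 0;
        for (int i = 0; i < nframes; i++)
            if (frames[i] == refs[r]) { hit = 1; break; }
        if (!hit) {
            frames[next] = refs[r];                /* replace the oldest page */
            next = (next + 1) % nframes;
            faults++;
        }
    }
    return faults;
}

int main(void) {
    int refs[] = { 0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4 };
    printf("3 frames: %d faults\n", fifo_faults(refs, 12, 3));   /* 9 */
    printf("4 frames: %d faults\n", fifo_faults(refs, 12, 4));   /* 10 */
    return 0;
}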
Frame Allocation for Multiple Processes
Up to this point, we have been assuming that there is only one active process. When there are
multiple processes, things get more complicated. Algorithms that work well for one process can
give terrible results if they are extended to multiple processes in a naive way.
LRU would give excellent results for a single process, and all of the good practical algorithms
can be seen as ways of approximating LRU. A straightforward extension of LRU to multiple
processes still chooses the page frame that has not been referenced for the longest time.
However, that is a lousy idea. Consider a workload consisting of two processes. Process A is
copying data from one file to another, while process B is doing a CPU-intensive calculation on a
large matrix. Whenever process A blocks for I/O, it stops referencing its pages. After a while
process B steals all the page frames away from A. When A finally finishes with an I/O operation,
it suffers a series of page faults until it gets back the pages it needs, then computes for a very
short time and blocks again on another I/O operation.
There are two problems here. First, we are calculating the time since the last reference to a page
incorrectly. The idea behind LRU is ``use it or lose it.'' If a process hasn't referenced a page for a
long time, we take that as evidence that it doesn't want the page any more and re-use the frame
for another purpose. But in a multiprogrammed system, there may be two different reasons why
a process isn't touching a page: because it is using other pages, or because it is blocked. Clearly,
a process should only be penalized for not using a page when it is actually running. To capture
this idea, we introduce the notion of virtual time. The virtual time of a process is the amount of
CPU time it has used thus far. We can think of each process as having its own clock, which runs
only while the process is using the CPU. It is easy for the CPU scheduler to keep track of virtual
time. Whenever it starts a burst running on the CPU, it records the current real time. When an
interrupt occurs, it calculates the length of the burst that just completed and adds that value to the
virtual time of the process that was running. An implementation of LRU should record which
process owns each page, and record the virtual time its owner last touched it. Then, when
choosing a page to replace, we should consider the difference between the timestamp on a page
and the current virtual time of the page's owner. Algorithms that attempt to approximate LRU
should do something similar.
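The bookkeeping this implies is small; the sketch below uses invented structures (struct proc, struct vframe) and hook names to show where the scheduler and the replacement code would record and compare virtual times.

struct proc   { long virtual_time; long dispatched_at; };
struct vframe { struct proc *owner; long last_touch; };   /* owner's virtual time at last use */

void on_dispatch(struct proc *p, long real_now)  { p->dispatched_at = real_now; }
void on_preempt(struct proc *p, long real_now)   { p->virtual_time += real_now - p->dispatched_at; }
void on_reference(struct vframe *f)              { f->last_touch = f->owner->virtual_time; }

/* Replacement compares ages measured in each owner's own virtual time. */
long frame_age(const struct vframe *f)           { return f->owner->virtual_time - f->last_touch; }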
There is another problem with our naive multi-process LRU. The CPU-bound process B has an
unlimited appetite for pages, whereas the I/O-bound process A only uses a few pages. Even if we
calculate LRU using virtual time, process B might occasionally steal pages from A. Giving more
pages to B doesn't really help it run any faster, but taking from A a page it really needs has a
severe effect on A. A moment's thought shows that an ideal page-replacement algorithm for this
particular load would divide the page frames into two pools. Process A would get as many pages as it needs and B would get the rest. Each pool would be managed separately using LRU. That is, whenever B page
faults, it would replace the page in its pool that hadn't been referenced for the longest time.
In general, each process has a set of pages that it is actively using. This set is called the working
set of the process. If a process is not allocated enough memory to hold its working set, it will
cause an excessive number of page faults. But once a process has enough frames to hold its
working set, giving it more memory will have little or no effect.
Figure 5.6 Page frame allocation vs page fault rate
More formally, given a number τ, the working set with parameter τ of a process, denoted Wτ, is
the set of pages touched by the process during its most recent τ references to memory. Because
most processes have a very high degree of locality, the size of τ is not very important provided
it's large enough. A common choice of τ is the number of instructions executed in 1/2 second. In
other words, we will consider the working set of a process to be the set of pages it has touched
during the previous 1/2 second of virtual time. The Working Set Model of program behavior says
that the system will only run efficiently if each process is given enough page frames to hold its
working set. What if there aren't enough frames to hold the working sets of all processes? In this
case, memory is over-committed and it is hopeless to run all the processes efficiently. It would be
better to simply stop one of the processes and give its pages to others.
Another way of looking at this phenomenon is to consider CPU utilization as a function of the
level of multiprogramming (number of processes). With too few processes, we can't keep the
CPU busy. Thus as we increase the number of processes, we would like to see the CPU
utilization steadily improve, eventually getting close to 100%. Realistically, we cannot expect to do quite that well, but we would still expect increasing performance when we add more processes.
Figure 5.7 Number of process vs CPU utilization (a)
Unfortunately, if we allow memory to become over-committed, something very different may
happen:
Figure 5.8 Number of process vs CPU utilization (b)
After a point, adding more processes doesn't help because the new processes do not have enough
memory to run efficiently. They end up spending all their time page-faulting instead of doing
useful work. In fact, the extra page-fault load on the disk ends up slowing down other processes
until we reach a point where nothing is happening but disk traffic. This phenomenon is called
thrashing.
The moral of the story is that there is no point in trying to run more processes than will fit in
memory. When we say a process ``fits in memory,'' we mean that enough page frames have been
allocated to it to hold all of its working set. What should we do when we have more processes
than will fit? In a batch system (one where users drop off their jobs and expect them to be run
some time in the future), we can just delay starting a new job until there is enough memory to
hold its working set. In an interactive system, we may not have that option. Users can start
processes whenever they want. We still have the option of modifying the scheduler however. If
we decide there are too many processes, we can stop one or more processes (tell the scheduler
not to run them). The page frames assigned to those processes can then be taken away and given
to other processes. It is common to say the stopped processes have been ``swapped out'' by
analogy with a swapping system, since all of the pages of the stopped processes have been
moved from main memory to disk. When more memory becomes available (because a process
has terminated or because its working set has become smaller) we can ``swap in'' one of the
stopped processes. We could explicitly bring its working set back into memory, but it is
sufficient (and usually a better idea) just to make the process runnable. It will quickly bring its
working set back into memory simply by causing page faults. This control of the number of
active processes is called load control. It is also sometimes called medium-term scheduling as
contrasted with long-term scheduling, which is concerned with deciding when to start a new job,
and short-term scheduling, which determines how to allocate the CPU resource among the
currently active jobs.
It cannot be stressed too strongly that load control is an essential component of any good page-replacement algorithm. When a page fault occurs, we want to make a good decision on which
page to replace. But sometimes no decision is good, because there simply are not enough page
frames. At that point, we must decide to run some of the processes well rather than run all of
them very poorly.
This is a very good model, but it doesn't immediately translate into an algorithm. Various
specific algorithms have been proposed. As in the single process case, some are theoretically
good but unimplementable, while others are easy to implement but bad. The trick is to find a
reasonable compromise.
• Fixed Allocation. Give each process a fixed number of page frames. When a page fault occurs, use LRU or some approximation to it, but only consider frames that belong to the faulting process. The trouble with this approach is that it is not at all obvious how to decide how many frames to allocate to each process. If you give a process too few frames, it will thrash. If you give it too many, the extra frames are wasted; you would be better off giving those frames to another process, or starting another job (in a batch system). In some environments, it may be possible to statically estimate the memory requirements of each job. For example, a real-time control system tends to run a fixed collection of processes for a very long time. The characteristics of each process can be carefully measured and the system can be tuned to give each process exactly the amount of memory it needs. Fixed allocation has also been tried with batch systems: Each user is required to declare the memory allocation of a job when it is submitted. The customer is charged both for memory allocated and for I/O traffic, including traffic caused by page faults. The idea is that the customer has the incentive to declare the optimum size for his job. Unfortunately, even assuming good will on the part of the user, it can be very hard to estimate the memory demands of a job. Besides, the working-set size can change over the life of the job.
• Page-Fault Frequency (PFF). This approach is similar to fixed allocation, but the allocations are dynamically adjusted. The OS continuously monitors the fault rate of each process, in page faults per second of virtual time. If the fault rate of a process gets too high, either give it more pages or swap it out. If the fault rate gets too low, take some pages away. When you get back enough pages this way, either start another job (in a batch system) or restart some job that was swapped out. This technique is actually used in some existing systems. The problem is choosing the right values of ``too high'' and ``too low.'' You also have to be careful to avoid an unstable system, where you are continually stealing pages from a process until it thrashes and then giving them back.
• Working Set. The Working Set (WS) algorithm (as contrasted with the working set model) is as follows: Constantly monitor the working set (as defined above) of each process. Whenever a page leaves the working set, immediately take it away from the process and add its frame to a pool of free frames. When a process page faults, allocate it a frame from the pool of free frames. If the pool becomes empty, we have an overload situation--the sum of the working set sizes of the active processes exceeds the size of physical memory--so one of the processes is stopped. The problem is that WS, like SJF or true LRU, is not implementable. A page may leave a process' working set at any time, so the WS algorithm would require the working set to be monitored on every single memory reference. That's not something that can be done by software, and it would be totally impractical to build special hardware to do it. Thus all good multi-process paging algorithms are essentially approximations to WS.
• Clock. Some systems use a global CLOCK algorithm, with all frames, regardless of current owner, included in a single clock. As we said above, CLOCK approximates LRU, so global CLOCK approximates global LRU, which, as we said, is not a good algorithm. However, by being a little careful, we can fix the worst failing of global clock. If the clock ``hand'' is moving too ``fast'' (i.e., if we have to examine too many frames before finding one to replace on an average call), we can take that as evidence that memory is over-committed and swap out some process.
• WSClock. An interesting algorithm has been proposed (but not, to the best of my knowledge, widely implemented) that combines some of the best features of WS and CLOCK. Assume that we keep track of the current virtual time VT(p) of each process p. Also assume that in addition to the reference and dirty bits maintained by the hardware for each page frame i, we also keep track of process[i] (the identity of the process that owns the page currently occupying the frame) and LR[i] (an approximation to the time of the last reference to the frame). The time stamp LR[i] is expressed as the last reference time according to the virtual time of the process that owns the frame.
In the flowchart below, the WS parameter (the size of the window in virtual time used to determine whether a page is in the working set) is denoted by the Greek letter tau. The parameter F is the number of frames--i.e., the size of physical memory divided by the page size. Like CLOCK, WSClock walks through the frames in order, looking for a good candidate for replacement, clearing the reference bits as it goes. If the frame has been referenced since it was last inspected, it is given a ``second chance''. (The counter LR[i] is also updated to indicate that the page has been referenced recently in terms of the virtual time of its owner.) If not, the page is given a ``third chance'' by seeing whether it appears to be in the working set of its owner.
Figure 5.9 CLOCK algorithm flowchart
The time since its last reference is approximately calculated by subtracting LR[i] from the current (virtual) time. If the result is less than the parameter tau, the frame is passed over. If the page fails this test, it is either used immediately or scheduled for cleaning (writing its contents out to disk and clearing the dirty bit) depending on whether it is clean or dirty. There is one final complication: If a frame is about to be passed over because it was referenced recently, the algorithm checks whether the owning process is active, and takes the frame anyhow if not. This extra check allows the algorithm to grab the pages of processes that have been stopped by the load-control algorithm. Without it, pages of stopped processes would never get any ``older'' because the virtual time of a stopped process stops advancing.
Like CLOCK, WSClock has to be careful to avoid an infinite loop. As in the CLOCK algorithm, it may make a complete circuit of the clock finding only dirty candidate pages. In that case, it has to wait for one of the cleaning requests to finish. It may also find that all pages are unreferenced but "new" (the reference bit is clear but the comparison to tau shows the page has been referenced recently). In either case, memory is overcommitted and some process needs to be stopped. A sketch of one inspection step of this scan follows.
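One inspection step of that scan might look like the sketch below, reusing struct proc from the virtual-time sketch earlier; the arrays reference, dirty, process, and LR follow the names used above, while is_active and schedule_cleaning are invented helpers, and the loop-termination cases just described are left to the caller.

extern int reference[], dirty[];
extern struct proc *process[];
extern long LR[];
extern int  is_active(struct proc *p);
extern void schedule_cleaning(int frame);

/* Inspect frame i; return i if it can be claimed now, or -1 to move on to the next frame. */
int wsclock_inspect(int i, long tau) {
    struct proc *p = process[i];
    if (reference[i]) {                          /* referenced since last visit: second chance */
        reference[i] = 0;
        LR[i] = p->virtual_time;                 /* "now" in the owner's virtual time */
        if (is_active(p))
            return -1;                           /* pass over; the owner is still running */
    } else if (is_active(p) && p->virtual_time - LR[i] < tau) {
        return -1;                               /* third chance: still in the owner's working set */
    }
    if (dirty[i]) {
        schedule_cleaning(i);                    /* write it out; reconsider on a later pass */
        return -1;
    }
    return i;                                    /* clean and not recently needed: take it */
}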
5.4 Virtual Memory
In accord with the beautification principle, paging makes the main memory of the computer look more ``beautiful'' in several ways.
• It gives each process its own virtual memory, which looks like a private version of the main memory of the computer. In this sense, paging does for memory what the process abstraction does for the CPU. Even though the computer hardware may have only one CPU (or perhaps a few CPUs), each ``user'' can have his own private virtual CPU (process). Similarly, paging gives each process its own virtual memory, which is separate from the memories of other processes and protected from them.
• Each virtual memory looks like a linear array of bytes, with addresses starting at zero. This feature simplifies relocation: Every program can be compiled under the assumption that it will start at address zero.
• It makes the memory look bigger, by keeping infrequently used portions of the virtual memory space of a process on disk rather than in main memory. This feature both promotes more efficient sharing of the scarce memory resource among processes and allows each process to treat its memory as essentially unbounded in size. Just as a process doesn't have to worry about doing some operation that may block because it knows that the OS will run some other process while it is waiting, it doesn't have to worry about allocating lots of space to a rarely (or sparsely) used data structure because the OS will only allocate real memory to the part that's actually being used.
5.5 Segmentation
Segmentation carries this feature one step further by allowing each process to have multiple
``simulated memories.'' Each of these memories (called a segment) starts at address zero, is
independently protected, and can be separately paged. In a segmented system, a memory address
has two parts: a segment number and a segment offset. Most systems have some sort of
segmentation, but often it is quite limited. UNIX has exactly three segments per process. One
segment (called the text segment) holds the executable code of the process. It is generally read-only, fixed in size when the process starts, and shared among all processes running the same program. Sometimes read-only data (such as constants) are also placed in this segment. Another
segment (the data segment) holds the memory used for global variables. Its protection is
read/write (but usually not executable), and is normally not shared between processes. There is a
special system call to extend the size of the data segment of a process. The third segment is the
stack segment. As the name implies, it is used for the process' stack, which is used to hold
information used in procedure calls and returns (return address, saved contents of registers, etc.)
as well as local variables of procedures. Like the data segment, the stack is read/write but usually
not executable. The stack is automatically extended by the OS whenever the process causes a
fault by referencing an address beyond the current size of the stack (usually in the course of a
procedure call). It is not shared between processes. Some variants of UNIX have a fourth
segment, which contains part of the OS data structures. It is read-only and shared by all
processes.
Many application programs would be easier to write if they could have as many segments as they
liked. As an example of an application program that might want multiple segments, consider a
compiler. In addition to the usual text, data, and stack segments, it could use one segment for the
source of the program being compiled, one for the symbol table, etc. Breaking the address space
up into segments also helps sharing. For example, most programs in UNIX include the library
program printf. If the executable code of printf were in a separate segment, that segment could
easily be shared by multiple processes, allowing (slightly) more efficient sharing of physical
memory.
If you think of the virtual address as being the concatenation of the segment number and the
segment offset, segmentation looks superficially like paging. The main difference is that the
application programmer is aware of the segment boundaries, but can ignore the fact that the
address space is divided up into pages.
The implementation of segmentation is also superficially similar to the implementation of
paging. The segment number is used to index into a table of ``segment descriptors,'' each of
which contains the length and starting address of a segment as well as protection information. If
the segment offset is not less than the segment length, the MMU traps with a segmentation
violation. Otherwise, the segment offset is added to the starting address in the descriptor to get
the resulting physical address. There are several differences between the implementation of
segments and pages, all derived from the fact that the size of a segment is variable, while the size
of a page is ``built-in.''
• The size of the segment is stored in the segment descriptor and compared with the segment offset. The size of a page need not be stored anywhere because it is always the same. It is always a power of two and the page offset has just enough bits to represent any legal offset, so it is impossible for the page offset to be out of bounds. For example, if the page size is 4K (4096) bytes, the page offset is a 12-bit field, which can only contain numbers in the range 0...4095.
• The segment descriptor contains the physical address of the start of the segment. Since all page frames are required to start at an address that is a multiple of the page size, which is a power of two, the low-order bits of the physical address of a frame are always zero. For example, if pages are 4K bytes, the physical address of each page frame ends with 12 zeros. Thus a page table entry contains a frame number, which is just the high-order bits of the physical address of the frame, and the MMU concatenates the frame number with the page offset, as contrasted with adding the physical address of a segment to the segment offset.
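A sketch of the descriptor lookup and bounds check just described, ignoring paging of the segment; seg_desc_t, seg_table, and raise_segmentation_violation are invented names used only for the illustration.

typedef struct {
    unsigned long base;        /* physical address of the start of the segment */
    unsigned long length;      /* length of the segment in bytes */
    unsigned      prot;        /* protection bits */
} seg_desc_t;

extern seg_desc_t seg_table[];
extern void raise_segmentation_violation(unsigned seg, unsigned long offset);

unsigned long translate_segmented(unsigned seg, unsigned long offset) {
    seg_desc_t d = seg_table[seg];
    if (offset >= d.length)
        raise_segmentation_violation(seg, offset);   /* offset is out of bounds */
    return d.base + offset;    /* add, rather than concatenate as paging does */
}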
5.6 Implementation
Multics
One of the advantages of segmentation is that each segment can be large and can grow
dynamically. To get this effect, we have to page each segment. One way to do this is to have
each segment descriptor contain the (physical) address of a page table for the segment rather than
the address of the segment itself. This is the way segmentation works in Multics, the granddaddy
of all modern operating systems and a pioneer of the idea of segmentation. Multics ran on the
General Electric (later Honeywell) 635 computer, which was a 36-bit word-addressable machine,
which means that memory is divided into 36-bit words, with consecutive words having addresses
that differ by 1 (there were no bytes). A virtual address was 36 bits long, with the high 18 bits
interpreted as the segment number and the low 18 bits as segment offset. Although 18 bits allows
a maximum size of 2^18 = 262,144 words, the software enforced a maximum segment size of 2^16 =
65,536 words. Thus the segment offset is effectively 16 bits long. Associated with each process
is a table called the descriptor segment. There is a register called the Descriptor Segment Base
Register (DSBR) that points to it and a register called the Descriptor Segment Length Register
(DSLR) that indicates the number of entries in the descriptor segment.
Figure 5.10 Implementation of memory allocation in Multics
First the segment number in the virtual address is used to index into the descriptor segment to
find the appropriate descriptor. (If the segment number is too large, a fault occurs). The
descriptor contains permission information, which is checked to see if the current process has
rights to access the segment as requested. If that check succeeds, the memory address of a page
table for the segment is found in the descriptor. Since each page is 1024 words long, the 16-bit
segment offset is interpreted as a 6-bit page number and a 10-bit offset within the page. The page
number is used to index into the page table to get an entry containing a valid bit and frame
number. If the valid bit is set, the physical address of the desired word is found by concatenating
the frame number with the 10-bit page offset from the virtual address.
Actually, we have left out one important detail to simplify the description. The ``descriptor
segment'' really is a segment, which means it really is paged, just like any other segment. Thus
there is another page table that is the page table for the descriptor segment. The 18-bit segment
number from the virtual address is split into an 8-bit page number and a 10-bit offset. The page
number is used to select an entry from the descriptor segment's page table. That entry contains
the physical address of a page of the descriptor segment, and the page-offset field of the segment
number is used to index into that page to get the descriptor itself. The rest of the translation
occurs as described in the preceding paragraph. In total, each memory reference turns into four
accesses to memory:
• one to retrieve an entry from the descriptor segment's page table,
• one to retrieve the descriptor itself,
• one to retrieve an entry from the page table for the desired segment, and
• one to load or store the desired data.
Multics used a TLB mapping the segment number and page number within the segment to a page
frame to avoid three of these accesses in most cases.
Intel x86
The Intel 386 (and subsequent members of the x86 family used in personal computers) uses a
different approach to combining paging with segmentation. A virtual address consists of a 16-bit
segment selector and a 16 or 32-bit segment offset. The selector is used to fetch a segment
descriptor from a table (actually, there are two tables and one of the bits of the selector is used to
choose which table). The 64-bit descriptor contains the 32-bit address of the segment (called the
segment base), 21 bits indicating its length, and miscellaneous bits indicating protections and
other options. The segment length is indicated by a 20-bit limit and one bit to indicate whether
the limit should be interpreted as bytes or pages. (The segment base and limit ``fields'' are
actually scattered around the descriptor to provide compatibility with earlier versions of the
hardware.) If the offset from the original virtual address does not exceed the segment length, it is
added to the base to get a ``physical'' address called the linear address. If paging is turned off, the linear address really is the physical address. Otherwise, it is translated by a two-level page table as described previously, with the 32-bit address divided into two 10-bit page numbers and a 12-bit offset (a page is 4K on this machine).
We have to say ``generally'' here and elsewhere when we talk about UNIX because there are
many variants of UNIX in existence. Sometimes we will use the term ``classic UNIX'' to describe
the features that were in UNIX before it spread to many distinct dialects. Features in classic
UNIX are generally found in all of its dialects. Sometimes features introduced in one variant
became so popular that they were widely imitated and are now available in most dialects.
This is a good example of one of those ``popular'' features not in classic UNIX but in most modern
variants: System V (an AT&T variant of UNIX) introduced the ability to map a chunk of virtual
memory into the address spaces of multiple processes at some offset in the data segment
(perhaps a different offset in each process). This chunk is called a ``shared memory segment,''
but is not a segment in the sense we are using the term here. So-called ``System V shared
memory'' is available in most current versions of UNIX.
Many variants of UNIX get a similar effect with so-called ``shared libraries,'' which are
implemented with shared memory but without general-purpose segmentation support.
Paging Details
Real-world hardware CPUs have all sorts of ``features'' that make life hard for people trying to
write page-fault handlers in operating systems. Among the practical issues are the following.
Page Size
How big should a page be? This is really a hardware design question, but since it depends on OS
considerations, we will discuss it here. If pages are too large, lots of space will be wasted by
internal fragmentation: A process only needs a few bytes, but must take a full page. As a rough
estimate, about half of the last page of a process will be wasted on the average. Actually, the
average waste will be somewhat larger, if the typical process is small compared to the size of a
page. For example, if a page is 8K bytes and the typical process is only 1K, 7/8 of the space will
be wasted. Also, the relative amount of waste as a percentage of the space used depends on the
size of a typical process. All these considerations imply that as typical processes get bigger and
bigger, internal fragmentation becomes less and less of a problem.
On the other hand, with smaller pages it takes more page table entries to describe a given
process, leading to space overhead for the page tables, but more importantly time overhead for
any operation that manipulates them. In particular, it adds to the time needed to switch from one
process to another. The details depend on how page tables are organized. For example, if the
page tables are in registers, those registers have to be reloaded. A TLB will need more entries to
cover the same size ``working set,'' making it more expensive and requiring more time to re-load
the TLB when changing processes. In short, all current trends point to larger and larger pages in
the future.
If space overhead is the only consideration, it can be shown that the optimal size of a page is
sqrt(2se), where s is the size of an average process and e is the size of a page-table entry. This
calculation is based on balancing the space wasted by internal fragmentation against the space
used for page tables. This formula should be taken with a big grain of salt however, because it
overlooks the time overhead incurred by smaller pages.
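As a purely illustrative check of the formula: with an average process size of s = 1 megabyte and a page-table entry of e = 4 bytes, sqrt(2se) = sqrt(2 × 1,048,576 × 4) ≈ 2,896 bytes, which would suggest a page size in the 2K-4K range.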
Restarting the instruction
After the OS has brought in the missing page and fixed up the page table, it should restart the
process in such a way as to cause it to re-try the offending instruction. Unfortunately, that may
not be easy to do, for a variety of reasons.
Variable-length instructions
Some CPU architectures have instructions with varying numbers of arguments. For example the
Motorola 68000 has a move instruction with two arguments (source and target of the move). It
can cause faults for three different reasons: the instruction itself or either of the two operands.
The fault handler has to determine which reference faulted. On some computers, the OS has to
figure that out by interpreting the instruction and in effect simulating the hardware. The 68000
made it easier for the OS by updating the PC as it goes, so the PC will be pointing at the word
immediately following the part of the instruction that caused the fault. On the other hand, this
makes it harder to restart the instruction: How can the OS figure out where the instruction
started, so that it can back the PC up to retry?
Side effects
Some computers have addressing modes that automatically increment or decrement index
registers as a side effect, making it easy to simulate in one step the effect of the C statement
*p++ = *q++;. Unfortunately, if an instruction faults part-way through, it may be difficult to
figure out which registers have been modified so that they can be restored to their original state.
Some computers also have instructions such as ``move characters,'' which work on variable-length data fields, updating a pointer or count register. If an operand crosses a page boundary,
the instruction may fault part-way through, leaving a pointer or counter register modified.
Fortunately, most CPU designers know enough about operating systems to understand these
problems and add hardware features to allow the OS to recover. Either they undo the effects of
the instruction before faulting, or they dump enough information into registers somewhere that
the OS can undo them. The original 68000 did neither of these and so paging was not possible on
the 68000. It wasn't that the designers were ignorant of OS issues; it was just that there was not
enough room on the chip to add the features. However, one clever manufacturer built a box with
two 68000 CPUs and an MMU chip. The first CPU ran ``user'' code. When the MMU detected a
page fault, instead of interrupting the first CPU, it delayed responding to it and interrupted the
second CPU. The second CPU would run all the OS code necessary to respond to the fault and
then cause the MMU to retry the storage access. This time, the access would succeed and return
the desired result to the first CPU, which never realized there was a problem.
Locking Pages
There are a variety of cases in which the OS must prevent certain page frames from being chosen
by the page-replacement algorithm. For example, suppose the OS has chosen a particular frame
to service a page fault and sent a request to the disk scheduler to read in the page. The request
may take a long time to service, so the OS will allow other processes to run in the meantime. It
must be careful, however, that a fault by another process does not choose the same page frame!
A similar problem involves I/O. When a process requests an I/O operation it gives the virtual
address of the buffer the data is supposed to be read into or written out of. Since DMA devices
generally do not know anything about virtual memory, the OS translates the buffer address into a
physical memory location (a frame number and offset) before starting the I/O device. It would be
very embarrassing if the frame were chosen by the page-replacement algorithm before the I/O
operation completes. Both of these problems can be avoided by marking the frame as ineligible
for replacement. We usually say that the page in that frame is ``pinned'' in memory. An
alternative way of avoiding the I/O problem is to do the I/O operation into or out of pages that
belong to the OS kernel (and are not subject to replacement) and copying between these pages
and user pages.
Missing Reference Bits
At least one popular computer, the Digital Equipment Corp. VAX computer, did not have any
REF bits in its MMU. Some people at the University of California at Berkeley came up with a
clever way of simulating the REF bits in software. Whenever the OS cleared the simulated REF
bit for a page, it marked the hardware page-table entry for the page as invalid. When the process
first referenced the page, it would cause a page fault. The OS would note that the page really was
in memory, so the fault handler could return without doing any I/O operations, but the fault
would give the OS the chance to turn the simulated REF bit on and mark the page as valid, so
subsequent references to the page would not cause page faults. Although the software simulated
hardware with a real REF bit, the net result was that there was a rather high cost to clearing
the simulated REF bit. The people at Berkeley therefore developed a version of the CLOCK
algorithm that allowed them to clear the REF bit infrequently.
Fault Handling
Overall, the core of the OS kernel looks something like this:
// This is the procedure that gets called when an interrupt occurs;
// on some computers, there is a different handler for each "kind"
// of interrupt.
void handler() {
    save_process_state(current_PCB);
    // Some state (such as the PC) is automatically saved by the HW.
    // This code copies that info to the PCB and possibly saves some
    // more state.
    switch (what_caused_the_trap) {
    case PAGE_FAULT:
        f = choose_frame();
        if (is_dirty(f))
            schedule_write_request(f);  // to clean the frame
        else
            schedule_read_request(f);   // to read in the requested page
        record_state(current_PCB);      // to indicate what this process is up to
        make_unrunnable(current_PCB);
        current_PCB = select_some_other_ready_process();
        break;
    case IO_COMPLETION:
        p = process_that_requested_the_IO();
        switch (reason_for_the_IO) {
        case PAGE_CLEANING:
            // f here is the frame that has just been cleaned
            schedule_read_request(f);   // to read in the requested page
            break;
        case BRING_IN_NEW_PAGE:
        case EXPLICIT_IO_REQUEST:
            make_runnable(p);
            break;
        }
        break;
    case IO_REQUEST:
        schedule_io_request();
        record_state(current_PCB);      // to indicate what this process is up to
        make_unrunnable(current_PCB);
        current_PCB = select_some_other_ready_process();
        break;
    case OTHER_OS_REQUEST:
        perform_request();
        break;
    }
    // At this point, current_PCB is pointing to a process that
    // is ready to run. It may or may not be the process that was
    // running when the interrupt occurred.
    restore_state(current_PCB);
    return_from_interrupt(current_PCB);
    // This hardware instruction restores the PC (and possibly other
    // hardware state) and allows the indicated process to continue.
}
5.7 Exercises
1. Name two differences between logical and physical addresses.
Answer:
A logical address does not refer to an actual existing address; rather, it refers to an abstract
address in an abstract address space. Contrast this with a physical address that refers to an
actual physical address in memory. A logical address is generated by the CPU and is
translated into a physical address by the memory management unit (MMU). Therefore,
physical addresses are generated by the MMU.
2. Consider a system in which a program can be separated into two parts: code and data. The
CPU knows whether it wants an instruction (instruction fetch) or data (data fetch or store).
Therefore, two base–limit register pairs are provided: one for instructions and one for data.
The instruction base–limit register pair is automatically read-only, so programs can be shared
among different users. Discuss the advantages and disadvantages of this scheme.
Answer:
The major advantage of this scheme is that it is an effective mechanism for code and data
sharing. For example, only one copy of an editor or a compiler needs to be kept in memory,
and this code can be shared by all processes needing access to the editor or compiler code.
Another advantage is protection of code against erroneous modification.
The only disadvantage is that the code and data must be kept separate, a constraint that
compiler-generated code usually already satisfies.
3. Why are page sizes always powers of 2?
Answer:
Recall that paging is implemented by breaking up an address into a page and offset number.
It is most efficient to break the address into X page bits and Y offset bits, rather than perform
arithmetic on the address to calculate the page number and offset. Because each bit position
represents a power of 2, splitting an address between bits results in a page size that is a power
of 2.
4. Consider a logical address space of eight pages of 1024 words each, mapped onto a physical
memory of 32 frames.
(a) How many bits are there in the logical address?
(b) How many bits are there in the physical address?
Answer:
(a) Logical address: 13 bits, since 8 pages of 1024 words each give 2^3 * 2^10 = 2^13 addressable words.
(b) Physical address: 15 bits, since 32 frames of 1024 words each give 2^5 * 2^10 = 2^15 addressable words.
5. What is the effect of allowing two entries in a page table to point to the same page frame in
memory? Explain how this effect could be used to decrease the amount of time needed to
copy a large amount of memory from one place to another. What effect would updating some
byte on the one page have on the other page?
Answer:
By allowing two entries in a page table to point to the same page frame in memory, users can
share code and data. If the code is reentrant, much memory space can be saved through the
shared use of large programs such as text editors, compilers, and database systems.
“Copying” a large amount of memory can be effected by having two page-table entries point
to the same memory location: the data is shared rather than physically copied, so the “copy”
appears to happen almost instantly. However, sharing of non-reentrant code or data means that any
user having access to the code can modify it, and these modifications would be reflected in
the other user’s “copy.”
6. Device Management
So far, we have covered how an operating system manages CPU and memory resources.
However, a computer is not so interesting without I/O devices (e.g., hard drives, network cards,
screen displays, keyboards, mice, rats, and so on). Device management is the part of the OS that
manages hardware devices. Device management tries to (1) provide a uniform interface to ease
the access to devices with different physical characteristics, and (2) optimize the performance of
individual devices.
6.1 I/O Devices
I/O devices can be roughly divided into two categories. A block device (e.g., disks) stores
information in fixed-size blocks, each one with its own address. A character device (e.g.,
keyboards, printers, network cards) delivers or accepts a stream of characters, and individual
characters are not addressable.
A device is connected to a computer through an electronic component, or a device controller,
which converts between the serial bit stream and a block of bytes and performs error correction if
necessary. Each controller has a few device registers that are used for communicating with the
CPU, and a data buffer that an OS can read or write. Since the number of device registers and
the nature of device instructions vary from device to device, a device driver (an OS component) is
responsible for hiding the complexity of an I/O device, so that the OS can access various devices in
a relatively uniform manner.
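To make this concrete, here is a minimal Java sketch of the kind of uniform interface a device-driver layer might present (the interface and method names, such as BlockDevice and readBlock, are invented for illustration); each concrete driver hides its controller's registers and quirks behind the same small set of operations.

// A uniform interface that the rest of the OS programs against.
interface BlockDevice {
    int blockSize();                            // bytes per block
    void readBlock(long blockNo, byte[] buf);   // fill buf from the device
    void writeBlock(long blockNo, byte[] buf);  // write buf to the device
}

// One concrete driver; a different device would provide a different
// implementation behind the same interface.
class RamDiskDriver implements BlockDevice {
    private final byte[][] blocks;
    private final int blockSize;

    RamDiskDriver(int numBlocks, int blockSize) {
        this.blocks = new byte[numBlocks][blockSize];
        this.blockSize = blockSize;
    }

    public int blockSize() { return blockSize; }

    public void readBlock(long blockNo, byte[] buf) {
        System.arraycopy(blocks[(int) blockNo], 0, buf, 0, blockSize);
    }

    public void writeBlock(long blockNo, byte[] buf) {
        System.arraycopy(buf, 0, blocks[(int) blockNo], 0, blockSize);
    }
}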
Figure 6.1 Structure of I/O system (layers, from top to bottom: user applications at the user level; various OS components and device drivers at the OS level; device controllers and I/O devices at the hardware level)
6.2 Device Addressing
In general, there are two approaches to addressing these device registers and data buffers. The
first approach is to give each device a dedicated range of device addresses, separate from the
memory address space, so accessing those device addresses requires special I/O instructions. The
second approach (memory-mapped I/O) does not distinguish device addresses from normal memory
addresses, so devices can be accessed the same way as normal memory, with the same set of
hardware instructions.
Figure 6.2 Device I/O addressing (left: separate device addresses, where devices occupy their own address range apart from primary memory; right: memory-mapped I/O, where device registers appear within the normal memory address space)
6.3 Device Accesses
Regardless of the device addressing approach, the operating system has to track the status of a
device in order to exchange data with it. The simplest approach is to use polling, where the CPU
repeatedly checks the device's status until the device is ready to exchange data.
However, wasting CPU cycles on busy-waiting is undesirable. A better approach is to use
interrupt-driven I/Os, where a device controller notifies the corresponding device driver when
the device is available. Although the interrupt-driven approach is much more efficient than
polling, the CPU is still actively involved in copying data between the device and memory.
Also, interrupt-driven I/Os still impose high overheads for character devices. For example, a
printer raises one interrupt per byte, so the overhead of handling the interrupt far exceeds the cost of
transmitting a single byte.
An even better approach is to use an additional direct memory access (DMA) controller to
perform the actual movements of data, so the CPU can use the cycles for computation as
opposed to copying data.
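As a rough illustration of what the better approaches eliminate, here is a Java sketch of the polling style of access (the DeviceRegisters class and its fields are invented for illustration); the busy-wait loop is exactly the CPU time that interrupt-driven and DMA-based I/O try to reclaim.

class PollingIo {
    // Stand-ins for a controller's status register and data buffer.
    static class DeviceRegisters {
        volatile boolean ready;   // set by the device when a byte is available
        volatile byte data;       // the byte the device produced
    }

    // Read one byte by polling: the CPU spins until the device is ready.
    static byte readByte(DeviceRegisters dev) {
        while (!dev.ready) {
            // busy-waiting: these CPU cycles do no useful work
        }
        byte b = dev.data;
        dev.ready = false;        // acknowledge, so the device can proceed
        return b;
    }
}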
The use of DMA alone still has room for improvement. Since a process cannot access the data
that is being brought into memory at the moment, due to mutual exclusion, a more efficient
approach is to pipeline the data transfer. The double buffering technique uses two buffers in the
following way: while one is being used, the other is being filled. Double buffering is also used
extensively for graphics and smooth animation. While the screen displays an image frame from
one buffer in the video controller, a separate buffer is being filled pixel-by-pixel in the
background, so a viewer does not see the line-by-line scanning on the screen. Once the
background buffer is filled, the video controller switches the roles of the two buffers and displays
from the freshly filled buffer.
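A minimal Java sketch of double buffering follows (the class and method names are invented for illustration): one buffer is consumed while the other is being filled, and the two swap roles when the fill completes.

class DoubleBuffer {
    private byte[] front;   // being consumed (e.g., displayed or written out)
    private byte[] back;    // being filled in the background

    DoubleBuffer(int size) {
        front = new byte[size];
        back = new byte[size];
    }

    // The producer fills the back buffer while the front buffer is in use.
    byte[] backBuffer() { return back; }

    // The consumer reads from the front buffer.
    byte[] frontBuffer() { return front; }

    // Called when the back buffer is full: swap the roles of the buffers,
    // so the freshly filled data becomes visible to the consumer.
    synchronized void swap() {
        byte[] tmp = front;
        front = back;
        back = tmp;
    }
}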
6.4 Overlapped I/O and CPU Processing
By freeing up CPU cycles while devices are serving requests, CPU-bound processes can be
executed concurrently with I/O-bound processes. For example, if process A is CPU-bound, and
process B is I/O-bound, the system as a whole can reach high utilization by overlapping CPU
and I/O processing effectively.
Figure 6.3 I/O and CPU processing (process A loops over 90 msec of CPU work followed by 10 msec of I/O; process B loops over 10 msec of CPU work followed by 90 msec of I/O; when run together, A's CPU bursts overlap with B's I/O waits and vice versa)
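A quick back-of-the-envelope check of this example, written as a small Java sketch using the numbers from Figure 6.3: run alone, process A keeps the CPU busy only 90% of the time and process B only 10%, but if A computes while B waits for its I/O and vice versa, both the CPU and the I/O device can stay close to fully busy.

class OverlapUtilization {
    public static void main(String[] args) {
        double aCpu = 90, aIo = 10;   // msec per iteration of process A
        double bCpu = 10, bIo = 90;   // msec per iteration of process B

        // Each process running alone:
        System.out.printf("A alone: CPU busy %.0f%%%n", 100 * aCpu / (aCpu + aIo));
        System.out.printf("B alone: CPU busy %.0f%%%n", 100 * bCpu / (bCpu + bIo));

        // Ideal overlap: in every 100 msec the CPU runs A for 90 msec and B
        // for 10 msec, while the I/O device serves B for 90 msec and A for 10.
        double period = 100;
        System.out.printf("Overlapped: CPU busy %.0f%%, I/O busy %.0f%%%n",
                100 * (aCpu + bCpu) / period, 100 * (aIo + bIo) / period);
    }
}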
6.5 Disk as an Example Device
The hard disk is a storage technology that is more than 30 years old, yet it is incredibly complicated. A
modern hard drive comes with roughly 250,000 lines of microcode to govern its various components.
Hardware Characteristics
Briefly, a hard drive consists of a disk arm and disk platters. Disk platters are coated with
magnetic materials for recording. The disk arm moves a comb of disk heads, among which only
one disk head is active for reading and writing.
One fascinating detail is that the heads are aerodynamically designed to fly as close to the surface as
possible. In fact, the flying height is so small that there is barely room for air molecules, and some drives
are filled with a special inert gas to help fly the disk heads. If a head touches the surface, the result is a
head crash, which scrapes off the magnetic information.
Each disk platter is further divided into concentric tracks of storage, and each track is divided
into sectors (typically 512 bytes). Each sector is a minimum unit of disk storage. A cylinder
consists of all tracks with a given arm position.
Figure 6.4 Configuration of a hard disk (disk platters divided into concentric tracks and sectors, with the disk arm positioning the heads over a track)
A modern hard drive also takes advantage of the disk geometry. Disk cylinders are further
grouped into zones, so zones near the edge of the disk can store more information than zones
near the center of the disk due to the differences in storage area (also known as zone-bit
recording). More information stored in outer zones also means that the transfer rate (rotational
speed multiplied by the information stored in a cylinder) is higher near the edge of the disk.
Since moving a disk arm from one track to the next takes time, the starting position of the next
track is slightly skewed (track skew), so that a sequential transfer of bytes across multiple tracks
can incur minimum rotational delay.
A hard drive also periodically performs thermal calibrations, which adjust the disk head
positioning to compensate for changes in the platter geometry caused by temperature changes. To
account for other minor physical inaccuracies, typically 100 to 1000 bits are inserted between
sectors.
A Simple Model of Disk Performance. The access time to read or write a disk sector includes
three components:
• Seek time: the time to position the heads over a cylinder (~8 msec on average).
• Rotational delay: the time to wait for the target sector to rotate underneath the head.
Assuming a speed of 7,200 rotations per minute, or 120 rotations per second, each
rotation takes ~8 msec, and the average rotational delay is ~4 msec.
• Transfer time: the time to transfer the bytes. Assuming a peak bandwidth of 58 Mbytes/sec,
transferring a disk block of 4 Kbytes takes ~0.07 msec.
Thus, the overall time to perform a disk I/O = seek time + rotational delay + transfer time. The
sum of the seek time and the rotational delay is the disk latency, or the time to initiate a transfer.
The transfer rate is the disk bandwidth.
If a disk block is randomly placed on the disk, the access time is roughly 12 msec to fetch 4 Kbytes
of data, for a bandwidth of roughly 340 Kbytes/sec.
If a disk block is randomly located on the same disk cylinder as the current disk arm position, there is
no seek time and the access time is roughly 4 msec, for a bandwidth of roughly 1 Mbyte/sec.
If the next sector is on the same track, there is no seek time or rotational delay, and the data is
transferred at the peak bandwidth of 58 Mbytes/sec.
Therefore, the key to using the hard drive effectively is to minimize the seek time and rotational
latency.
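The model can be checked with a short Java sketch using the figures above (8 msec average seek, 4 msec average rotational delay, 58 Mbytes/sec peak bandwidth); the helper names are invented and the results are approximate.

class DiskAccessTime {
    static final double SEEK_MS = 8.0;        // average seek time
    static final double ROTATION_MS = 4.0;    // average rotational delay
    static final double PEAK_MB_PER_S = 58.0; // peak transfer bandwidth

    static double transferMs(double kbytes) {
        return kbytes / 1024.0 / PEAK_MB_PER_S * 1000.0;
    }

    public static void main(String[] args) {
        double block = 4.0; // Kbytes
        double random = SEEK_MS + ROTATION_MS + transferMs(block);
        double sameCylinder = ROTATION_MS + transferMs(block);
        double sameTrack = transferMs(block);

        System.out.printf("random block:   %.2f msec (%.0f Kbytes/sec)%n",
                random, block / (random / 1000.0));
        System.out.printf("same cylinder:  %.2f msec (%.0f Kbytes/sec)%n",
                sameCylinder, block / (sameCylinder / 1000.0));
        System.out.printf("same track (no seek or rotation): %.3f msec%n", sameTrack);
    }
}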
Disk Tradeoffs
One design decision is the size of a disk sector.
Sector size   Space utilization                 Transfer rate
1 byte        8 bits / 1008 bits (0.8%)         80 bytes/sec (1 byte / 12 msec)
4 Kbytes      4096 bytes / 4221 bytes (97%)     340 Kbytes/sec (4 Kbytes / 12 msec)
1 Mbyte       ~100%                             58 Mbytes/sec (peak bandwidth)
Table 6.1 Wasteful allocation of disk space
A bigger sector size yields a more effective transfer rate from the hard drive. However, a large
allocation granularity is wasteful if only 1 byte out of a 1-Mbyte sector is actually needed for storage.
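The trend in Table 6.1 can be reproduced with a short Java sketch of the underlying calculation (a minimal model under stated assumptions: 125 bytes of inter-sector overhead, matching the roughly 1000 bits mentioned above, a 12-msec positioning cost per access, and a 58 Mbytes/sec peak bandwidth); it matches the small-sector rows closely and shows the effective transfer rate approaching the peak bandwidth as the sector size grows.

class SectorSizeTradeoff {
    static final double OVERHEAD_BYTES = 125;      // ~1000 bits between sectors
    static final double POSITION_MS = 12;          // seek + rotational delay
    static final double PEAK_BYTES_PER_MS = 58.0 * 1024 * 1024 / 1000;

    static void show(String label, double sectorBytes) {
        double utilization = sectorBytes / (sectorBytes + OVERHEAD_BYTES);
        double accessMs = POSITION_MS + sectorBytes / PEAK_BYTES_PER_MS;
        double rate = sectorBytes / (accessMs / 1000);   // bytes per second
        System.out.printf("%-8s utilization %5.1f%%  transfer %10.0f bytes/sec%n",
                label, 100 * utilization, rate);
    }

    public static void main(String[] args) {
        show("1 byte", 1);
        show("4 KB", 4 * 1024);
        show("1 MB", 1024 * 1024);
    }
}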
6.6 Disk Controller and Disk Device Driver
Two popular disk controllers are SCSI (small computer systems interface), and IDE (integrated
device electronics). Since they are not a part of the OS, please surf the net for more information.
One major function of the disk device driver is to reduce the seek time for disk accesses. Since a
disk can serve only one request at a time, the device driver can schedule the pending requests in such
a way as to minimize disk arm movement. There are a handful of disk scheduling strategies.
Please read Nutt’s book for detailed examples.
FIFO
Requests are served in the order of arrival. This policy is fair among requesters, but requests
may land on random spots on disk. Therefore, the seek time may be long.
SSTF (Shortest Seek Time First)
The shortest seek time first approach picks the request that is closest to the current disk arm
position. (Although called the shortest seek time first, this approach actually includes the
rotational delay in calculation, since rotation can be as long as seek.) SSTF is good at reducing
seeks, but may result in starvation.
SCAN
SCAN implements an elevator algorithm. It takes the closest request in the direction of travel. It
guarantees no starvation, but retains the flavor of SSTF. However, if a disk is heavily loaded
with requests, a new request at a location that has been just recently scanned can wait for almost
two full scans of the disk.
C-SCAN (Circular SCAN)
For C-SCAN, the disk arm always serves requests by scanning in one direction. Once the arm
finishes scanning for one direction, it quickly returns to the 0th track for the next round of
scanning.
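Below is a minimal Java sketch of two of these policies, SSTF and SCAN, operating on a queue of pending cylinder numbers (the class and method names are invented for illustration; a real driver would also have to cope with rotational position and with requests that arrive while the arm is moving).

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class DiskScheduling {
    // SSTF: repeatedly pick the pending request closest to the current head position.
    static List<Integer> sstf(int head, List<Integer> pending) {
        List<Integer> queue = new ArrayList<>(pending);
        List<Integer> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            Integer closest = queue.get(0);
            for (Integer r : queue)
                if (Math.abs(r - head) < Math.abs(closest - head)) closest = r;
            queue.remove(closest);
            order.add(closest);
            head = closest;
        }
        return order;
    }

    // SCAN (elevator): serve requests in one direction, then reverse.
    static List<Integer> scan(int head, boolean movingUp, List<Integer> pending) {
        List<Integer> sorted = new ArrayList<>(pending);
        Collections.sort(sorted);
        List<Integer> up = new ArrayList<>(), down = new ArrayList<>();
        for (int r : sorted) (r >= head ? up : down).add(r);
        Collections.reverse(down);               // nearest-first when moving down
        List<Integer> order = new ArrayList<>();
        if (movingUp) { order.addAll(up); order.addAll(down); }
        else          { order.addAll(down); order.addAll(up); }
        return order;
    }

    public static void main(String[] args) {
        List<Integer> pending = List.of(98, 183, 37, 122, 14, 124, 65, 67);
        System.out.println("SSTF: " + sstf(53, pending));
        System.out.println("SCAN: " + scan(53, true, pending));
    }
}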
6.7 Exercises
1. The accelerating seek described in Exercise 12.3 is typical of hard-disk drives. By contrast,
floppy disks (and many hard disks manufactured before the mid-1980s) typically seek at a
fixed rate. Suppose that the disk in Exercise 12.3 has a constant-rate seek rather than a
constant acceleration seek, so the seek time is of the form t = x + yL, where t is the time in
milliseconds and L is the seek distance. Suppose that the time to seek to an adjacent cylinder
is 1 millisecond, as before, and is 0.5 milliseconds for each additional cylinder.
(a) Write an equation for this seek time as a function of the seek distance.
(b) Using the seek-time function from part a, calculate the total seek time for each of the
schedules in Exercise 12.2. Is your answer the same as it was for Exercise 12.3(c)?
(c) What is the percentage speedup of the fastest schedule over FCFS in this case?
Answer:
(a) The equation is t = 0.95 + 0.05L
(b) FCFS 362.60; SSTF 95.80; SCAN 497.95; LOOK 174.50; C-SCAN 500.15 (and C-LOOK 176.70). SSTF is still the winner, and LOOK is the runner-up.
(c) (362.60 − 95.80)/362.60 = 0.74. The percentage speedup of SSTF over FCFS is 74% with
respect to the seek time. If we include the overhead of rotational latency and data transfer, the
percentage speedup will be less.
2. Is disk scheduling, other than FCFS scheduling, useful in a single-user environment? Explain
your answer.
Answer:
In a single-user environment, the I/O queue usually is empty. Requests generally arrive from
a single process for one block or for a sequence of consecutive blocks. In these cases, FCFS
is an economical method of disk scheduling. But LOOK is nearly as easy to program and will
give much better performance when multiple processes are performing concurrent I/O, such
as when a Web browser retrieves data in the background while the operating system is
paging and another application is active in the foreground.
3. Explain why SSTF scheduling tends to favor middle cylinders over the innermost and
outermost cylinders.
Answer:
The center of the disk is the location having the smallest average distance to all other tracks.
Thus the disk head tends to move away from the edges of the disk. Here is another way to
think of it. The current location of the head divides the cylinders into two groups. If the head
is not in the center of the disk and a new request arrives, the new request is more likely to be
in the group that includes the center of the disk; thus, the head is more likely to move in that
direction.
4. Why is rotational latency usually not considered in disk scheduling? How would you modify
SSTF, SCAN, and C-SCAN to include latency optimization?
Answer:
Most disks do not export their rotational position information to the host. Even if they did,
the time for this information to reach the scheduler would be subject to imprecision and the
time consumed by the scheduler is variable, so the rotational position information would
become incorrect. Further, the disk requests are usually given in terms of logical block
numbers, and the mapping between logical blocks and physical locations is very complex.
5. How would use of a RAM disk affect your selection of a disk-scheduling algorithm? What
factors would you need to consider? Do the same considerations apply to hard-disk
scheduling, given that the file system stores recently used blocks in a buffer cache in main
memory?
Answer:
Disk scheduling attempts to reduce the overhead time of disk head positioning. Since a RAM
disk has uniform access times, scheduling is largely unnecessary. The comparison between
RAM disk and the main memory disk-cache has no implications for hard-disk scheduling
because we schedule only the buffer cache misses, not the requests that find their data in
main memory.
6. Why is it important to balance file system I/O among the disks and controllers on a system in
a multitasking environment?
Answer:
A system can perform only at the speed of its slowest bottleneck. Disks or disk controllers
are frequently the bottleneck in modern systems as their individual performance cannot keep
up with that of the CPU and system bus. By balancing I/O among disks and controllers,
neither an individual disk nor a controller is overwhelmed, so that bottleneck is avoided.
7. File Management
7.1 General Concepts
Just as the process abstraction beautifies the hardware by making a single CPU (or a small
number of CPUs) appear to be many CPUs, one per ``user,'' the file system beautifies the
hardware disk, making it appear to be a large number of disk-like objects called files. Like a disk,
a file is capable of storing a large amount of data cheaply, reliably, and persistently. The fact that
there are lots of files is one form of beautification: Each file is individually protected, so each
user can have his own files, without the expense of requiring each user to buy his own disk. Each
user can have lots of files, which makes it easier to organize persistent data. The file system also
makes each individual file more beautiful than a real disk. At the very least, it erases block
boundaries, so a file can be any length (not just a multiple of the block size) and programs can
read and write arbitrary regions of the file without worrying about whether they cross block
boundaries. Some systems (not UNIX) also provide assistance in organizing the contents of a
file.
Systems use the same sort of device (a disk drive) to support both virtual memory and files. The
question arises why these have to be distinct facilities, with vastly different user interfaces. The
answer is that they don't. In Multics, there was no difference whatsoever. Everything in Multics
was a segment. The address space of each running process consisted of a set of segments (each
with its own segment number), and the ``file system'' was simply a set of named segments. To
access a segment from the file system, a process would pass its name to a system call that
assigned a segment number to it. From then on, the process could read and write the segment
simply by executing ordinary loads and stores. For example, if the segment was an array of
integers, the program could access the ith number with a notation like a[i] rather than having to
seek to the appropriate offset and then execute a read system call. If the block of the file
containing this value wasn't in memory, the array access would cause a page fault, which was
serviced as explained in the previous chapter.
This user-interface idea, sometimes called ``single-level store,'' is a great idea. So why is it not
common in current operating systems? In other words, why are virtual memory and files
presented as very different kinds of objects? There are several explanations one might propose:
The address space of a process is small compared to the size of a file system.
There is no reason why this has to be so. In Multics, a process could have up to 256K segments,
but each segment was limited to 64K words. Multics allowed for lots of segments because every
``file'' in the file system was a segment. The upper bound of 64K words per segment was
considered large by the standards of the time; the hardware actually allowed segments of up to
256K words (over one megabyte). Most new processors introduced in the last few years allow
64-bit virtual addresses. In a few years, such processors will dominate. So there is no reason why
the virtual address space of a process cannot be large enough to include the entire file system.
The virtual memory of a process is transient--it goes away when the process terminates--while
files must be persistent.
Multics showed that this doesn't have to be true. A segment can be designated as ``permanent,''
meaning that it should be preserved after the process that created it terminates. Permanent
segments do raise a need for one ``file-system-like'' facility: the ability to give names to segments
so that new processes can find them.
Files are shared by multiple processes, while the virtual address space of a process is associated
with only that process.
Most modern operating systems (including most variants of UNIX) provide some way for
processes to share portions of their address spaces anyhow, so this is a particularly weak
argument for a distinction between files and segments.
The real reason single-level store is not ubiquitous is probably a concern for efficiency. The
usual file-system interface encourages a particular style of access: Open a file, go through it
sequentially, copying big chunks of it to or from main memory, and then close it. While it is
possible to access a file like an array of bytes, jumping around and accessing the data in tiny
pieces, it is awkward. Operating system designers have found ways to implement files that make
the common ``file like'' style of access very efficient. While there appears to be no reason in
principle why memory-mapped files cannot be made to give similar performance when they are
accessed in this way, in practice, the added functionality of mapped files always seems to pay a
price in performance. Besides, if it is easy to jump around in a file, applications programmers
will take advantage of it, overall performance will suffer, and the file system will be blamed.
Naming
Every file system provides some way to give a name to each file. We will consider only names
for individual files here, and talk about directories later. The name of a file is (at least
sometimes) meant to be used by human beings, so it should be easy for humans to use. Different
operating systems put different restrictions on names.
Size
Some systems put severe restrictions on the length of names. For example DOS restricts names
to 11 characters, while early versions of UNIX (and some still in use today) restrict names to 14
characters. The Macintosh operating system, Windows 95, and most modern versions of UNIX
allow names to be essentially arbitrarily long. We say ``essentially'' since names are meant to be
used by humans, so they don't really need to be all that long. A name that is 100 characters long is
just as difficult to use as one that is forced to be under 11 characters long (but for different
reasons). Most modern versions of UNIX, for example, restrict names to a limit of 255
characters.
Case
Are upper and lower case letters considered different? The UNIX tradition is to consider the
names Foo and foo to be completely different and unrelated names. In DOS and its descendants,
however, they are considered the same. Some systems translate names to one case (usually upper
case) for storage. Others retain the original case, but consider it simply a matter of decoration.
For example, if you create a file named ``Foo,'' you could open it as ``foo'' or ``FOO,'' but if you
list the directory, you would still see the file listed as ``Foo''.
Character Set
Different systems put different restrictions on what characters can appear in file names. The
UNIX directory structure supports names containing any character other than NUL (the byte
consisting of all zero bits), but many utility programs (such as the shell) would have troubles
with names that have spaces, control characters or certain punctuation characters (particularly
`/'). MacOS allows all of these (e.g., it is not uncommon to see a file name with the Copyright
symbol © in it). With the world-wide spread of computer technology, it is becoming increasingly
important to support languages other than English, and in fact alphabets other than Latin. There
is a move to support character strings (and in particular file names) in the Unicode character set,
which devotes 16 bits to each character rather than 8 and can represent the alphabets of all major
modern languages from Arabic to Devanagari to Telugu to Khmer.
Format
It is common to divide a file name into a base name and an extension that indicates the type of
the file. DOS requires that each name be composed of a base name of eight or fewer characters and
an extension of three or fewer characters. When the name is displayed, it is represented as
base.extension. UNIX internally makes no such distinction, but it is a common convention to
include exactly one period in a file name (e.g. foo.c for a C source file).
7.2 File System Structure
UNIX hides the ``chunkiness'' of tracks, sectors, etc. and presents each file as a ``smooth'' array
of bytes with no internal structure. Application programs can, if they wish, use the bytes in the
file to represent structures. For example, a wide-spread convention in UNIX is to use the newline
character (the character with bit pattern 00001010) to break text files into lines. Some other
systems provide a variety of other types of files. The most common are files that consist of an
array of fixed or variable size records and files that form an index mapping keys to values.
Indexed files are usually implemented as B-trees.
File Types
Most systems divide files into various ``types.'' The concept of ``type'' is a confusing one,
partially because the term ``type'' can mean different things in different contexts. UNIX initially
supported only four types of files: directories, two kinds of special files (discussed later), and
``regular'' files. Just about any type of file is considered a ``regular'' file by UNIX. Within this
category, however, it is useful to distinguish text files from binary files; within binary files there
are executable files (which contain machine-language code) and data files; text files might be
source files in a particular programming language (e.g. C or Java) or they may be humanreadable text in some mark-up language such as html (hypertext markup language). Data files
may be classified according to the program that created them or is able to interpret them, e.g., a
file may be a Microsoft Word document or Excel spreadsheet or the output of TeX. The
possibilities are endless.
In general (not just in UNIX) there are three ways of indicating the type of a file:
1. The operating system may record the type of a file in meta-data stored separately from
the file, but associated with it. UNIX only provides enough meta-data to distinguish a
regular file from a directory (or special file), but other systems support more types.
2. The type of a file may be indicated by part of its contents, such as a header made up of
the first few bytes of the file. In UNIX, files that store executable programs start with a
two byte magic number that identifies them as executable and selects one of a variety of
executable formats. In the original UNIX executable format, called the a.out format, the
magic number is the octal number 0407, which happens to be the machine code for a
branch instruction on the PDP-11 computer, one of the first computers to implement
UNIX. The operating system could run a file by loading it into memory and jumping to
the beginning of it. The 0407 code, interpreted as an instruction, jumps to the word
following the 16-byte header, which is the beginning of the executable code in this
format. The PDP-11 computer is extinct by now, but it lives on through the 0407 code!
3. The type of a file may be indicated by its name. Sometimes this is just a convention, and
sometimes it's enforced by the OS or by certain programs. For example, the UNIX Java
compiler refuses to believe that a file contains Java source unless its name ends with
.java.
Some systems enforce the types of files more vigorously than others. File types may be enforced
• Not at all,
• Only by convention,
• By certain programs (e.g. the Java compiler), or
• By the operating system itself.
UNIX tends to be very lax in enforcing types.
7.3 Access Methods and Protection
Many systems support various access modes for operations on a file, such as sequential, random,
and indexed.
• Sequential. Read or write the next record or next n bytes of the file. Usually, sequential
access also allows a rewind operation.
• Random. Read or write the nth record or bytes i through j. UNIX provides an equivalent
facility by adding a seek operation to the sequential operations listed above. This
packaging of operations allows random access but encourages sequential access.
• Indexed. Read or write the record with a given key. In some cases, the ``key'' need not be
unique--there can be more than one record with the same key. In this case, programs use
a combination of indexed and sequential operations: get the first record with a given key,
then get other records with the same key by doing sequential reads.
Note that access modes are distinct from file structure--e.g., a record-structured file can be
accessed either sequentially or randomly--but the two concepts are not entirely unrelated. For
example, indexed access mode only makes sense for indexed files.
File Attributes. This is the area where there is the most variation among file systems. Attributes
can also be grouped by general category.
Ownership and Protection. Owner, owner's ``group,'' creator, access-control list (information
about who can do what to this file; for example, perhaps the owner can read or modify it, other
members of his group can only read it, and others have no access).
Time stamps. Time created, time last modified, time last accessed, time the attributes were last
changed, etc. UNIX maintains the last three of these. Some systems record not only when the file
was last modified, but by whom.
Sizes. Current size, size limit, ``high-water mark'', space consumed (which may be larger than
size because of internal fragmentation or smaller because of various compression techniques).
Type Information. As described above: File is ASCII, is executable, is a ``system'' file, is an
Excel spread sheet, etc.
Misc. Some systems have attributes describing how the file should be displayed when a directory
is listed. For example, MacOS records an icon to represent the file and the screen coordinates
where it was last displayed. DOS has a ``hidden'' attribute meaning that the file is not normally
shown. UNIX achieves a similar effect by convention: the ls program that is usually used to list
files does not show files with names that start with a period unless you explicitly request it to (with
the -a option).
UNIX records a fixed set of attributes in the meta-data associated with a file. If you want to
record some fact about the file that is not included among the supported attributes, you have to
use one of the tricks listed above for recording type information: encode it in the name of the
file, put it into the body of the file itself, or store it in a file with a related name (e.g.
``foo.attributes''). Other systems (notably MacOS and Windows NT) allow new attributes to be
invented on the fly. In MacOS, each file has a resource fork, which is a list of (attribute-name,
attribute-value) pairs. The attribute name can be any four-character string, and the attribute value
can be anything at all. Indeed, some kinds of files put the entire ``contents'' of the file in an
attribute and leave the ``body'' of the file (called the data fork) empty.
Operations
POSIX, a standard API (application programming interface) based on UNIX, provides the
following operations (among others) for manipulating files:
fd = open(name, operation)
fd = creat(name, mode)
status = close(fd)
byte_count = read(fd, buffer, byte_count)
byte_count = write(fd, buffer, byte_count)
offset = lseek(fd, offset, whence)
status = link(oldname, newname)
status = unlink(name)
status = stat(name, buffer)
status = fstat(fd, buffer)
status = utimes(name, times)
status = chown(name, owner, group) or fchown(fd, owner, group)
status = chmod(name, mode) or fchmod(fd, mode)
status = truncate(name, size) or ftruncate(fd, size)
Status. Many functions return a ``status'' which is either 0 for success or -1 for errors (there is
another mechanism to get more information about what went wrong). Other functions also use -1 as a
return value to indicate an error.
Name. A character-string name for a file.
Fd. A ``file descriptor'', which is a small non-negative integer used as a short, temporary name
for a file during the lifetime of a process.
Buffer. The memory address of the start of a buffer for supplying or receiving data.
Whence. One of three codes, signifying from start, from end, or from current location.
Mode. A bit-mask specifying protection information.
Operation. An integer code, one of read, write, read and write, and perhaps a few other
possibilities such as append only.
The open call finds a file and assigns a descriptor to it. It also indicates how the file will be used
by this process (read only, read/write, etc.). The creat call is similar, but creates a new (empty)
file. The mode argument specifies protection attributes (such as ``writable by owner but read-only
by others'') for the new file. (Most modern versions of UNIX have merged creat into open by
adding an optional mode argument and allowing the operation argument to specify that the file
should be automatically created if it doesn't already exist.) The close call simply announces that fd is
no longer in use and can be reused for another open or creat.
The read and write operations transfer data between a file and memory. The starting location in
memory is indicated by the buffer parameter; the starting location in the file (called the seek
pointer) is wherever the last read or write left off. The result is the number of bytes transferred.
For write it is normally the same as the byte_count parameter unless there is an error. For read it
may be smaller if the seek pointer starts out near the end of the file. The lseek operation adjusts
the seek pointer (it is also automatically updated by read and write). The specified offset is added
to zero, the current seek pointer, or the current size of the file, depending on the value of whence.
The function link adds a new name (alias) to a file, while unlink removes a name. There is no
function to delete a file; the system automatically deletes it when there are no remaining names
for it.
The stat function retrieves meta-data about the file and puts it into a buffer (in a fixed,
documented format), while the remaining functions can be used to update the meta-data: utimes
updates time stamps, chown updates ownership, chmod updates protection information, and
truncate changes the size (files can be made bigger by write, but only truncate can make them
smaller). Most of these functions come in two flavors: one that takes a file name and one that takes
a descriptor for an open file.
To learn more details about any of these functions, type something like
man 2 lseek
to any UNIX system. The `2' means to look in section 2 of the manual, where system calls are
explained.
Other systems have similar operations, and perhaps a few more. For example, indexed or
indexed sequential files would require a version of seek to specify a key rather than an offset. It
is also common to have a separate append operation for writing to the end of a file.
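As a rough illustration of the sequential-plus-seek style these calls encourage, here is a small Java sketch using RandomAccessFile, whose seek, read, and write methods play roughly the roles of lseek, read, and write; this is only an analogue for illustration, not the POSIX C interface itself.

import java.io.IOException;
import java.io.RandomAccessFile;

class FileOpsDemo {
    public static void main(String[] args) throws IOException {
        // Roughly: fd = open("demo.dat", read/write, create if needed)
        try (RandomAccessFile f = new RandomAccessFile("demo.dat", "rw")) {
            byte[] out = "hello, file system".getBytes();
            f.write(out);                 // write at the current seek pointer

            f.seek(7);                    // like lseek(fd, 7, SEEK_SET)
            byte[] in = new byte[4];
            int n = f.read(in);           // read advances the seek pointer
            System.out.println("read " + n + " bytes: " + new String(in, 0, n));

            f.setLength(5);               // like ftruncate(fd, 5)
        }                                 // close(fd) happens automatically here
    }
}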
The User Interface to Directories
We already talked about file names. One important feature that a file name should have is that it
be unambiguous: There should be at most one file with any given name. The symmetrical
condition, that there be at most one name for any given file, is not necessarily a good thing.
Sometimes it is handy to be able to give multiple names to a file. When we consider
implementation, we will describe two different ways to implement multiple names for a file,
each with slightly different semantics. If there are a lot of files in a system, it may be difficult to
avoid giving two files the same name, particularly if there are multiple users independently
making up names. One technique to assure uniqueness is to prefix each file name with the name
(or user id) of the owner. In some early operating systems, that was the only assistance the
system gave in preventing conflicts.
A better idea is the hierarchical directory structure, first introduced by Multics, then popularized
by UNIX, and now found in virtually every operating system. You probably already know about
hierarchical directories, but we would like to describe them from an unusual point of view, and
then explain how this point of view is equivalent to the more familiar version.
Each file is named by a sequence of names. Although all modern operating systems use this
technique, each uses a different character to separate the components of the sequence when
displaying it as a character string. Multics uses `>', UNIX uses `/', DOS and its descendants use
`\', and MacOS uses ':'. Sequences make it easy to avoid naming conflicts. First, assign a
sequence to each user and only let him create files with names that start with that sequence. For
example, we might be assigned the sequence (``usr'', ``solomon''), written in UNIX as
/usr/solomon. So far, this is the same as just appending the user name to each file name. But it
allows us to further classify our own files to prevent conflicts. When we start a new project, we
can create a new sequence by appending the name of the project to the end of the sequence
assigned to us, and then use this prefix for all files in the project. For example, we might choose
/usr/solomon/cs537 for files associated with this course, and name them /usr/solomon/cs537/foo,
/usr/solomon/cs537/bar, etc. As an extra aid, the system allows us to specify a ``default prefix''
and a short-hand for writing names that start with that prefix. In UNIX, we use the system call
chdir to specify a prefix, and whenever we use a name that does not start with `/', the system
automatically adds that prefix.
It is customary to think of the directory system as a directed graph, with names on the edges.
Each path in the graph is associated with a sequence of names, the names on the edges that make
up the path. For that reason, the sequence of names is usually called a path name. One node is
designated as the root node, and the rule is enforced that there cannot be two edges with the
same name coming out of one node. With this rule, we can use path names to name nodes. Start at
the root node and treat the path name as a sequence of directions, telling us which edge to follow
at each step. It may be impossible to follow the directions (because they tell us to use an edge
that does not exist), but if it is possible to follow them, they will lead us unambiguously to
one node. Thus path names can be used as unambiguous names for nodes. In fact, as we will see,
this is how the directory system is actually implemented. However, we think it is useful to think
of ``path names'' simply as long names used to avoid naming conflicts, since this view clearly
separates the interface from the implementation.
7.4 Implementing File Systems
Files
We will assume that all the blocks of the disk are given block numbers starting at zero and
running through consecutive integers up to some maximum. We will further assume that blocks
with numbers that are near each other are located physically near each other on the disk (e.g.,
same cylinder) so that the arithmetic difference between the numbers of two blocks gives a good
estimate of how long it takes to get from one to the other. First let's consider how to represent an
individual file. There are (at least!) four possibilities:
Contiguous. The blocks of a file are the blocks numbered n, n+1, n+2, ..., m. We can represent
any file with a pair of numbers: the block number of the first block and the length of the file (in
blocks). The advantages of this approach are that it is simple and that the blocks of the file are all
physically near each other on the disk and in order, so that a sequential scan through the file will
be fast.
The problem with this organization is that you can only grow a file if the block following the last
block in the file happens to be free. Otherwise, you would have to find a long enough run of free
blocks to accommodate the new length of the file and copy it. As a practical matter, operating
systems that use this organization require the maximum size of the file to be declared when it is
created and pre-allocate space for the whole file. Even then, storage allocation has all the
problems we considered when studying main-memory allocation including external
fragmentation.
Linked List. A file is represented by the block number of its first block, and each block contains
the block number of the next block of the file. This representation avoids the problems of the
contiguous representation: We can grow a file by linking any disk block onto the end of the list,
and there is no external fragmentation. However, it introduces a new problem: Random access is
effectively impossible. To find the 100th block of a file, we have to read the first 99 blocks just
to follow the list. We also lose the advantage of very fast sequential access to the file since its
blocks may be scattered all over the disk. However, if we are careful when choosing blocks to
add to a file, we can retain pretty good sequential access performance.
Both the space overhead (the percentage of the space taken up by pointers) and the time
overhead (the percentage of the time seeking from one place to another) can be decreased by
using larger blocks. The hardware designer fixes the block size (which is usually quite small) but
the software can get around this problem by using ``virtual'' blocks, sometimes called clusters.
The OS simply treats each group of (say) four contiguous physical disk sectors as one cluster.
Large clusters, particularly if they can be of variable size, are sometimes called extents. Extents can
be thought of as a compromise between linked and contiguous allocation.
Disk Index. The idea here is to keep the linked-list representation, but take the link fields out of
the blocks and gather them together all in one place. This approach is used in the ``FAT'' file
system of DOS, OS/2 and older versions of Windows. At some fixed place on disk, allocate an
array ``I'' with one element for each block on the disk, and move the link field from block n to
I[n]. The whole array of links, called a file allocation table (FAT), is now small enough that it can
be read into main memory when the system starts up. Accessing the 100th block of a file still
requires walking through 99 links of a linked list, but now the entire list is in memory, so time to
traverse it is negligible (recall that a single disk access takes as long as 10's or even 100's of
thousands of instructions). This representation has the added advantage of getting the ``operating
system'' stuff (the links) out of the pages of ``user data''. The pages of user data are now full-size
disk blocks, and lots of algorithms work better with chunks that are a power of two bytes long.
Also, it means that the OS can prevent users (who are notorious for screwing things up) from
getting their grubby hands on the system data.
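A minimal Java sketch of this representation follows (the field and method names are invented for illustration): the entire FAT lives in an in-memory array, and finding the Nth block of a file just walks N links through that array, with no disk accesses.

class FatTable {
    static final int FREE = 0;          // this block is unallocated
    static final int END_OF_FILE = -1;  // this block is the last block of its file

    private final int[] fat;            // fat[b] = number of the next block; read
                                        // into memory when the system starts up
    FatTable(int[] fat) { this.fat = fat; }

    // Return the disk block number of the n-th block (0-based) of the file
    // whose first block is firstBlock.
    int nthBlock(int firstBlock, int n) {
        int block = firstBlock;
        for (int i = 0; i < n; i++) {
            if (fat[block] == END_OF_FILE)
                throw new IllegalArgumentException("file has fewer than " + (n + 1) + " blocks");
            block = fat[block];         // follow the link, entirely in main memory
        }
        return block;
    }
}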
The main problem with this approach is that the index array can get quite large with modern
disks. For example, consider a 2 GB disk with 2K blocks. There are a million blocks, so a block
number must be at least 20 bits. Rounded up to an even number of bytes, that's 3 bytes--4 bytes if
we round up to a word boundary--so the array I is three or four megabytes. While that's not an
excessive amount of memory given today's RAM prices, if we can get along with less, there are
better uses for the memory.
File Index. Although a typical disk may contain tens of thousands of files, only a few of them are
open at any one time, and it is only necessary to keep index information about open files in
memory to get good performance. Unfortunately the whole-disk index described in the previous
paragraph mixes index information about all files for the whole disk together, making it difficult
to cache only information about open files. The inode structure introduced by UNIX groups
together index information about each file individually. The basic idea is to represent each file as
a tree of blocks, with the data blocks as leaves. Each internal block (called an indirect block in
UNIX jargon) is an array of block numbers, listing its children in order. If a disk block is 2K
bytes and a block number is four bytes, 512 block numbers fit in a block, so a one-level tree (a
single root node pointing directly to the leaves) can accommodate files up to 512 blocks, or one
megabyte in size. If the root node is cached in memory, the ``address'' (block number) of any
block of the file can be found without any disk accesses. A two-level tree, with 513 total indirect
blocks, can handle files 512 times as large (up to one-half gigabyte).
The only problem with this idea is that it wastes space for small files. Any file with more than
one block needs at least one indirect block to store its block numbers. A 4K file would require
three 2K blocks, wasting up to one third of its space. Since many files are quite small, this is a
serious problem. The UNIX solution is to use a different kind of ``block'' for the root of the tree.
An index node (or inode for short) contains almost all the meta-data about a file listed above:
ownership, permissions, time stamps, etc. (but not the file name). Inodes are small enough that
several of them can be packed into one disk block. In addition to the meta-data, an inode
contains the block numbers of the first few blocks of the file. What if the file is too big to fit all
its block numbers into the inode? The earliest version of UNIX had a bit in the meta-data to
indicate whether the file was ``small'' or ``big.'' For a big file, the inode contained the block
numbers of indirect blocks rather than data blocks. More recent versions of UNIX contain
pointers to indirect blocks in addition to the pointers to the first few data blocks. The inode
contains pointers to (i.e., block numbers of) the first few blocks of the file, a pointer to an
indirect block containing pointers to the next several blocks of the file, a pointer to a doubly
indirect block, which is the root of a two-level tree whose leaves are the next blocks of the file,
and a pointer to a triply indirect block. A large file is thus a lop-sided tree.
A real-life example is given by the Solaris 2.5 version of UNIX. Block numbers are four bytes
and the size of a block is a parameter stored in the file system itself, typically 8K (8192 bytes),
so 2048 pointers fit in one block. An inode has direct pointers to the first 12 blocks of the file, as
well as pointers to singly, doubly, and triply indirect blocks. A file of up to 12+2048+2048*2048
= 4,196,364 blocks or 34,376,613,888 bytes (about 32 GB) can be represented without using
triply indirect blocks, and with the triply indirect block, the maximum file size is
(12+2048+2048*2048+2048*2048*2048)*8192 = 70,403,120,791,552 bytes (slightly more than
2^46 bytes, or about 64 terabytes). Of course, for such huge files, the size of the file cannot be
represented as a 32-bit integer. Modern versions of UNIX store the file length as a 64-bit integer,
called a ``long'' integer in Java. An inode is 128 bytes long, allowing room for the 15 block
pointers plus lots of meta-data. 64 inodes fit in one disk block. Since the inode for a file is kept in
memory while the file is open, locating an arbitrary block of any file requires at most three I/O
operations, not counting the operation to read or write the data block itself.
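The tree walk can be sketched in Java as follows (a minimal model with invented names such as Inode and readPointer; the constants match the example above: 12 direct pointers in the inode and 2048 block numbers per indirect block). Given a file-relative block number, it returns the corresponding disk block number, descending through at most three levels of indirect blocks.

class InodeLookup {
    static final int DIRECT = 12;              // direct pointers in the inode
    static final int PER_BLOCK = 2048;         // block numbers per indirect block

    interface Disk {
        // Read indirect block `blockNo` and return the index-th pointer in it.
        long readPointer(long blockNo, int index);
    }

    static class Inode {
        long[] direct = new long[DIRECT];      // first 12 blocks of the file
        long singleIndirect, doubleIndirect, tripleIndirect;
    }

    // Translate a file-relative block number into a disk block number.
    static long blockNumber(Inode inode, long n, Disk disk) {
        if (n < DIRECT)
            return inode.direct[(int) n];
        n -= DIRECT;
        if (n < PER_BLOCK)                                     // one level
            return disk.readPointer(inode.singleIndirect, (int) n);
        n -= PER_BLOCK;
        if (n < (long) PER_BLOCK * PER_BLOCK) {                // two levels
            long mid = disk.readPointer(inode.doubleIndirect, (int) (n / PER_BLOCK));
            return disk.readPointer(mid, (int) (n % PER_BLOCK));
        }
        n -= (long) PER_BLOCK * PER_BLOCK;                     // three levels
        long lvl2 = disk.readPointer(inode.tripleIndirect,
                (int) (n / ((long) PER_BLOCK * PER_BLOCK)));
        long lvl1 = disk.readPointer(lvl2, (int) ((n / PER_BLOCK) % PER_BLOCK));
        return disk.readPointer(lvl1, (int) (n % PER_BLOCK));
    }
}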
Directories
A directory is simply a table mapping human-readable character-string names to
information about files. The early PC operating system CP/M shows how simple a directory can
be. Each entry contains the name of one file, its owner, size (in blocks) and the block numbers of
16 blocks of the file. To represent files with more than 16 blocks, CP/M used multiple directory
entries with the same name and different values in a field called the extent number. CP/M had
only one directory for the entire system.
DOS uses a similar directory entry format, but stores only the first block number of the file in the
directory entry. The entire file is represented as a linked list of blocks using the disk index
scheme described above. All but the earliest version of DOS provide hierarchical directories
using a scheme similar to the one used in UNIX.
UNIX has an even simpler directory format. A directory entry contains only two fields: a
character-string name (up to 14 characters) and a two-byte integer called an inumber, which is
interpreted as an index into an array of inodes in a fixed, known location on disk. All the
remaining information about the file (size, ownership, time stamps, permissions, and an index to
the blocks of the file) is stored in the inode rather than the directory entry. A directory is
represented like any other file (there's a bit in the inode to indicate that the file is a directory).
Thus the inumber in a directory entry may designate a ``regular'' file or another directory,
allowing arbitrary graphs of nodes. However, UNIX carefully limits the set of operating system
calls to ensure that the set of directories is always a tree. The root of the tree is the file with
inumber 1 (some versions of UNIX use other conventions for designating the root directory).
The entries in each directory point to its children in the tree. For convenience, each directory also has
two special entries: an entry with name ``..'', which points to the parent of the directory in the tree,
and an entry with name ``.'', which points to the directory itself. Inumber 0 is not used, so an
entry is marked ``unused'' by setting its inumber field to 0. The algorithm to convert from a path
name to an inumber might be written in Java as follows.
int namei(int current, String[] path) {
    for (int i = 0; i < path.length; i++) {
        if (inode[current].type != DIRECTORY)
            throw new Exception("not a directory");
        current = nameToInumber(inode[current], path[i]);
        if (current == 0)
            throw new Exception("no such file or directory");
    }
    return current;
}
The procedure nameToInumber(Inode node, String name) (not shown) reads through the
directory file represented by the inode node, looks for an entry matching the given name and
returns the inumber contained in that entry. The procedure namei walks the directory tree,
starting at a given inode and following a path described by a sequence of strings. There is a
procedure with this name in the UNIX kernel. Files are always specified in UNIX system calls
by a character-string path name. You can learn the inumber of a file if you like, but you can't use
the inumber when talking to the UNIX kernel. Each system call that has a path name as an
argument uses namei to translate it to an inumber. If the argument is an absolute path name (it
starts with `/'), namei is called with current == 1. Otherwise, current is the current working
directory.
Since all the information about a file except its name is stored in the inode, there can be more
than one directory entry designating the same file. This allows multiple aliases (called links) for a
file. UNIX provides a system call link (old-name, new-name) to create new names for existing
files. The call link ("/a/b/c", "/d/e/f") works something like this:
if (namei(1, parse("/d/e/f")) != 0)
    throw new Exception("file already exists");
int dir = namei(1, parse("/d/e"));
if (dir == 0 || inode[dir].type != DIRECTORY)
    throw new Exception("not a directory");
int target = namei(1, parse("/a/b/c"));
if (target == 0)
    throw new Exception("no such file or directory");
if (inode[target].type == DIRECTORY)
    throw new Exception("cannot link to a directory");
addDirectoryEntry(inode[dir], target, "f");
The procedure parse (not shown here) is assumed to break up a path name into its components.
If, for example, /a/b/c resolves to inumber 123, the entry (123, "f") is added to the directory file
designated by "/d/e". The result is that both "/a/b/c" and "/d/e/f" resolve to the same file (the one
with inumber 123).
We have seen that a file can have more than one name. What happens if it has no names (does
not appear in any directory)? Since the only way to name a file in a system call is by a path
name, such a file would be useless. It would consume resources (the inode and probably some
data and indirect blocks) but there would be no way to read it, write to it, or even delete it. UNIX
protects against this ``garbage collection'' problem by using reference counts. Each inode
contains a count of the number of directory entries that point to it. ``User'' programs are not
allowed to update directories directly. System calls that add or remove directory entries (creat,
link, mkdir, rmdir, etc) update these reference counts appropriately. There is no system call to
delete a file, only the system call unlink(name) which removes the directory entry corresponding
to name. If the reference count of an inode drops to zero, the system automatically deletes the
file and returns all of its blocks to the free list.
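Here is a minimal Java sketch of the reference-counting discipline (the field and method names are invented for illustration): link increments the count of an existing inode, unlink removes one directory entry and decrements the count, and the inode and its blocks are reclaimed only when the count reaches zero.

class InodeRefCounting {
    static class Inode {
        int linkCount;            // number of directory entries pointing here
        boolean allocated = true;
    }

    // Called by the link(oldname, newname) system call after the new
    // directory entry has been added.
    static void link(Inode inode) {
        inode.linkCount++;
    }

    // Called by unlink(name) after the directory entry has been removed.
    static void unlink(Inode inode) {
        inode.linkCount--;
        if (inode.linkCount == 0) {
            // No name refers to this file any more: free its data blocks
            // and the inode itself (block freeing omitted in this sketch).
            inode.allocated = false;
        }
    }
}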
We saw before that the reference counting algorithm for garbage collection has a fatal flaw: If
there are cycles, reference counting will fail to collect some garbage. UNIX avoids this problem
by making sure cycles cannot happen. The system calls are designed so that the set of directories
will always be a single tree rooted at inode 1: mkdir creates a new directory (empty except for the . and ..
entries) as a leaf of the tree, rmdir is only allowed to delete a directory that is empty (except for
the . and .. entries), and link is not allowed to link to a directory. Because links to directories are
not allowed, the only place the file system is not a tree is at the leaves (regular files) and that
cannot introduce cycles.
Although this algorithm provides the ability to create aliases for files in a simple and secure
manner, it has several flaws:
• It's hard to figure out how to charge users for disk space. Ownership is associated with
the file, not the directory entry (the owner's id is stored in the inode). A file cannot be
deleted without finding all the links to it and deleting them. If we create a file and you
make a link to it, we will continue to be charged for it even if we try to remove it through
our original name for it. Worse still, your link may be in a directory we don't have access
to, so we may be unable to delete the file, even though we are being charged for its space.
Indeed, you could make it much bigger after we have no access to it.
• There is no way to make an alias for a directory.
• As we will see later, links cannot cross boundaries of physical disks.
• Since all aliases are equal, there's no one ``true name'' for a file. You can find out whether
two path names designate the same file by comparing inumbers. There is a system call to
get the meta-data about a file, and the inumber is included in that information. But there
is no way of going in the other direction: to get a path name for a file given its inumber,
or to find a path name of an open file. Even if you remember the path name used to get to
the file, that is not a reliable ``handle'' to the file (for example, to link two files together by
storing the name of one in the other). One of the components of the path name could be
removed, thus invalidating the name even though the file still exists under a different
name.
While it's not possible to find the name (or any name) of an arbitrary file, it is possible to figure
out the name of a directory. Directories do have unique names because the directories form a
tree, and one of the properties of a tree is that there is a unique path from the root to any node.
The ``..'' and ``.'' entries in each directory make this possible. Here, for example, is code to find
the name of the current working directory.
class DirectoryEntry {
    int inumber;
    String name;
}

String cwd() {
    FileInputStream thisDir = new FileInputStream(".");
    int thisInumber = nameToInumber(thisDir, ".");
    return getPath(".", thisInumber);
}

String getPath(String currentName, int currentInumber) {
    String parentName = currentName + "/..";
    FileInputStream parent = new FileInputStream(parentName);
    int parentInumber = nameToInumber(parent, ".");
    String fname = inumberToName(parent, currentInumber);
    if (parentInumber == 1)
        return "/" + fname;
    else
        return getPath(parentName, parentInumber) + "/" + fname;
}
The procedure nameToInumber is similar to the procedure with the same name described above,
but takes an InputStream as an argument rather than an inode. Many versions of UNIX allow a
program to open a directory for reading and read its contents just like any other file. In such
systems, it would be easy to write nameToInumber as a user-level procedure if you know the
format of a directory. The procedure inumberToName is similar, but searches for an entry
containing a particular inumber and returns the name field of the entry.
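For concreteness, here is a minimal sketch of how nameToInumber and inumberToName might be written as user-level code on a system that lets a program read a directory like an ordinary file. It assumes the original 16-byte directory entry format described later in this chapter (a 2-byte inumber followed by a 14-byte null-padded name); the class name, the use of DataInputStream, and the big-endian byte order are illustrative assumptions, not the actual UNIX code.

import java.io.*;

// Sketch of the two helpers used by cwd()/getPath(), assuming 16-byte entries:
// a 2-byte inumber followed by a 14-byte null-padded name.
class DirHelpers {
    static int nameToInumber(InputStream dir, String name) throws IOException {
        DataInputStream in = new DataInputStream(dir);
        byte[] nameBytes = new byte[14];
        try {
            while (true) {
                int inumber = in.readUnsignedShort();   // 2-byte inumber (byte order assumed)
                in.readFully(nameBytes);                // 14-byte name field
                if (inumber != 0 && name.equals(new String(nameBytes).trim()))
                    return inumber;
            }
        } catch (EOFException e) {
            return 0;                                   // not found
        }
    }

    static String inumberToName(InputStream dir, int target) throws IOException {
        DataInputStream in = new DataInputStream(dir);
        byte[] nameBytes = new byte[14];
        try {
            while (true) {
                int inumber = in.readUnsignedShort();
                in.readFully(nameBytes);
                if (inumber == target)
                    return new String(nameBytes).trim();
            }
        } catch (EOFException e) {
            return null;                                // not found
        }
    }
}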
Symbolic Links
To get around the limitations with the original UNIX notion of links, more recent versions of
UNIX introduced the notion of a symbolic link (to avoid confusion, the original kind of link,
described in the previous section, is sometimes called a hard link). A symbolic link is a new type
of file, distinguished by a code in the inode from directories, regular files, etc. When the namei
procedure that translates path names to inumbers encounters a symlink, it treats the contents of
the file as a pathname and uses it to continue the translation. If the contents of the file is a
relative path name (it does not start with a slash), it is interpreted relative to the directory
containing the link itself, not the current working directory of the process doing the lookup.
int namei(int current, String[] path) {
    for (int i = 0; i < path.length; i++) {
        if (inode[current].type != DIRECTORY)
            throw new Exception("not a directory");
        int parent = current;            // directory containing the entry we are about to look up
        current = nameToInumber(inode[parent], path[i]);
        if (current == 0)
            throw new Exception("no such file or directory");
        while (inode[current].type == SYMLINK) {
            String link = getContents(inode[current]);
            String[] linkPath = parse(link);
            if (link.charAt(0) == '/')
                current = namei(1, linkPath);        // absolute link: restart at the root
            else
                current = namei(parent, linkPath);   // relative link: start at the containing directory
            if (current == 0)
                throw new Exception("no such file or directory");
        }
    }
    return current;
}
The main changes from the previous version of this procedure are the addition of the while loop
and a local variable that remembers the directory containing the entry, so that a relative link can
be resolved against it. Any time the procedure encounters a node of type SYMLINK, it recursively
calls itself to translate the contents of the file, interpreted as a path name, into an inumber.
7.5 Implementation
Although the implementation looks complicated, it does just what you would expect in normal
situations. For example, suppose there is an existing file named /a/b/c and an existing directory
/d. Then the command ln -s /a/b /d/e makes the path name /d/e a synonym for /a/b, and also
makes /d/e/c a synonym for /a/b/c. From the user's point of view, /d/e simply looks like another
name for /a/b. In implementation terms, /d/e is a separate inode of type symlink whose contents
are the path name /a/b.
Here's a more elaborate example that illustrates symlinks with relative path names. Suppose we
have an existing directory /usr/solomon/cs537/s90 with various sub-directories and we are
setting up project 5 for this semester. We might do something like the following commands; the
resulting logical and physical links are shown in the figures below. All three of the cat commands
refer to the same file.
cd /usr/solomon/cs537
mkdir f96
cd f96
ln -s ../s90/proj5 proj5.old
cat proj5.old/foo.c
cd /usr/solomon/cs537
cat f96/proj5.old/foo.c
cat s90/proj5/foo.c
Figure: logical links resulting from the commands above.
Figure: physical links resulting from the commands above.
The added flexibility of symlinks over hard links comes at the cost of reduced safety. Symlinks
are neither required nor guaranteed to point to valid files. You can remove a file out from under a
symlink, and in fact, you can create a symlink to a non-existent file. Symlinks can also have
cycles. For example, this works fine:
cd /usr/solomon
mkdir bar
ln -s /usr/solomon foo
ls /usr/solomon/foo/foo/foo/foo/bar
However, in some cases, symlinks can cause infinite loops or infinite recursion in the namei
procedure. The real version in UNIX puts a limit on how many times it will iterate and returns an
error code of ``too many links'' if the limit is exceeded. Symlinks to directories can also cause the
``change directory'' command cd to behave in strange ways. Most people expect the two commands
cd foo
cd ..
to cancel each other out. But in the last example, the commands
cd /usr/solomon
cd foo
cd ..
would leave you in the directory /usr. Some shell programs treat cd specially and remember what
alias you used to get to the current directory. After cd /usr/solomon; cd foo; cd foo, the current
directory is /usr/solomon/foo/foo, which is an alias for /usr/solomon, but the command cd .. is
treated as if you had typed cd /usr/solomon/foo.
Mounting
What if your computer has more than one disk? In many operating systems (including DOS and
its descendants) a pathname starts with a device name, as in C:\usr\solomon (by convention, C is
the name of the default hard disk). If you leave the device prefix off a path name, the system
supplies a default current device similar to the current directory. UNIX allows you to glue
together the directory trees of multiple disks to create a single unified tree. There is a system call
mount(device, mount_point)
where device names a particular disk drive and mount_point is the path name of an existing node
in the current directory tree (normally an empty directory). The result is similar to a hard link:
The mount point becomes an alias for the root directory of the indicated disk. Here's how it
works: The kernel maintains a table of existing mounts represented as (device1, inumber,
device2) triples. During namei, whenever the current (device, inumber) pair matches the first two
fields in one of the entries, the current device and inumber become device2 and 1, respectively.
Here's the expanded code:
int namei(int curi, int curdev, String[] path) {
    for (int i = 0; i < path.length; i++) {
        if (disk[curdev].inode[curi].type != DIRECTORY)
            throw new Exception("not a directory");
        int parent = curi;               // directory containing the entry we are about to look up
        curi = nameToInumber(disk[curdev].inode[parent], path[i]);
        if (curi == 0)
            throw new Exception("no such file or directory");
        while (disk[curdev].inode[curi].type == SYMLINK) {
            String link = getContents(disk[curdev].inode[curi]);
            String[] linkPath = parse(link);
            if (link.charAt(0) == '/')
                curi = namei(1, curdev, linkPath);       // absolute link: restart at the root
            else
                curi = namei(parent, curdev, linkPath);  // relative link: start at the containing directory
            if (curi == 0)
                throw new Exception("no such file or directory");
        }
        int newdev = mountLookup(curdev, curi);
        if (newdev != -1) {
            curdev = newdev;
            curi = 1;
        }
    }
    return curi;
}
In this code, we assume that mountLookup searches the mount table for a matching entry,
returning -1 if no matching entry is found. There is also a special case (not shown here) for ``..''
so that the ``..'' entry in the root directory of a mounted disk behaves like a pointer to the parent
directory of the mount point.
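As an illustration, here is a minimal sketch of the mount table and the mountLookup helper assumed by the code above. The MountEntry class, the fixed-size table, and the method names are our own choices for the example; a real kernel uses its own data structures and locking.

// Sketch of the table of (device1, inumber, device2) triples described above.
class MountEntry {
    int dev1;        // device holding the mount point
    int inumber;     // inumber of the mount point directory on dev1
    int dev2;        // device whose root is mounted there
}

class MountTable {
    static MountEntry[] mounts = new MountEntry[16];

    // Record that the root of device dev2 is mounted on (dev1, inumber).
    static void mount(int dev1, int inumber, int dev2) {
        for (int i = 0; i < mounts.length; i++) {
            if (mounts[i] == null) {
                MountEntry e = new MountEntry();
                e.dev1 = dev1; e.inumber = inumber; e.dev2 = dev2;
                mounts[i] = e;
                return;
            }
        }
    }

    // Return the mounted device if (dev, inumber) is a mount point, else -1.
    static int mountLookup(int dev, int inumber) {
        for (MountEntry e : mounts)
            if (e != null && e.dev1 == dev && e.inumber == inumber)
                return e.dev2;
        return -1;
    }
}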
The Network File System (NFS) from Sun Microsystems extends this idea to allow you to mount
a disk from a remote computer. The device argument to the mount system call names the remote
computer as well as the disk drive and both pieces of information are put into the mount table.
Now there are three pieces of information to define the ``current directory'': the inumber, the
device, and the computer. If the current computer is remote, all operations (read, write, create,
delete, mkdir, rmdir, etc.) are sent as messages to the remote computer. Information about
remote open files, including a seek pointer and the identity of the remote machine, is kept
locally. Each read or write operation is converted locally to one or more requests to read or write
blocks of the remote file. NFS caches blocks of remote files locally to improve performance.
Special Files
We said that the UNIX mount system call has the name of a disk device as an argument. How do
you name a device? The answer is that devices appear in the directory tree as special files. An
inode whose type is ``special'' (as opposed to ``directory,'' ``symlink,'' or ``regular'') represents
some sort of I/O device. It is customary to put special files in the directory /dev, but since it is the
inode that is marked ``special,'' they can be anywhere. Instead of containing pointers to disk
blocks, the inode of a special file contains information (in a machine-dependent format) about
the device. The operating system tries to make the device look as much like a file as possible, so
that ordinary programs can open, close, read, or write the device just like a file.
Some devices look more like real files than others. A disk device looks exactly like a file. Reads
return whatever is on the disk and writes can scribble anywhere on the disk. For obvious security
reasons, the permissions for the raw disk devices are highly restrictive. A tape drive looks sort of
like a disk, but a read will return only the next physical block of data on the device, even if more
is requested.
The special file /dev/tty represents the terminal. Writes to /dev/tty display characters on the
screen. Reads from /dev/tty return characters typed on the keyboard. The seek operation on a
device like /dev/tty updates the seek pointer, but the seek pointer has no effect on reads or writes.
Reads of /dev/tty are also different from reads of a file in that they may return fewer bytes than
requested: Normally, a read will return characters only up through the next end-of-line. If the
number of bytes requested is less than the length of the line, the next read will get the remaining
bytes. A read call will block the caller until at least one character can be returned. On machines
with more than one terminal, there are multiple terminal devices with names like /dev/tty0,
/dev/tty1, etc.
Some devices, such as a mouse, are read-only. Write operations on such devices have no effect.
Other devices, such as printers, are write-only. Attempts to read from them give an end-of-file
indication (a return value of zero). There is a special file called /dev/null that does nothing at all:
reads return end-of-file and writes send their data to the garbage bin. (New EPA rules require
that this data be recycled. It is now used to generate federal regulations and other meaningless
documents.) One particularly interesting device is /dev/mem, which is an image of the memory
space of the current process. In a sense, this device is the exact opposite of memory-mapped
files. Instead of making a file look like part of virtual memory, it makes virtual memory look like
a device.
This idea of making all sorts of things look like files can be very powerful. Some versions of
UNIX make network connections look like files. Some versions have a directory with one
special file for each active process. You can read these files to get information about the states of
processes. If you delete one of these files, the corresponding process is killed. Another idea is to
have a directory with one special file for each print job waiting to be printed. Although this idea
was pioneered by UNIX, it is starting to show up more and more in other operating systems.
Long File Names
The UNIX implementation described previously allows arbitrarily long path names for files,
but each component is limited in length. In the original UNIX implementation, each directory
entry is 16 bytes long: two bytes for the inumber and 14 bytes for a path name component.
class Dirent {
    public short inumber;
    public byte name[14];
}
If the name is less than 14 characters long, trailing bytes are filled with nulls (bytes with all bits
set to zero--not to be confused with `0' characters). An inumber of zero is used to mark an entry
as unused (inumbers for files start at 1).
• To look up a name, search the whole directory, starting at the beginning.
• To ``remove'' an entry, set its inumber field to zero.
• To add an entry, search for an entry with a zero inumber field and re-use it. If there aren't any, add an entry to the end (making the file 16 bytes bigger). A sketch of these three operations appears below.
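Here is a minimal sketch of those three operations, treating the directory's contents as a byte array of 16-byte entries. The helper class, the big-endian byte order, and the array-growing strategy are assumptions made for illustration, not the actual UNIX code.

// Sketch of lookup, remove, and add over the original 16-byte directory entries.
class OldDirOps {
    static final int ENTRY = 16;

    static int lookup(byte[] dir, String name) {
        for (int off = 0; off + ENTRY <= dir.length; off += ENTRY) {
            int inumber = ((dir[off] & 0xff) << 8) | (dir[off + 1] & 0xff);
            if (inumber != 0 && name.equals(new String(dir, off + 2, 14).trim()))
                return inumber;
        }
        return 0;                                     // not found
    }

    static void remove(byte[] dir, String name) {
        for (int off = 0; off + ENTRY <= dir.length; off += ENTRY) {
            int inumber = ((dir[off] & 0xff) << 8) | (dir[off + 1] & 0xff);
            if (inumber != 0 && name.equals(new String(dir, off + 2, 14).trim())) {
                dir[off] = dir[off + 1] = 0;          // mark the slot unused
                return;
            }
        }
    }

    // Returns the (possibly reallocated) directory contents.
    static byte[] add(byte[] dir, String name, int inumber) {
        int off;
        for (off = 0; off + ENTRY <= dir.length; off += ENTRY) {
            int existing = ((dir[off] & 0xff) << 8) | (dir[off + 1] & 0xff);
            if (existing == 0)
                break;                                // re-use a free slot
        }
        if (off + ENTRY > dir.length)                 // no free slot: grow by one entry
            dir = java.util.Arrays.copyOf(dir, dir.length + ENTRY);
        dir[off] = (byte) (inumber >> 8);
        dir[off + 1] = (byte) inumber;
        byte[] nameBytes = name.getBytes();
        for (int i = 0; i < 14; i++)                  // null-pad the 14-byte name field
            dir[off + 2 + i] = i < nameBytes.length ? nameBytes[i] : 0;
        return dir;
    }
}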
This representation has one advantage:
• It is very simple. In particular, space allocation is easy because all entries are the same length.
However, it has several disadvantages:
• Since an inumber is only 16 bits, there can be at most 65,535 files on any one disk.
• A file name can be at most 14 characters long.
• Directories grow, but they never shrink.
• Searching a very large directory can be slow.
The people at Berkeley, while they were rewriting the file system code to make it faster, also
changed the format of directories to get rid of the first two problems (they left the remaining
problems unfixed). This new organization has been adopted by many (but not all) versions of
UNIX introduced since then.
The new format of a directory entry looks like this:
class DirentLong {
    int inumber;
    short reclen;
    short namelen;
    byte name[];
}
The inumber field is now a 4-byte (32-bit) integer, so that a disk can have up to 4,294,967,296
files. The reclen field indicates the entire length of the DirentLong entry, including the 8-byte
header. The actual length of the name array is thus reclen - 8 bytes. The namelen field indicates
the length of the name. The remaining space in the name array is unused. This extra padding at
the end of the entry serves three purposes.
• It allows the length of the entry to be padded up to a multiple of 4 bytes so that the integer fields are properly aligned (some computer architectures require integers to be stored at addresses that are multiples of 4).
• The last entry in a disk block can be padded to make it extend to the end of the block. With this trick, UNIX avoids entries that cross block boundaries, simplifying the code.
• It supports a cute trick for coalescing free space. To delete an entry, simply increase the size of the previous entry by the size of the entry being deleted. The deleted entry looks like part of the padding on the end of the previous entry. Since all searches of the directory are done sequentially, starting at the beginning, the deleted entry will effectively ``disappear.'' There's only one problem with this trick: It can't be used to delete the first entry in the directory. Fortunately, the first entry is the `.' entry, which is never deleted.
To create a new entry, search the directory for an entry that has enough padding (according to its reclen and namelen fields) to hold the new entry and split it into two entries by decreasing its reclen field. If no entry with enough padding is found, extend the directory file by one block, make the whole block into one entry, and try again. A sketch of the delete and create operations appears below.
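To make the coalescing trick concrete, here is a minimal sketch of the delete and create operations, assuming the directory's data is available as a byte array of DirentLong records with a 4-byte big-endian inumber followed by 2-byte reclen and namelen fields. The helper names and layout are our own assumptions for illustration, not the actual UNIX code.

// Sketch of deleting and creating DirentLong entries inside a directory's data.
class LongDirOps {
    static int readShort(byte[] b, int off) {
        return ((b[off] & 0xff) << 8) | (b[off + 1] & 0xff);
    }
    static void writeShort(byte[] b, int off, int v) {
        b[off] = (byte) (v >> 8); b[off + 1] = (byte) v;
    }

    // Delete the entry starting at 'off' by folding it into the previous
    // entry's padding (cannot be used for the first entry, which is ".").
    static void delete(byte[] dir, int prevOff, int off) {
        int prevLen = readShort(dir, prevOff + 4);
        int len = readShort(dir, off + 4);
        writeShort(dir, prevOff + 4, prevLen + len);
    }

    // Try to create an entry for 'name' by splitting an entry that has enough
    // padding; returns true on success, false if the caller must grow the file.
    static boolean create(byte[] dir, int dirLen, int inumber, String name) {
        int need = (8 + name.length() + 3) & ~3;        // new entry size, padded to 4 bytes
        for (int off = 0; off < dirLen; ) {
            int reclen = readShort(dir, off + 4);
            int namelen = readShort(dir, off + 6);
            int used = (8 + namelen + 3) & ~3;          // space this entry really needs
            if (reclen - used >= need) {
                writeShort(dir, off + 4, used);         // shrink the old entry
                int newOff = off + used;
                dir[newOff]     = (byte) (inumber >> 24);
                dir[newOff + 1] = (byte) (inumber >> 16);
                dir[newOff + 2] = (byte) (inumber >> 8);
                dir[newOff + 3] = (byte) inumber;
                writeShort(dir, newOff + 4, reclen - used);  // new entry absorbs the leftover padding
                writeShort(dir, newOff + 6, name.length());
                byte[] nb = name.getBytes();
                System.arraycopy(nb, 0, dir, newOff + 8, nb.length);
                return true;
            }
            off += reclen;
        }
        return false;                                   // no room: extend the directory by a block
    }
}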
This approach has two very minor additional benefits over the old scheme. In the old scheme,
every entry is 16 bytes, even if the name is only one byte long. In the new scheme, a name uses
only as much space as it needs (although this doesn't save much, since the minimum size of an
entry in the new scheme is 9 bytes--12 if padding is used to align entries to integer boundaries).
The new approach also allows nulls to appear in file names, but other parts of the system make
that impractical, and besides, who cares?
Block Size and Extents
All of the file organizations we've mentioned store the contents of a file in a set of disk blocks.
How big should a block be? The problem with small blocks is I/O overhead. There is a certain
overhead to read or write a block beyond the time to actually transfer the bytes. If we double the
block size, a typical file will have half as many blocks. Reading or writing the whole file will
transfer the same amount of data, but it will involve half as many disk I/O operations. The
overhead for an I/O operation includes a variable amount of latency (seek time and rotational
delay) that depends on how close the blocks are to each other, as well as a fixed overhead to start
each operation and respond to the interrupt when it completes.
Many years ago, researchers at the University of California at Berkeley studied the original
UNIX file system. They found that when they tried reading or writing a single very large file
sequentially, they were getting only about 2% of the potential speed of the disk. In other words,
it took about 50 times as long to read the whole file as it would if they simply read that many
sequential blocks directly from the raw disk (with no file system software). They tried doubling
the block size (from 512 bytes to 1K) and the performance more than doubled! The reason the
speed more than doubled was that it took less than half as many I/O operations to read the file.
Because the blocks were twice as large, twice as much of the file's data was in blocks pointed to
directly by the inode. Indirect blocks were twice as large as well, so they could hold twice as
many pointers. Thus four times as much data could be accessed through the singly indirect block
without resorting to the doubly indirect block.
If doubling the block size more than doubled performance, why stop there? Why didn't the
Berkeley folks make the blocks even bigger? The problem with big blocks is internal
fragmentation. A file can only grow in increments of whole blocks. If the sizes of files are
random, we would expect on the average that half of the last block of a file is wasted. If most
files are many blocks long, the relative amount of waste is small, but if the block size is large
compared to the size of a typical file, half a block per file is significant. In fact, if files are very
small (compared to the block size), the problem is even worse. If, for example, we choose a
block size of 8k and the average file is only 1K bytes long, we would be wasting about 7/8 of the
disk.
Most files in a typical UNIX system are very small. The Berkeley researchers made a list of the
sizes of all files on a typical disk and did some calculations of how much space would be wasted
by various block sizes. Simply rounding the size of each file up to a multiple of 512 bytes
resulted in wasting 4.2% of the space. Including overhead for inodes and indirect blocks, the
original 512-byte file system had a total space overhead of 6.9%. Changing to 1K blocks raised
the overhead to 11.8%. With 2k blocks, the overhead would be 22.4% and with 4k blocks it
would be 45.6%. Would 4k blocks be worthwhile? The answer depends on economics. In those
days disks were very expensive, and wasting half the disk seemed extreme. These days, disks
are cheap, and for many applications people would be happy to pay twice as much per byte of
disk space to get a disk that was twice as fast.
But there's more to the story. The Berkeley researchers came up with the idea of breaking up the
disk into blocks and fragments. For example, they might use a block size of 2k and a fragment
size of 512 bytes. Each file is stored in some number of whole blocks plus 0 to 3 fragments at the
end. The fragments at the end of one file can share a block with fragments of other files. The
problem is that when we want to append to a file, there may not be any space left in the block
that holds its last fragment. In that case, the Berkeley file system copies the fragments to a new
(empty) block. A file that grows a little at a time may require each of its fragments to be copied
many times. They got around this problem by modifying application programs to buffer their
data internally and add it to a file a whole block's worth at a time. In fact, most programs already
used library routines to buffer their output (to cut down on the number of system calls), so all
they had to do was to modify those library routines to use a larger buffer size. This approach has
been adopted by many modern variants of UNIX. The Solaris system you are using for this
course uses 8k blocks and 1K fragments.
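As a small illustration of how blocks and fragments combine, the following sketch computes how many whole blocks and trailing fragments a file of a given size would occupy under the 8K-block, 1K-fragment layout mentioned above. The class name and the example sizes are arbitrary.

// Illustrative only: blocks and fragments needed for a file of a given size.
class FragmentMath {
    static final int BLOCK = 8 * 1024;
    static final int FRAG  = 1024;

    public static void main(String[] args) {
        long[] sizes = { 500, 3_000, 20_000, 100_000 };   // example file sizes in bytes
        for (long size : sizes) {
            long blocks = size / BLOCK;                   // whole blocks
            long tail = size - blocks * BLOCK;
            long frags = (tail + FRAG - 1) / FRAG;        // fragments for the tail, rounded up
            long allocated = blocks * BLOCK + frags * FRAG;
            System.out.printf("%7d bytes -> %d blocks + %d fragments (%d bytes allocated)%n",
                              size, blocks, frags, allocated);
        }
    }
}

For example, a 20,000-byte file occupies 2 whole blocks plus 4 fragments, for 20,480 bytes allocated, instead of the 24,576 bytes that three whole 8K blocks would take.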
As disks get cheaper and CPU's get faster, wasted space is less of a problem and the speed
mismatch between the CPU and the disk gets worse. Thus the trend is towards larger and larger
disk blocks.
At first glance it would appear that the OS designer has no say in how big a block is. Any
particular disk drive has a sector size, usually 512 bytes, wired in. But it is possible to use larger
``blocks''. For example, if we think it would be a good idea to use 2K blocks, we can group
together each run of four consecutive sectors and call it a block. In fact, it would even be
possible to use variable-sized ``blocks,'' so long as each one is a multiple of the sector size. A
variable-sized ``block'' is called an extent. When extents are used, they are usually used in
addition to multi-sector blocks. For example, a system may use 2k blocks, each consisting of 4
consecutive sectors, and then group them into extents of 1 to 10 blocks. When a file is opened for
writing, it grows by adding an extent at a time. When it is closed, the unused blocks at the end of
the last extent are returned to the system. The problem with extents is that they introduce all the
problems of external fragmentation that we saw in the context of main memory allocation.
Extents are generally only used in systems such as databases, where high-speed access to very
large files is important.
Free Space
We have seen how to keep track of the blocks in each file. How do we keep track of the free
blocks--blocks that are not in any file? There are two basic approaches.
• Use a bit vector. That is simply an array of bits with one bit for each block on the disk. A 1 bit indicates that the corresponding block is allocated (in some file) and a 0 bit says that it is free. To allocate a block, search the bit vector for a zero bit, and set it to one.
• Use a free list. The simplest approach is simply to link together the free blocks by storing the block number of each free block in the previous free block. The problem with this approach is that when a block on the free list is allocated, you have to read it into memory to get the block number of the next block in the list. This problem can be solved by storing the block numbers of additional free blocks in each block on the list. In other words, the free blocks are stored in a sort of lopsided tree on disk. If, for example, 128 block numbers fit in a block, 1/128 of the free blocks would be linked into a list. Each block on the list would contain a pointer to the next block on the list, as well as pointers to 127 additional free blocks. When the first block of the list is allocated to a file, it has to be read into memory to get the block numbers stored in it, but then we can allocate 127 more blocks without reading any of them from disk. Freeing blocks is done by running this algorithm in reverse: Keep a cache of 127 block numbers in memory. When a block is freed, add its block number to this cache. If the cache is full when a block is freed, use the block being freed to hold all the block numbers in the cache and link it to the head of the free list by adding to it the block number of the previous head of the list. A sketch of this scheme appears below.
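Here is the promised sketch of the cached free list, assuming 128 four-byte block numbers fit in a block. The DiskStub stand-in and the in-memory layout (slot 0 holds the link to the next block on the list) are assumptions made so the example is self-contained; they are not the actual UNIX code.

// Stand-in for the disk: maps a block number to the 128 integers stored there.
class DiskStub {
    static java.util.Map<Integer, int[]> blocks = new java.util.HashMap<>();
    static int[] readInts(int blockNumber)             { return blocks.get(blockNumber); }
    static void writeInts(int blockNumber, int[] data) { blocks.put(blockNumber, data.clone()); }
}

class FreeList {
    static final int PER_BLOCK = 128;   // 4-byte block numbers per 512-byte block
    int count;                          // how many numbers are currently cached in memory
    int[] cache = new int[PER_BLOCK];   // cache[0] links to the next block on the list (0 = none)

    // The cache is seeded (say, by mkfs or from the superblock); count must be >= 1.
    FreeList(int[] initial, int initialCount) {
        System.arraycopy(initial, 0, cache, 0, initialCount);
        count = initialCount;
    }

    // Allocate a free block, refilling the cache from disk when it runs out.
    int allocate() {
        if (count == 1) {                        // only the link to the next list block remains
            int next = cache[0];
            if (next == 0)
                throw new RuntimeException("disk full");
            cache = DiskStub.readInts(next);     // pull in up to 128 more block numbers
            count = PER_BLOCK;
            return next;                         // the list block itself is now allocated
        }
        return cache[--count];
    }

    // Free a block, spilling the cache to the freed block when the cache is full.
    void free(int blockNumber) {
        if (count == PER_BLOCK) {
            DiskStub.writeInts(blockNumber, cache);  // the freed block now holds the cached numbers
            cache = new int[PER_BLOCK];
            cache[0] = blockNumber;                  // and becomes the new head of the list
            count = 1;
        } else {
            cache[count++] = blockNumber;
        }
    }
}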
How do these methods compare? Neither requires significant space overhead on disk. The
bitmap approach needs one bit for each block. Even for a tiny block size of 512 bytes, each bit of
the bitmap describes 512*8 = 4096 bits of disk space, so the overhead is less than 1/40 of 1%.
The free list is even better. All the pointers are stored in blocks that are free anyhow, so there is
no space overhead (except for one pointer to the head of the list). Another way of looking at this
is that when the disk is full (which is the only time we should be worried about space overhead!)
the free list is empty, so it takes up no space. The real advantage of bitmaps over free lists is that
they give the space allocator more control over which block is allocated to which file. Since the
blocks of a file are generally accessed together, we would like them to be near each other on
disk. To ensure this clustering, when we add a block to a file we would like to choose a free
block that is near the other blocks of a file. With a bitmap, we can search the bitmap for an
appropriate block. With a free list, we would have to search the free list on disk, which is clearly
impractical. Of course, to search the bitmap, we have to have it all in memory, but since the
bitmap is so tiny relative to the size of the disk, it is not unreasonable to keep the entire bitmap in
memory all the time. To do the comparable operation with a free list, we would need to keep the
block numbers of all free blocks in memory. If a block number is four bytes (32 bits), that means
that 32 times as much memory would be needed for the free list as for a bitmap. For a concrete
example, consider a 2 gigabyte disk with 8K blocks and 4-byte block numbers. The disk contains
2^31/2^13 = 2^18 = 262,144 blocks. If they are all free, the free list has 262,144 entries, so it would
take one megabyte of memory to keep them all in memory at once. By contrast, a bitmap
requires 2^18 bits, or 2^15 = 32K bytes (just four blocks). (On the other hand, the bit map takes the
same amount of memory regardless of the number of blocks that are free).
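Because the bitmap is small enough to keep in memory, a bitmap allocator can also search near a desired block to get clustering. Here is a minimal in-memory sketch; the class name and the near parameter are illustrative choices, not a particular file system's interface.

import java.util.BitSet;

// Sketch of a bitmap free-space allocator. A set bit means the block is
// allocated. allocate(near) prefers a free block close to 'near', which is
// how a bitmap supports the clustering described above.
class BlockBitmap {
    private final BitSet bits;
    private final int nblocks;

    BlockBitmap(int nblocks) {
        this.nblocks = nblocks;
        this.bits = new BitSet(nblocks);
    }

    // Allocate a free block as close as possible to 'near'; returns -1 if the disk is full.
    int allocate(int near) {
        for (int d = 0; d < nblocks; d++) {
            int lo = near - d, hi = near + d;
            if (lo >= 0 && !bits.get(lo)) { bits.set(lo); return lo; }
            if (hi < nblocks && !bits.get(hi)) { bits.set(hi); return hi; }
        }
        return -1;
    }

    void free(int blockNumber) {
        bits.clear(blockNumber);
    }
}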
Reliability
Disks fail, disks sectors get corrupted, and systems crash, losing the contents of volatile memory.
There are several techniques that can be used to mitigate the effects of these failures. We only
have room for a brief survey.
Bad-block Forwarding
When the disk drive writes a block of data, it also writes a checksum, a small number of
additional bits whose value is some function of the ``user data'' in the block. When the block is
read back in, the checksum is also read and compared with the data. If either the data or
checksum were corrupted, it is extremely unlikely that the checksum comparison will succeed.
Thus the disk drive itself has a way of discovering bad blocks with extremely high probability.
The hardware is also responsible for recovering from bad blocks. Modern disk drives do
automatic bad-block forwarding. The disk drive or controller is responsible for mapping block
numbers to absolute locations on the disk (cylinder, track, and sector). It holds a little bit of space
in reserve, not mapping any block numbers to this space. When a bad block is discovered, the
disk allocates one of these reserved blocks and maps the block number of the bad block to the
replacement block. All references to this block number access the replacement block instead of
the bad block. There are two problems with this scheme. First, when a block goes bad, the data in
it is lost. In practice, blocks tend to be bad from the beginning, because of small defects in the
surface coating of the disk platters. There is usually a stand-alone formatting program that tests
all the blocks on the disk and sets up forwarding entries for those that fail. Thus the bad blocks
never get used in the first place. The main reason for the forwarding is that it is just too hard
(expensive) to create a disk with no defects. It is much more economical to manufacture a
``pretty good'' disk and then use bad-block forwarding to work around the few bad blocks. The
other problem is that forwarding interferes with the OS's attempts to lay out files optimally. The
OS may think it is doing a good job by assigning consecutive blocks of a file to consecutive
block numbers, but if one of those blocks is forwarded, it may be very far away from the others. In
practice, this is not much of a problem since a disk typically has only a handful of forwarded
sectors out of millions.
The software can also help avoid bad blocks by simply leaving them out of the free list (or
marking them as allocated in the allocation bitmap).
Back-up Dumps
There are a variety of storage media that are much cheaper than (hard) disks but are also much
slower. An example is 8 millimeter video tape. A ``two-hour'' tape costs just a few dollars and
can hold two gigabytes of data. By contrast, a 2GB hard drive currently costs several hundred
dollars. On the other hand, while worst-case access time to a hard drive is a few tens of
milliseconds, rewinding or fast-forwarding a tape to the desired location can take several minutes.
One way to use tapes is to make periodic back up dumps. Dumps are really used for two
different purposes:
• To recover lost files. Files can be lost or damaged by hardware failures, but far more often they are lost through software bugs or human error (accidentally deleting the wrong file). If the file is saved on tape, it can be restored.
• To recover from catastrophic failures. An entire disk drive can fail, or the whole computer can be stolen, or the building can burn down. If the contents of the disk have been saved to tape, the data can be restored (to a repaired or replacement disk). All that is lost is the work that was done since the information was dumped.
Corresponding to these two ways of using dumps, there are two ways of doing dumps. A
physical dump simply copies all of the blocks of the disk, in order, to tape. It's very fast, both for
doing the dump and for recovering a whole disk, but it makes it extremely slow to recover any
one file. The blocks of the file are likely to be scattered all over the tape, and while seeks on disk
can take tens of milliseconds, seeks on tape can take tens or hundreds of seconds. The other
approach is a logical dump, which copies each file sequentially. A logical dump makes it easy to
restore individual files. It is even easier to restore files if the directories are dumped separately at
the beginning of the tape, or if the name(s) of each file are written to the tape along with the file.
The problem with logical dumping is that it is very slow. Dumps are usually done much more
frequently than restores. For example, you might dump your disk every night for three years
before something goes wrong and you need to do a restore. An important trick that can be used
with logical dumps is to only dump files that have changed recently. An incremental dump saves
only those files that have been modified since a particular date and time. Fortunately, most file
systems record the time each file was last modified. If you do a backup each night, you can save
only those files that have changed since the last backup. Every once in a while (say once a
month), you can do a full backup of all files. In UNIX jargon, a full backup is called an epoch
(pronounced ``eepock'') dump, because it dumps everything that has changed since ``the epoch''--January 1, 1970, which is the earliest possible date in UNIX.
The Computer Sciences department currently does backup dumps on about 260 GB of disk
space. Epoch dumps are done once every 14 days, with the timing on different file systems
staggered so that about 1/14 of the data is dumped each night. Daily incremental dumps save
about 6-10% of the data on each file system.
Incremental dumps go fast because they dump only a small fraction of the files, and they don't
take up a lot of tape. However, they introduce new problems:
• If you want to restore a particular file, you need to know when it was last modified so that you know which dump tape to look at.
• If you want to restore the whole disk (to recover from a catastrophic failure), you have to restore from the last epoch dump, and then from every incremental dump since then, in order. A file that is modified every day will appear on every tape. Each restore will overwrite the file with a newer version. When you're done, everything will be up-to-date as of the last dump, but the whole process can be extremely slow (and labor-intensive).
• You have to keep around all the incremental tapes since the last epoch. Tapes are cheap, but they're not free, and storing them can be a hassle.
The first problem can be solved by keeping a directory of what was dumped when. A bunch of
UW alumni (the same guys that invented NFS) have made themselves millionaires by marketing
software to do this. The other problems can be solved by a clever trick. Each dump is assigned a
positive integer level. A level n dump is an incremental dump that dumps all files that have
changed since the most recent previous dump with a level greater than or equal to n. An epoch
dump is considered to have infinitely high level. Levels are assigned to dumps as follows: dump
number n gets level k+1, where 2^k is the largest power of two that divides n, giving the
sequence 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1, 5, and so on.
This scheme is sometimes called a ruler schedule for obvious reasons. Level-1 dumps only save
files that have changed in the previous day. Level-2 dumps save files that have changed in the
last two days, level-3 dumps cover four days, level-4 dumps cover 8 days, etc. Higher-level
dumps will thus include more files (so they will take longer to do), but they are done
infrequently. The nice thing about this scheme is that you only need to save one tape from each
level, and the number of levels is the logarithm of the interval between epoch dumps. Thus, even
if you did a dump each night and did an epoch dump only once a year, you would need only
nine levels (hence nine tapes). That also means that a full restore needs at worst one restore from
each of nine tapes (rather than 365 tapes!). To figure out what tapes you need to restore from if
your disk is destroyed after dump number n, express n in binary, and number the bits from right
to left, starting with 1. The 1 bits tell you which dump tapes to use. Restore them in order of
decreasing level. For example, 20 in binary is 10100, so if the disk is destroyed after the 20th
dump, you only need to restore from the epoch dump and from the most recent dumps at levels 5
and 3.
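The binary trick in the last paragraph is easy to mechanize. The following sketch (the class and method names are ours, not part of any real dump utility) lists, in decreasing order of level, which dump levels to restore from after the epoch dump.

// Given the number of the last dump before the failure, list the dump levels
// to restore from (after the epoch dump), by reading off the 1 bits of n.
class RulerRestore {
    static java.util.List<Integer> levelsToRestore(int n) {
        java.util.List<Integer> levels = new java.util.ArrayList<>();
        for (int bit = 31; bit >= 0; bit--)          // bit i (counting from 0) corresponds to level i+1
            if ((n & (1 << bit)) != 0)
                levels.add(bit + 1);
        return levels;
    }

    public static void main(String[] args) {
        // 20 is 10100 in binary, so restore from the epoch dump, then the most
        // recent level-5 dump, then the most recent level-3 dump.
        System.out.println(levelsToRestore(20));     // prints [5, 3]
    }
}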
Consistency Checking
Some of the information in a file system is redundant. For example, the free list could be
reconstructed by checking which blocks are not in any file. Redundancy arises because the same
information is represented in different forms to make different operations faster. If you want to
know which blocks are in a given file, look at the inode. If you want to know which blocks
are not in any inode, use the free list. Unfortunately, various hardware and software errors can
cause the data to become inconsistent. File systems often include a utility that checks for
consistency and optionally attempts to repair inconsistencies. These programs are particularly
handy for cleaning up the disks after a crash.
UNIX has a utility called fsck (``file system check''). It has two principal tasks. First, it checks that blocks are
properly allocated. Each inode is supposed to be the root of a tree of blocks, the free list is
supposed to be a tree of blocks, and each block is supposed to appear in exactly one of these
trees. Fsck runs through all the inodes, checking each allocated inode for reasonable values,
and walking through the tree of blocks rooted at the inode. It maintains a bit vector to record
which blocks have been encountered. If a block is encountered that has already been seen, there is
a problem: Either it occurred twice in the same file (in which case it isn't a tree), or it occurred in
two different files. A reasonable recovery would be to allocate a new block, copy the contents of
the problem block into it, and substitute the copy for the problem block in one of the two places
where it occurs. It would also be a good idea to log an error message so that a human being can
check up later to see what's wrong. After all the files are scanned, any block that hasn't been
found should be on the free list. It would be possible to scan the free list in a similar manner, but
it's probably easier just to rebuild the free list from the set of blocks that were not found in any
file. If a bitmap instead of a free list is used, this step is even easier: Simply overwrite the file
system's bitmap with the bitmap constructed during the scan.
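Here is a minimal sketch of that block-allocation pass, assuming an in-memory array of inodes, each of which can list its blocks. The Inode class and blockList method are stand-ins for walking the real tree of direct and indirect blocks; duplicates are only reported here, not repaired.

import java.util.*;

// Sketch of the block-allocation pass of a consistency checker.
class BlockCheck {
    static BitSet scan(Inode[] inodes, int nblocks) {
        BitSet seen = new BitSet(nblocks);
        for (Inode ino : inodes) {
            if (ino == null || !ino.allocated)
                continue;
            for (int b : ino.blockList()) {
                if (seen.get(b))
                    System.err.println("block " + b + " appears more than once");
                seen.set(b);
            }
        }
        return seen;   // blocks NOT set here should be on the free list (or free in the bitmap)
    }
}

class Inode {
    boolean allocated;
    List<Integer> blocks = new ArrayList<>();
    List<Integer> blockList() { return blocks; }   // a real version would also walk indirect blocks
}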
The other main consistency requirement concerns the directory structure. The set of directories is
supposed to be a tree, and each inode is supposed to have a link count that indicates how many
times it appears in directories. The tree structure could be checked by a recursive walk through
the directories, but it is more efficient to combine this check with the walk through the inodes
that checks for disk blocks, but recording, for each directory inode encountered, the inumber of
its parent. The set of directories is a tree if and only if and only if every directory other than the
root has a unique parent. This pass can also rebuild the link count for each inode by maintaining
in memory an array with one slot for each inumber. Each time the inumber is found in a
directory, increment the corresponding element of the array. The resulting counts should match
the link counts in the inodes. If not, correct the counts in the inodes.
This illustrates a very important principle that pops up throughout operating system
implementation (indeed, throughout any large software system): the doctrine of hints and
absolutes. Whenever the same fact is recorded in two different ways, one of them should be
considered the absolute truth, and the other should be considered a hint. Hints are handy because
they allow some operations to be done much more quickly than they could if only the absolute
information was available. But if the hint and the absolute do not agree, the hint can be rebuilt
from the absolutes. In a well-engineered system, there should be some way to verify a hint
whenever it is used. UNIX is a bit lax about this. The link count is a hint (the absolute
information is a count of the number of times the inumber appears in directories), but UNIX
treats it like an absolute during normal operation. As a result, a small error can snowball into
completely trashing the file system.
For another example of hints, each allocated block could have a header containing the inumber
of the file containing it and its offset in the file. There are systems that do this (UNIX isn't one of
them). The tree of blocks rooted at an inode then becomes a hint, providing an efficient way of
finding a block, but when the block is found, its header could be checked. Any inconsistency
would then be caught immediately, and the inode structures could be rebuilt from the
information in the block headers.
By the way, if the link count calculated by the scan is zero (i.e., the inode, although marked as
allocated, does not appear in any directory), it would not be prudent to delete the file. A better
recovery is to add an entry to a special lost+found directory pointing to the orphan inode, in case
it contains something really valuable.
Transactions
The previous section talks about how to recover from situations that ``can't happen.'' How do
these problems arise in the first place? Wouldn't it be better to prevent these problems rather than
recover from them after the fact? Many of these problems arise, particularly after a crash,
because some operation was ``half-completed.'' For example, suppose the system was in the
middle of executing an unlink system call when the lights went out. An unlink operation involves
several distinct steps:
• remove an entry from a directory,
• decrement a link count, and if the count goes to zero,
• move all the blocks of the file to the free list, and
• free the inode.
If the crash occurs between the first and second steps, the link count will be wrong. If it occurs
during the third step, a block may be linked both into the file and the free list, or neither,
depending on the details of how the code is written. And so on...
To deal with this kind of problem in a general way, transactions were invented. Transactions
were first developed in the context of database management systems, and are used heavily there,
so there is a tradition of thinking of them as ``database stuff'' and teaching about them only in
database courses and textbooks. But they really are an operating system concept. Here's a two-bit introduction.
We have already seen a mechanism for making complex operations appear atomic. It is called a
critical section. Critical sections have a property that is sometimes called synchronization
atomicity. It is also called serializability because if two processes try to execute their critical
sections at about the same time, the net effect will be as if they occurred in some serial order. If
systems can crash (and they can!), synchronization atomicity isn't enough. We need another
property, called failure atomicity, which means an ``all or nothing'' property: Either all of the
modifications of nonvolatile storage complete or none of them do.
There are basically two ways to implement failure atomicity. They both depend on the fact that
writing a single block to disk is an atomic operation. The first approach is called logging. An
append-only file called a log is maintained on disk. Each time a transaction does something to
file-system data, it creates a log record describing the operation and appends it to the log. The
log record contains enough information to undo the operation. For example, if the operation
made a change to a disk block, the log record might contain the block number, the length and
offset of the modified part of the block, and the original content of that region. The
transaction also writes a begin record when it starts, and a commit record when it is done. After a
crash, a recovery process scans the log looking for transactions that started (wrote a begin
record) but never finished (wrote a commit record). If such a transaction is found, its partially
completed operations are undone (in reverse order) using the undo information in the log
records.
Sometimes, for efficiency, disk data is cached in memory. Modifications are made to the cached
copy and only written back out to disk from time to time. If the system crashes before the
changes are written to disk, the data structures on disk may be inconsistent. Logging can also be
used to avoid this problem by putting into each log record redo information as well as undo
information. For example, the log record for a modification of a disk block should contain both
the old and new value. After a crash, if the recovery process discovers a transaction that has
completed, it uses the redo information to make sure the effects of all of its operations are
reflected on disk. Full recovery is always possible provided that:
• The log records are written to disk in order,
• The commit record is written to disk when the transaction completes, and
• The log record describing a modification is written to disk before any of the changes made by that operation are written to disk.
This algorithm is called write-ahead logging.
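Here is a toy sketch of write-ahead logging for block updates, with the log and the disk simulated in memory so the example is self-contained. The record format and method names are illustrative assumptions, not a real file system's log format.

import java.util.*;

// Toy write-ahead log: each UPDATE record carries undo (old value) and redo
// (new value) information and is appended before the data block is changed.
class WriteAheadLog {
    static class LogRecord {
        int txn; String kind; int block; byte[] oldVal, newVal;
        LogRecord(int txn, String kind, int block, byte[] o, byte[] n) {
            this.txn = txn; this.kind = kind; this.block = block; this.oldVal = o; this.newVal = n;
        }
    }

    static List<LogRecord> log = new ArrayList<>();   // stands in for the on-disk log
    static byte[][] disk = new byte[16][4];           // tiny fake disk: 16 blocks of 4 bytes

    static void begin(int txn)  { log.add(new LogRecord(txn, "BEGIN", -1, null, null)); }
    static void commit(int txn) { log.add(new LogRecord(txn, "COMMIT", -1, null, null)); }

    static void update(int txn, int block, byte[] newVal) {
        log.add(new LogRecord(txn, "UPDATE", block, disk[block].clone(), newVal.clone()));
        disk[block] = newVal.clone();                 // change the data only after logging it
    }

    // After a crash, undo (in reverse order) the updates of transactions that
    // have no COMMIT record in the log.
    static void recover() {
        Set<Integer> committed = new HashSet<>();
        for (LogRecord r : log)
            if (r.kind.equals("COMMIT")) committed.add(r.txn);
        for (int i = log.size() - 1; i >= 0; i--) {
            LogRecord r = log.get(i);
            if (r.kind.equals("UPDATE") && !committed.contains(r.txn))
                disk[r.block] = r.oldVal.clone();     // undo an uncommitted change
        }
    }
}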
The other way of implementing transactions is called shadow blocks. Suppose the data structure
on disk is a tree. The basic idea is never to change any block (disk block) of the data structure in
place. Whenever you want to modify a block, make a copy of it (called a shadow of it) instead,
and modify the parent to point to the shadow. Of course, to make the parent point to the shadow
you have to modify it, so instead you make a shadow of the parent and modify that. In this
way, you shadow not only each block you really wanted to modify, but also all the blocks on the
path from it to the root. You keep the shadow of the root block in memory. At the end of the
transaction, you make sure the shadow blocks are all safely written to disk and then write the
shadow of the root directly onto the root block. If the system crashes before you overwrite the
root block, there will be no permanent change to the tree on disk. Overwriting the root block has
the effect of linking all the modified (shadow blocks) into the tree and removing all the old
blocks. Crash recovery is simply a matter of garbage collection. If the crash occurs before the
root was overwritten, all the shadow blocks are garbage. If it occurs after, the blocks they
replaced are garbage. In either case, the tree itself is consistent, and it is easy to find the garbage
blocks (they are blocks that aren't in the tree).
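Here is a small sketch of the shadow idea for a binary tree of blocks kept in memory; the Node class and the path encoding are illustrative assumptions. Nothing is modified in place: updating a leaf copies every node on the path from the root, and the transaction commits by atomically replacing the root.

// Sketch of shadowing: updating a leaf builds a new path of copied nodes and
// returns a new root, leaving the original tree untouched.
class ShadowTree {
    static class Node {
        byte[] data;
        Node left, right;
        Node(byte[] d, Node l, Node r) { data = d; left = l; right = r; }
    }

    // Return a shadow copy of the tree in which the leaf reached by 'path'
    // (false = left, true = right) holds newData. 'path' must lead to an existing leaf.
    static Node updateLeaf(Node root, boolean[] path, int depth, byte[] newData) {
        if (depth == path.length)
            return new Node(newData, null, null);                 // shadow of the leaf itself
        if (!path[depth])                                         // shadow the node on the path, too
            return new Node(root.data, updateLeaf(root.left, path, depth + 1, newData), root.right);
        else
            return new Node(root.data, root.left, updateLeaf(root.right, path, depth + 1, newData));
    }
}

Committing amounts to the single atomic write that replaces the on-disk root block with the shadow root; until that write happens, a crash leaves the original tree intact, and afterwards the old copies of the shadowed blocks are simply garbage.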
Database systems almost universally use logging, and shadowing is mentioned only in passing in
database texts. But the shadowing technique is used in a variant of the UNIX file system called
(somewhat misleadingly) the Log-structured File System (LFS). The entire file system is made
into a tree by replacing the array of inodes with a tree of inodes. LFS has the added advantage
(beyond reliability) that all blocks are written sequentially, so write operations are very fast. It
has the disadvantage that files that are modified here and there by random access tend to have
their blocks scattered about, but that pattern of access is comparatively rare, and there are
techniques to cope with it when it occurs. The main source of complexity in LFS is figuring out
when and how to do the ``garbage collection.''
Performance
The main trick to improve file system performance (like anything else in computer science) is
caching. The system keeps a disk cache (sometimes also called a buffer pool) of recently used
disk blocks. In contrast with the page frames of virtual memory, where there were all sorts of
algorithms proposed for managing the cache, management of the disk cache is pretty simple. On
the whole, it is simply managed LRU (least recently used). Why is it that for paging we went to
great lengths trying to come up with an algorithm that is ``almost as good as LRU'' while here we
can simply use true LRU? The problem with implementing LRU is that some information has to
be updated on every single reference. In the case of paging, references can be as frequent as
every instruction, so we have to make do with whatever information hardware is willing to give
us. The best we can hope for is that the paging hardware will set a bit in a page-table entry. In the
case of file system disk blocks, however, each reference is the result of a system call, and adding
a few extra instructions to a system call for cache maintenance is not unreasonable.
Adding page caching to the file system implementation is actually quite simple. Somewhere in
the implementation, there is probably a procedure that gets called when the system wants to
access a disk block. Let's suppose the procedure simply allocates some memory space to hold the
block and reads it into memory.
Block readBlock(int blockNumber) {
    Block result = new Block();
    Disk.read(blockNumber, result);
    return result;
}
To add caching, all we have to do is modify this code to search the disk cache first.
class CacheEntry {
    int blockNumber;
    Block buffer;
    CacheEntry next, previous;
}

class DiskCache {
    CacheEntry head, tail;

    CacheEntry find(int blockNumber) {
        // Search the list for an entry with a matching block number.
        // If not found, return null.
    }

    void moveToFront(CacheEntry entry) {
        // Move the entry to the head of the list.
    }

    CacheEntry oldest() {
        return tail;
    }

    Block readBlock(int blockNumber) {
        CacheEntry entry = find(blockNumber);
        if (entry == null) {
            entry = oldest();
            Disk.read(blockNumber, entry.buffer);
            entry.blockNumber = blockNumber;
        }
        moveToFront(entry);
        return entry.buffer;
    }
}
This code is not quite right, because it ignores writes. If the oldest buffer is dirty (it has been
modified since it was read from disk), it first has to be written back to the disk before it can be
used to hold the new block. Most systems actually write dirty buffers back to the disk sooner
than necessary to minimize the damage caused by a crash. The original version of UNIX had a
background process that would write all dirty buffers to disk every 30 seconds. Some
information is more critical than others. Some versions of UNIX, for example, write back
directory blocks (the data blocks of files of type directory) each time they are
modified. This technique--keeping the block in the cache but writing its contents back to disk
after any modification--is called write-through caching. (Some modern versions of UNIX use
techniques inspired by database transactions to minimize the effects of crashes).
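As an illustration of the dirty-buffer handling described above, here is a self-contained sketch of a write-back cache with LRU replacement. It uses Java's LinkedHashMap in access order instead of the hand-built list in the code above, and the Block and FakeDisk classes here are trivial stand-ins so the example runs on its own.

import java.util.*;

class Block {                                           // stand-in for the Block class used above
    byte[] data = new byte[512];
}

class FakeDisk {                                        // stand-in for the real Disk interface
    static Map<Integer, Block> blocks = new HashMap<>();
    static Block read(int n)          { return blocks.getOrDefault(n, new Block()); }
    static void write(int n, Block b) { blocks.put(n, b); }
}

// Write-back cache: dirty blocks reach the disk only on eviction or sync().
class WriteBackCache {
    static final int CAPACITY = 64;
    // an access-ordered LinkedHashMap gives us LRU ordering for free
    private final LinkedHashMap<Integer, Block> cache = new LinkedHashMap<>(CAPACITY, 0.75f, true);
    private final Set<Integer> dirty = new HashSet<>();

    private void evictIfFull() {
        if (cache.size() < CAPACITY) return;
        int victim = cache.keySet().iterator().next();  // least recently used entry
        Block b = cache.remove(victim);
        if (dirty.remove(victim))
            FakeDisk.write(victim, b);                  // write back only if modified
    }

    Block readBlock(int blockNumber) {
        Block b = cache.get(blockNumber);
        if (b == null) {
            evictIfFull();
            b = FakeDisk.read(blockNumber);
            cache.put(blockNumber, b);
        }
        return b;
    }

    void writeBlock(int blockNumber, Block data) {
        if (!cache.containsKey(blockNumber))
            evictIfFull();
        cache.put(blockNumber, data);
        dirty.add(blockNumber);                         // reaches disk on eviction or sync
    }

    void sync() {                                       // like UNIX's periodic flush of dirty buffers
        for (int n : dirty)
            FakeDisk.write(n, cache.get(n));
        dirty.clear();
    }
}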
LRU management automatically does the ``right thing'' for most disk blocks. If someone is
actively manipulating the files in a directory, all of the directory's blocks will probably be in the
cache. If a process is scanning a large file, all of its indirect blocks will probably be in memory
most of the time. But there is one important case where LRU is not the right policy. Consider a
process that is traversing (reading or writing) a file sequentially from beginning to end. Once that
process has read or written the last byte of a block, it will not touch that block again. The system
might as well immediately move the block to the tail of the list as soon as the read or write
request completes. Tanenbaum calls this technique free behind. It is also sometimes called most
recently used (MRU) to contrast it with LRU. How does the system know to handle certain
blocks MRU? There are several possibilities.
• If the operating system interface distinguishes between random-access files and sequential files, it is easy. Data blocks of sequential files should be managed MRU.
• In some systems, all files are alike, but there is a different kind of open call, or a flag passed to open, that indicates whether the file will be accessed randomly or sequentially.
• Even if the OS gets no explicit information from the application program, it can watch the pattern of reads and writes. If recent history indicates that all (or most) reads or writes of the file have been sequential, the data blocks should be managed MRU.
A similar trick is called read-ahead. If a file is being read sequentially, it is a good idea to read a
few blocks at a time. This cuts down on the latency for the application (most of the time the data
the application wants is in memory before it even asks for it). If the disk hardware allows
multiple blocks to be read at a time, it can cut the number of disk read requests, cutting down on
overhead such as the time to service an I/O completion interrupt. If the system has done a good
job of clustering together the blocks of the file, read-ahead also takes better advantage of the
clustering. If the system reads one block at a time, another process, accessing a different file,
could make the disk head move away from the area containing the blocks of this file between
accesses.
The Berkeley file system introduced another trick to improve file system performance. They
divided the disk into chunks, which they called cylinder groups (CGs) because each one is
comprised of some number of adjacent cylinders. Each CG is like a miniature disk. It has its own
super block and array of inodes. The system attempts to put all the blocks of a file in the same
CG as its inode. It also tries to keep all the inodes in one directory together in the same CG so
that operations like
ls -l *
will be fast. It uses a variety of techniques to assign inodes and blocks to CGs in such a way as
to distribute the free space fairly evenly between them, so there will be enough room to do this
clustering. In particular,
• When a new file is created, its inode is placed in the same CG as its parent directory (if possible). But when a new directory is created, its inode is placed in the CG with the largest amount of free space (so that the files in the directory will be able to be near each other).
• When blocks are added to a file, they are allocated (if possible) from the same CG that contains its inode. But when the size of the file crosses certain thresholds (say every megabyte or so), the system switches to a different CG, one that is relatively empty. The idea is to prevent a big file from hogging all the space in one CG and preventing other files in the CG from being well clustered.
This Java declaration is actually a bit of a lie. In Java, an instance of class Dirent would include
some header information indicating that it is a Dirent object, a two-byte short integer, and a
pointer to an array object (which contains information about its type and length, in addition to the
14 bytes of data). The actual representation is given by the C (or C++) declaration
struct direct {
    unsigned short int inumber;
    char name[14];
};
Unfortunately, there's no way to represent this in Java.
The DirentLong declaration is also a lie, in that the field byte name[] is intended to indicate an
array of indeterminate length, rather than a pointer to an array. The actual C declaration is
struct dirent {
    unsigned long int inumber;
    unsigned short int reclen;
    unsigned short int namelen;
    char name[256];
};
The array size 256 is a lie. The code depends on the fact that the C language does not do any
array bounds checking. The dictionary defines epoch as an instant of time or a date selected as a
point of reference in astronomy.
Critical sections are usually implemented so that they actually occur one after the other, but all
that is required is that they behave as if they were serialized. For example, if neither transaction
modifies anything, or if they don't touch any overlapping data, they can be run concurrently
without any harm. Database implementations of transactions go to a great deal of trouble to
allow as much concurrency as possible.
7.6 Exercises
1. Researchers have suggested that, instead of having an access list associated with each file
(specifying which users can access the file, and how), we should have a user control list
associated with each user (specifying which files a user can access, and how). Discuss the
relative merits of these two schemes.
Answer:
• File access list (one list per file): since the access control information for a file is concentrated in one place, it is easier to change, and it requires less space.
• User control list (one list per user): this requires less overhead when opening a file.
2. Consider a file currently consisting of 100 blocks. Assume that the file control block (and the
index block, in the case of indexed allocation) is already in memory. Calculate how many
disk I/O operations are required for contiguous, linked, and indexed (single-level) allocation
strategies, if, for one block, the following conditions hold. In the contiguous-allocation case,
assume that there is no room to grow in the beginning, but there is room to grow in the end.
Assume that the block information to be added is stored in memory.
(a) The block is added at the beginning.
(b) The block is added in the middle.
(c) The block is added at the end.
(d) The block is removed from the beginning.
(e) The block is removed from the middle.
(f) The block is removed from the end.
Answer:
        Contiguous   Linked   Indexed
(a)        201          1        1
(b)        101         52        1
(c)          1          3        1
(d)        198          1        0
(e)         98         52        0
(f)          0        100        0
3. What problems could occur if a system allowed a file system to be mounted simultaneously
at more than one location?
Answer:
There would be multiple paths to the same file, which could confuse users or encourage
mistakes (deleting a file with one path deletes the file in all the other paths).
4. Why must the bitmap for file allocation be kept on mass storage, rather than in main
memory?
Answer:
In case of a system crash (memory failure), the free-space information would not be lost, as it
would be if the bitmap had been stored only in main memory.
5. Consider a system that supports the strategies of contiguous, linked, and indexed allocation.
What criteria should be used in deciding which strategy is best utilized for a particular file?
Answer:
• Contiguous - if the file is usually accessed sequentially and is relatively small.
• Linked - if the file is large and usually accessed sequentially.
• Indexed - if the file is large and usually accessed randomly.
8. Protection and Security
The terms protection and security are often used together, and the distinction between them is a
bit blurred, but security is generally used in a broad sense to refer to all concerns about
controlled access to facilities, while protection describes specific technological mechanisms that
support security.
8.1 User Security
As in any other area of software design, it is important to distinguish between policies and
mechanisms. Before you can start building machinery to enforce policies, you need to establish
what policies you are trying to enforce. Many years ago, there was a story about a software firm
that was hired by a small savings and loan corporation to build a financial accounting system.
The chief financial officer used the system to embezzle millions of dollars and fled the country.
The losses were so great the S&L went bankrupt, and the loss of the contract was so bad the
software company also went belly-up. Did the accounting system have a good or bad security
design? The problem wasn't unauthorized access to information, but rather authorization to the
wrong person. The situation is analogous to the old saw that every program is correct according
to some specification. Unfortunately, we don't have the space to go into the whole question
of security policies here. We will just assume that terms like ``authorized access'' have some
well-defined meaning in a particular context.
Threats
Any discussion of security must begin with a discussion of threats. After all, if you don't know
what you're afraid of, how are you going to defend against it? Threats are generally divided into
three main categories.
• Unauthorized disclosure. A ``bad guy'' gets to see information he has no right to see
(according to some policy that defines ``bad guy'' and ``right to see'').
• Unauthorized updates. The bad guy makes changes he has no right to make.
• Denial of service. The bad guy interferes with legitimate access by other users.
There is a wide spectrum of denial-of-service threats. At one end, it overlaps with the previous
category. A bad guy deleting a good guy's file could be considered an unauthorized update. At the
other end of the spectrum, blowing up a computer with a hand grenade is not usually considered
an unauthorized update. As this second example illustrates, some denial-of-service threats can
only be countered by physical security. No matter how well your OS is designed, it can't protect
your files from a hand grenade. Another form of denial-of-service threat comes from
unauthorized consumption of resources, such as filling up the disk, tying up the CPU with an
infinite loop, or crashing the system by triggering some bug in the OS. While there are software
defenses against these threats, they are generally considered in the context of other parts of the
OS rather than security and protection. In short, discussion of software mechanisms for computer
security generally focuses on the first two threats.
In response to these threats, countermeasures also fall into various categories. As programmers,
we tend to think of technological tricks, but it is also important to realize that a complete security
design must involve physical components (such as locking the computer in a secure building
with armed guards outside) and human components (such as a background check to make sure
your CFO isn't a crook, or checking to make sure those armed guards aren't taking bribes).
The Trojan Horse
Break-in techniques come in numerous forms. One general category of attack that comes in a
great variety of disguises is the Trojan Horse scam. The name comes from Greek mythology.
The ancient Greeks were attacking the city of Troy, which was surrounded by an impenetrable
wall. Unable to get in, they left a huge wooden horse outside the gates as a ``gift'' and pretended
to sail away. The Trojans brought the horse into the city, where they discovered that the horse
was filled with Greek soldiers who defeated the Trojans to win the Rose Bowl (oops, wrong
story). In software, a Trojan Horse is a program that does something useful--or at least appears to
do something useful--but also subverts security somehow. In the personal computer world,
Trojan horses are often computer games infected with ``viruses.''
Here's the simplest Trojan Horse attack: log onto a public terminal and start a program that
does something like this:
print("login:");                   // imitate the system's normal login prompt
name = readALine();
turnOffEchoing();                  // real login programs also suppress echoing
print("password:");
passwd = readALine();              // capture the victim's password
sendMail("badguy",name,passwd);    // ship the stolen credentials to the attacker
print("login incorrect");          // look like an ordinary typo and give up
exit();
A user walking up to the terminal will think it is idle. He will attempt to log in, typing his login
name and password. The Trojan Horse program sends this information to the bad guy, prints the
message login incorrect and exits. After the program exits, the system will generate a legitimate
login: message and the user, thinking he mistyped his password (a common occurrence because
the password is not echoed) will try again, log in successfully, and have no suspicion that
anything was wrong. Note that the Trojan Horse program doesn't actually have to do anything
useful; it just has to appear to.
Design Principles
• Public Design. A common mistake is to try to keep a system secure by keeping its
algorithms secret. That's a bad idea for many reasons. First, it gives a kind of all-or-nothing
security: as soon as anybody learns about the algorithm, security is all gone. In
the words of Benjamin Franklin, ``Three may keep a secret, if two of them are dead.''
Second, it is usually not that hard to figure out the algorithm by seeing how the system
responds to various inputs, decompiling the code, etc. Third, publishing the algorithm can
have beneficial effects. The bad guys have probably already figured out your algorithm
and found its weak points. If you publish it, perhaps some good guys will notice bugs or
loopholes and tell you about them so you can fix them.
• Default is No Access. Start out by granting as little access as possible and adding privileges
only as needed. If you forget to grant access where it is legitimately needed, you'll soon
find out about it. Users seldom complain about having too much access.
• Timely Checks. Checks tend to ``wear out.'' For example, the longer you use the same
password, the higher the likelihood that it will be stolen or deciphered. Be careful: this
principle can be overdone. Systems that force users to change passwords frequently
encourage them to use particularly bad ones. A system that forced users to supply a
password every time they wanted to open a file would inspire all sorts of ingenious ways
to avoid the protection mechanism altogether.
• Minimum Privilege. This is an extension of the ``Default is No Access'' principle. A person
(or program or process) should be given just enough power to get the job done. In other
contexts, this principle is called ``need to know.'' It implies that the protection mechanism
has to support fine-grained control.
• Simple, Uniform Mechanisms. Any piece of software should be as simple as possible
(but no simpler!) to maximize the chances that it is correctly and efficiently implemented.
This is particularly important for protection software, since bugs are likely to be usable as
security loopholes. It is also important that the interface to the protection mechanisms be
simple, easy to understand, and easy to use. It is remarkably hard to design good,
foolproof security policies; policy designers need all the help they can get.
• Appropriate Levels of Security. You don't store your best silverware in a box on the front
lawn, but you also don't keep it in a vault at the bank. The US Strategic Air Defense calls
for a different level of security than the records of grades for this course. Not only does an
excessive security mechanism add unnecessary cost and performance degradation, it
can actually lead to a less secure system: if the protection mechanisms are too hard to
use, users will go out of their way to avoid using them.
Authentication
Authentication is a process by which one party convinces another of its identity. A familiar
instance is the login process, through which a human user convinces the computer system that he
has the right to use a particular account. If the login is successful, the system creates a process
and associates with it the internal identifier that identifies the account. Authentication occurs in
other contexts, and it isn't always a human being that is being authenticated. Sometimes a
process needs to authenticate itself to another process. In a networking environment, a computer
may need to authenticate itself to another computer. In general, let's call the party that wants to
be authenticated the client and the other party the server.
One common technique for authentication is the use of a password. This is the technique used
most often for login. There is a value, called the password, that is known to both the server and to
legitimate clients. The client tells the server who he claims to be and supplies the password as
proof. The server compares the supplied password with what it knows to be the true password
for that user.
Although this is a common technique, it is not a very good one. There are lots of things wrong
with it.
Direct attacks on the password.
The most obvious way of breaking in is a frontal assault on the password. Simply try all possible
passwords until one works. The main defense against this attack is the time it takes to try lots of
possibilities. If the client is a computer program (perhaps masquerading as a human being), it can
try lots of combinations very quickly, but if the password is long enough, even the fastest
computer cannot succeed in a reasonable amount of time. If the password is a string of 8
letters and digits, there are 2,821,109,907,456 possibilities. A program that tried one combination
every millisecond would take 89 years to get through them all. If users are allowed to pick their
own passwords, they are likely to choose ``cute doggie names'', common words, names of family
members, etc. That cuts down the search space considerably. A password cracker can go through
dictionaries, lists of common names, etc. It can also use biographical information about the user
to narrow the search space. There are several defenses against this sort of attack.
• The system chooses the password. The problem with this is that the password will not be
easy to remember, so the user will be tempted to write it down or store it in a file, making
it easy to steal. This is not a problem if the client is not a human being.
• The system rejects passwords that are too ``easy to guess.'' In effect, it runs a password
cracker when the user tries to set his password and rejects the password if the cracker
succeeds. This has many of the disadvantages of the previous point. Besides, it leads to a
sort of arms race between crackers and checkers.
• The password check is artificially slowed down, so that it takes longer to go through lots
of possibilities. One variant of this idea is to hang up a dial-in connection after three
unsuccessful login attempts, forcing the bad guy to take the time to redial.
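As a quick check on the figures quoted above, the short Python sketch below recomputes the size of
the search space and the time needed at the one-guess-per-millisecond rate assumed in the text; the
numbers in it are just those illustrative assumptions, not properties of any real system.

# Back-of-the-envelope estimate for brute-forcing an 8-character password
# drawn from 26 letters + 10 digits.
alphabet_size = 36
length = 8

combinations = alphabet_size ** length        # 2,821,109,907,456 possibilities
guesses_per_second = 1000                     # one guess per millisecond, as in the text

seconds = combinations / guesses_per_second
years = seconds / (60 * 60 * 24 * 365)

print(f"{combinations:,} combinations")                # 2,821,109,907,456 combinations
print(f"about {years:.0f} years to try them all")      # about 89 years to try them all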
Eavesdropping.
This is a far bigger problem for passwords than brute-force attacks. It comes in many disguises.
• Looking over someone's shoulder while he's typing his password. Most systems turn off
echoing, or echo each character as an asterisk, to mitigate this problem.
• Reading the password file. In order to verify that the password is correct, the server has to
have it stored somewhere. If the bad guy can somehow get access to this file, he can pose
as anybody. While this isn't a threat on its own (after all, why should the bad guy have
access to the password file in the first place?), it can magnify the effects of an existing
security lapse.
UNIX introduced a clever fix to this problem that has since been almost universally
copied. Use some hash function f and, instead of storing password, store f(password). The
hash function should have two properties: like any hash function, it should generate all
possible result values with roughly equal probability, and in addition it should be very
hard to invert--that is, given f(password), it should be hard to recover password. Functions
with these properties (one-way hash functions) are readily available. When a client sends
his password, the server applies f to it and compares the result with the value stored in the
password file. Since only f(password) is stored in the password file, nobody can find out
the password for a given user, even with full access to the password file, and logging in
requires knowing password, not f(password). In fact, this technique was long considered
so secure that the password file was customarily made publicly readable (modern systems
instead keep the hashes in a protected ``shadow'' file and add a per-user salt before
hashing). A small code sketch of this scheme appears after this list.
• Wire tapping. If the bad guy can somehow intercept the information sent from the client
to the server, password-based authentication breaks down altogether. It is increasingly the
case that authentication occurs over an insecure channel such as a dial-up line or a
local-area network. Note that the UNIX scheme of storing f(password) is of no help here,
since the password is sent in its original form (``plaintext'' in the jargon of encryption)
from the client to the server. We will consider this problem in more detail below.
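To make the hashed-password idea concrete, here is a minimal sketch in Python of how a server could
store and check f(password). The function names and the in-memory ``password file'' are purely
illustrative; real systems also add a per-user salt and use a deliberately slow password hash rather
than a bare SHA-256.

import hashlib

# Illustrative in-memory "password file": user name -> f(password).
password_file = {}

def f(password: str) -> str:
    """One-way hash: easy to compute, hard to invert."""
    return hashlib.sha256(password.encode()).hexdigest()

def set_password(user: str, password: str) -> None:
    # Only the hash is stored; the cleartext password is never kept.
    password_file[user] = f(password)

def check_login(user: str, password: str) -> bool:
    # Hash the supplied password and compare it with the stored hash.
    return password_file.get(user) == f(password)

set_password("solomon", "correct horse battery staple")
print(check_login("solomon", "correct horse battery staple"))  # True
print(check_login("solomon", "guess"))                         # False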
Spoofing.
This is the worst threat of all. How does the client know that the server is who it appears to be? If
the bad guy can pose as the server, he can trick the client into divulging his password. We saw a
form of this attack above. It would seem that the server needs to authenticate itself to the client
before the client can authenticate itself to the server. Clearly, there's a chicken-and-egg problem
here. Fortunately, there's a very clever and general solution to this problem.
Challenge-response.
There is a wide variety of authentication protocols, but they are all based on a simple idea. As
before, we assume that there is a password known to both the (true) client and the (true) server.
Authentication is a four-step process.
1. The client sends a message to the server saying who he claims to be and requesting
authentication.
2. The server sends a challenge to the client consisting of some random value x.
3. The client computes g(password,x) and sends it back as the response. Here g is a hash
function similar to the function f above, except that it has two arguments. It should have
the property that it is essentially impossible to figure out password even if you know both
x and g(password,x).
4. The server also computes g(password,x) and compares it with the response it got from the
client.
Clearly this algorithm works if both the client and server are legitimate. An eavesdropper could
learn the user's name, x and g(password,x), but that wouldn't help him pose as the user. If he
tried to authenticate himself to the server he would get a different challenge x', and would have
no way to respond. Even a bogus server is no threat. The exchange provides it with no useful
information. Similarly, a bogus client does no harm to a legitimate server except for tying him up
in a useless exchange (a denial-of-service problem!).
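Here is a minimal Python sketch of the protocol, using an HMAC over the challenge as the
two-argument function g; the names and the shared password are illustrative only, not taken from
any particular system.

import hmac
import hashlib
import secrets

def g(password: str, challenge: bytes) -> str:
    """Two-argument hash: infeasible to recover password from (x, g(password, x))."""
    return hmac.new(password.encode(), challenge, hashlib.sha256).hexdigest()

# Shared secret known to the true client and the true server.
PASSWORD = "squeamish ossifrage"

# Step 1: the client announces who it claims to be (omitted here).
# Step 2: the server sends a fresh random challenge x.
x = secrets.token_bytes(16)

# Step 3: the client computes the response from the password and the challenge.
response = g(PASSWORD, x)

# Step 4: the server recomputes g(password, x) and compares, in constant time.
server_value = g(PASSWORD, x)
print(hmac.compare_digest(response, server_value))   # True for a legitimate client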
Protection Mechanisms
Before looking at the protection mechanisms, let's have a look at some terminology:
objects
The things to which we wish to control access. They include physical (hardware) objects
as well as software objects such as files, databases, semaphores, or processes. As in
object-oriented programming, each object has a type and supports certain operations as
defined by its type. In simple protection systems, the set of operations is quite limited:
read, write, and perhaps execute, append, and a few others. Fancier protection systems
support a wider variety of types and operations, perhaps allowing new types and
operations to be dynamically defined.
principals
Intuitively, ``users''--the ones who do things to objects. Principals might be individual
persons, groups or projects, or roles, such as ``administrator.'' Often each process is
associated with a particular principal, the owner of the process.
rights
Permissions to invoke operations. Each right is the permission for a particular principal to
perform a particular operation on a particular object. For example, principal solomon
might have read rights for a particular file object.
domains
Sets of rights. Domains may overlap. Domains are a form of indirection, making it easier
to make wholesale changes to the access environment of a process. There may be three
levels of indirection: A principal owns a particular process, which is in a particular
domain, which contains a set of rights, such as the right to modify a particular file.
Conceptually, the protection state of a system is defined by an access matrix. The rows
correspond to principals (or domains), the columns correspond to objects, and each cell is a set of
rights. For example, if access[solomon]["/tmp/foo"] = { read, write }, then principal solomon has read
and write access to the file "/tmp/foo". We say ``conceptually'' because the access matrix is never
actually stored anywhere. It is very large and has a great deal of redundancy (for example, a typical
principal's rights to the vast majority of objects are exactly the same: none!), so there are much
more compact ways to represent it. The
access information is represented in one of two ways, by columns, which are called access
control lists (ACLs), and by rows, called capability lists.
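To make the two representations concrete, here is a minimal Python sketch; the principals, objects,
and rights shown are invented for illustration.

# The conceptual access matrix, stored sparsely in two equivalent ways.

# By column: one access control list (ACL) per object.
acls = {
    "/tmp/foo":  {"solomon": {"read", "write"}, "anyuser": {"read"}},
    "/etc/motd": {"anyuser": {"read"}},
}

# By row: one capability list per principal.
capabilities = {
    "solomon": {"/tmp/foo": {"read", "write"}, "/etc/motd": {"read"}},
    "anyuser": {"/tmp/foo": {"read"}, "/etc/motd": {"read"}},
}

def allowed_by_acl(principal: str, obj: str, right: str) -> bool:
    # ACL check: look up the object, then the principal's entry in its list.
    return right in acls.get(obj, {}).get(principal, set())

def allowed_by_capability(principal: str, obj: str, right: str) -> bool:
    # Capability check: look up the principal, then the rights it holds for the object.
    return right in capabilities.get(principal, {}).get(obj, set())

print(allowed_by_acl("solomon", "/tmp/foo", "write"))        # True
print(allowed_by_capability("anyuser", "/tmp/foo", "write")) # False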
8.2 Access Control Lists
An ACL (pronounced ``ackle'') is a list of rights associated with an object. A good example of
the use of ACLs is the Andrew File System (AFS) originally created at Carnegie-Mellon
University and later marketed by Transarc Corporation as an add-on to UNIX. AFS
associates an ACL with each directory, but the ACL also defines the rights for all the files in the
directory (in effect, they all share the same ACL). You can list the ACL of a directory with the fs
listacl command:
% fs listacl /u/c/s/cs537-1/public
Access list for /u/c/s/cs537-1/public is
Normal rights:
system:administrators rlidwka
system:anyuser rl
solomon rlidwka
The entry system:anyuser rl means that the principal system:anyuser (which represents the role
``anybody at all'') has rights r (read files in the directory) and l (list the files in the directory and
read their attributes). The entry solomon rlidwka means that the principal solomon has all seven
rights supported by AFS. In addition to r and l, they include the rights to insert new files into the
directory (i.e., create files), delete files, write files, lock files, and administer the ACL itself. This
last right is very powerful: it allows the holder to add, delete, or modify ACL entries, and thus to
grant or deny any rights to this directory to anybody. The remaining entry in the list shows that
the principal system:administrators has the same rights as solomon, namely all rights. This principal is
the name of a group of other principals. The command pts membership system:administrators lists
the members of the group.
Ordinary UNIX also uses an ACL scheme to control access to files, but in a much stripped-down
form. Each process is associated with a user identifier (uid) and a group identifier (gid), each of
which is a 16-bit unsigned integer. The inode of each file also contains a uid and a gid, as well as
a nine-bit protection mask, called the mode of the file. The mask is composed of three groups of
three bits. The first group indicates the rights of the owner: one bit each for read access, write
access, and execute access (the right to run the file as a program). The second group similarly
lists the rights of the file's group, and the remaining three bits indicate the rights of
everybody else. For example, the mode 111 101 101 (0755 in octal) means that the owner can
read, write, and execute the file, while members of the owning group and others can read and
execute, but not write, the file. Programs that print the mode usually use the characters r, w, and x
rather than 0 and 1. Each zero in the binary value is represented by a dash, and each 1 is
represented by r, w, or x, depending on its position. For example, the mode 111101101 is printed
as rwxr-xr-x.
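As a small illustration (the helper function below is written just for this text, not part of any
UNIX API), a Python sketch that converts a nine-bit mode into the familiar rwx notation:

def mode_string(mode: int) -> str:
    """Render a nine-bit UNIX protection mask as rwxrwxrwx notation."""
    out = []
    # Walk the bits from the owner's read bit (bit 8) down to the others' execute bit (bit 0).
    for shift, letter in zip(range(8, -1, -1), "rwxrwxrwx"):
        out.append(letter if mode & (1 << shift) else "-")
    return "".join(out)

print(mode_string(0o755))  # rwxr-xr-x
print(mode_string(0o046))  # ---r--rw-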
In somewhat more detail, the access-checking algorithm is as follows: The first three bits are
checked to determine whether an operation is allowed if the uid of the file matches the uid of the
process trying to access it. Otherwise, if the gid of the file matches the gid of the process, the
second three bits are checked. If neither ID matches, the last three bits are used. The code
might look something like this.
boolean accessOK(Process p, Inode i, int operation) {
    int mode;
    if (p.uid == i.uid)
        mode = i.mode >> 6;      // owner: use the high-order three bits
    else if (p.gid == i.gid)
        mode = i.mode >> 3;      // group: use the middle three bits
    else
        mode = i.mode;           // everybody else: use the low-order three bits
    switch (operation) {
        case READ:    mode &= 4; break;   // keep only the r bit
        case WRITE:   mode &= 2; break;   // keep only the w bit
        case EXECUTE: mode &= 1; break;   // keep only the x bit
    }
    return (mode != 0);
}
(The expression i.mode >> 3 denotes the value i.mode shifted right by three bit positions and
the operation mode &= 4 clears all but the third bit from the right of mode.) Note that this
scheme can actually give a random user more powers over the file than its owner. For example,
the mode ---r--rw- (000 100 110 in binary) means that the owner cannot access the file at all,
while members of the group can only read the file, and others can both read and write. On the
other hand, the owner of the file (and only the owner) can execute the chmod system call, which
changes the mode bits to any desired value. When a new file is created, it gets the uid and gid of
the process that created it, and a mode supplied as an argument to the creat system call.
Most modern versions of UNIX actually implement a slightly more flexible scheme for groups.
A process has a set of gids, and the check of whether the file is in the process's group tests
whether any of the process's gids match the file's gid.
boolean accessOK(Process p, Inode i, int operation) {
    int mode;
    if (p.uid == i.uid)
        mode = i.mode >> 6;
    else if (p.gidSet.contains(i.gid))   // the only change: match against the set of gids
        mode = i.mode >> 3;
    else
        mode = i.mode;
    switch (operation) {
        case READ:    mode &= 4; break;
        case WRITE:   mode &= 2; break;
        case EXECUTE: mode &= 1; break;
    }
    return (mode != 0);
}
When a new file is created, it gets the uid of the process that created it and the gid of the
containing directory. There are system calls to change the uid or gid of a file. For obvious
security reasons, these operations are highly restricted. Some versions of UNIX only allow the
owner of the file to change its gid, only allow him to change it to one of his own gids, and don't
allow him to change the uid at all.
For directories, ``execute'' permission is interpreted as the right to search the directory (that is,
to access the files in it by name). Write permission is required to create or delete files in the
directory. This rule leads to
the surprising result that you might not have permission to modify a file, yet be able to delete it
and replace it with another file of the same name but with different contents!
UNIX has another very clever feature--so clever that it is patented! The file mode actually has a
few more bits that we have not mentioned. One of them is the so-called setuid bit. If a process
executes a program stored in a file with the setuid bit set, the uid of the process is set equal to the
uid of the file. This rather curious rule turns out to be a very powerful feature, allowing the
simple rwx permissions directly supported by UNIX to be used to define arbitrarily complicated
protection policies.
As an example, suppose you wanted to implement a mail system that works by putting all mail
messages into one big file, say /usr/spool/mbox. Each user should be able to read only those
messages that mention him in the To: or Cc: fields of the header. Here's how to use the setuid feature to
implement this policy. Define a new uid mail, make it the owner of /usr/spool/mbox, and set the
mode of the file to rw------- (i.e., the owner mail can read and write the file, but nobody else has
any access to it). Write a program for reading mail, say /usr/bin/readmail. This file is also owned
by mail and has mode srwxr-xr-x. The `s' means that the setuid bit is set. Any user's process can execute
this program (because the ``execute by anybody'' bit is on), and when it does, it suddenly
changes its uid to mail so that it has complete access to /usr/spool/mbox. At first glance, it would
seem that letting a process pretend to be owned by another user would be a big security hole,
but it isn't, because processes don't have free will. They can only do what the program tells them
to do. While a process is running readmail, it is following instructions written by the designer
of the mail system, so it is safe to let it have access appropriate to the mail system. There's one
more feature that helps readmail do its job. A process really has two uid's, called the effective uid
and the real uid. When a process executes a setuid program, its effective uid changes to the uid
of the program, but its real uid remains unchanged. It is the effective uid that is used to determine
what rights it has to what files, but there is a system call to find out the real uid of the current
process. Readmail can use this system call to find out what user called it, and then only show the
appropriate messages.
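On a UNIX-like system the two identities can be inspected directly; the short Python sketch below is
only illustrative and assumes it is run as an ordinary (non-setuid) process.

import os

# The real uid identifies the user who started the process;
# the effective uid is what the kernel uses for access checks.
real_uid = os.getuid()
effective_uid = os.geteuid()

print("real uid:", real_uid)
print("effective uid:", effective_uid)

# In a setuid program the two differ: the effective uid is the file owner's,
# while the real uid still names the invoking user. A readmail-style program
# would use the real uid to decide which messages to show.
if real_uid == effective_uid:
    print("not running setuid: both ids name the invoking user")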
Capabilities
Capabilities are an alternative to ACLs. A capability is a ``protected pointer'' to an object. It
designates an object and also contains a set of permitted operations on the object. For example,
one capability may permit reading from a particular file, while another allows both reading and
writing. To perform an operation on an object, a process makes a system call, presenting a
capability that points to the object and permits the desired operation. For capabilities to work as a
protection mechanism, the system has to ensure that processes cannot mess with their contents.
There are three distinct ways to ensure the integrity of a capability.
Tagged architecture. Some computers associate a tag bit with each word of memory, marking
the word as a capability word or a data word. The hardware checks that capability words are
only assigned from other capability words. To create or modify a capability, a process has to
make a kernel call.
Separate capability segments. If the hardware does not support tagging individual words, the OS
can protect capabilities by putting them in a separate segment and using the protection features
that control access to segments.
8.3 Cryptography
The third way to protect the integrity of a capability is cryptographic. Each capability can be
extended with a cryptographic checksum that is computed from the rest of the content of the
capability and a secret key. If a process modifies a capability, it cannot fix up the checksum to
match without access to the key. Only the kernel knows the key. Each time a process presents a
capability to the kernel to invoke an operation, the kernel checks the checksum to make sure the
capability hasn't been tampered with.
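A minimal Python sketch of the idea, with the capability format and the kernel's secret key invented
purely for illustration:

import hmac
import hashlib

KERNEL_KEY = b"known only to the kernel"   # illustrative secret

def seal(object_id: str, rights: str) -> tuple:
    """Build a capability: (object, rights, checksum over both under the kernel key)."""
    tag = hmac.new(KERNEL_KEY, f"{object_id}|{rights}".encode(), hashlib.sha256).hexdigest()
    return (object_id, rights, tag)

def verify(capability: tuple) -> bool:
    """Reject any capability whose contents were modified outside the kernel."""
    object_id, rights, tag = capability
    expected = hmac.new(KERNEL_KEY, f"{object_id}|{rights}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)

cap = seal("/tmp/foo", "r")
print(verify(cap))                         # True: untampered
forged = (cap[0], "rw", cap[2])            # a user process tries to add write rights
print(verify(forged))                      # False: checksum no longer matches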
Capabilities, like segments, are a ``good idea'' that somehow seldom seems to be implemented in
real systems in full generality. Like segments, capabilities show up in an abbreviated form in
many systems. For example, the file descriptor for an open file in UNIX is a kind of capability.
When a process tries to open a file for writing, the system checks the file's ACL to see whether
the access is permitted. If it is, the process gets a file descriptor for the open file, which is a sort
of capability to the file that permits write operations. UNIX uses the separate segment approach
to protect the capability. The capability itself is stored in a table in the kernel and the process has
only an indirect reference to it (the index of the slot in the table). File descriptors are not
full-fledged capabilities, however. For example, they cannot be stored in files, because they go away
when the process terminates.
8.4 Exercises
1. What are the main differences between capability lists and access lists?
Answer:
An access list is a list for each object consisting of the domains with a nonempty set of access
rights for that object. A capability list is a list of objects and the operations allowed on those objects for
each domain.
2. What protection problems may arise if a shared stack is used for parameter passing?
Answer:
The contents of the stack could be compromised by other processes sharing the stack.
3. Consider a computing environment where a unique number is associated with each process
and each object in the system. Suppose that we allow a process with number n to access an
object with number m only if n > m. What type of protection structure do we have?
Answer:
We have a hierarchical protection structure.
4. Consider a computing environment where a process is given the privilege of accessing an
object for only n times. Suggest a scheme for implementing this policy.
Answer:
Associate an integer counter with the capability; decrement it on each access and revoke the
capability when the counter reaches zero.
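A minimal sketch of such a counted capability in Python (all names are invented for illustration):

class CountedCapability:
    """Capability that permits at most n uses of an object."""

    def __init__(self, obj, n: int):
        self.obj = obj
        self.remaining = n          # uses left before the capability is revoked

    def use(self):
        if self.remaining <= 0:
            raise PermissionError("capability exhausted")
        self.remaining -= 1         # charge one access
        return self.obj             # hand back the protected object

cap = CountedCapability("/tmp/foo", n=2)
cap.use()
cap.use()
# cap.use()  # a third access would raise PermissionError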
5. Why is it difficult to protect a system in which users are allowed to do their own I/O?
Answer:
In earlier chapters we identified a distinction between kernel and user mode where kernel
mode is used for carrying out privileged operations such as I/O. One reason why I/O must be
performed in kernel mode is that I/O requires accessing the hardware and proper access to
the hardware is necessary for system integrity. If we allow users to perform their own I/O, we
cannot guarantee system integrity.
6. Capability lists are usually kept within the address space of the user. How does the system
ensure that the user cannot modify the contents of the list?
Answer:
A capability list is considered a “protected object” and is accessed only indirectly by the user.
The operating system ensures the user cannot access the capability list directly.
Bibliography
1. Andrew S. Tanenbaum, "Modern Operating Systems", 2nd Edition, Prentice Hall
2. Avi Silberschatz, Peter Baer Galvin & Greg Gagne, "Operating System Concepts", 7th Edition, John Wiley & Sons
3. E. W. Dijkstra, Dijkstra Algorithm, 1965
4. A. N. Habermann, Execution Complexity, 1969
5. Silberschatz & Galvin, "Operating System Concepts", 6th Edition, Addison-Wesley
6. William Stallings, "Operating Systems", 4th Edition, Prentice Hall