I/O and File Systems
CS 519: Operating System Theory
Computer Science, Rutgers University
Instructor: Thu D. Nguyen
TA: Xiaoyan Li
Spring 2002
I/O Devices
So far we have talked about how to abstract and
manage CPU and memory
Computation “inside” computer is useful only if some
results are communicated “outside” of the computer
I/O devices are the computer’s interface to the
outside world (I/O = Input/Output)
Example devices: display, keyboard, mouse, speakers, network
interface, and disk
Computer Science, Rutgers
2
CS 519: Operating System Theory
Basic Computer Structure
[Diagram: the CPU and memory sit on the memory bus (system bus); a bridge connects it to the I/O bus, which hosts devices such as the NIC and the disk.]
Intel SR440BX Motherboard
[Diagram: the CPU connects through the system bus & MMU/AGP/PCI controller to an I/O bus holding the IDE disk controller and USB controller; another I/O bus hosts the serial & parallel ports and the keyboard & mouse.]
OS: Abstractions and Access Methods
The OS must virtualize a wide range of devices into a few
simple abstractions:
Storage
Hard drives, Tapes, CDROM
Networking
Ethernet, radio, serial line
Multimedia
DVD, Camera, microphones
Operating system should provide consistent methods to
access the abstractions
Otherwise, programming is too hard
User/OS method interface
The same interface is used to access devices (like disks and
network lines) and more abstract resources like files
4 main methods:
open()
close()
read()
write()
Semantics depend on the type of the device (block, char, net)
These methods are system calls because they are the methods the
OS provides to all processes.
Unix I/O Methods
fileHandle = open(pathName, flags, mode)
a file handle is a small integer, valid only within a single process, to
operate on the device or file
pathname: a name in the file system. In Unix, devices are put under
/dev. E.g., /dev/ttya is the first serial port, /dev/sda the first SCSI
drive
flags: blocking or non-blocking …
mode: read only, read/write, append …
errorCode = close(fileHandle)
Kernel will free the data structures associated with the device
Unix I/O Methods
byteCount = read(fileHandle, byte [] buf, count)
read at most count bytes from the device and put them in the
byte buffer buf. Bytes placed from 0th byte.
The kernel can give the process fewer bytes than requested; the user
process must check the byteCount to see how many were actually returned.
A negative byteCount signals an error (value is the error type)
byteCount = write(fileHandle, byte [] buf, count)
write at most count bytes from the buffer buf
actual number written returned in byteCount
A negative byteCount signals an error
Unix I/O Example
What’s the correct way to write 1000 bytes?
calling:
ignoreMe = write(fileH, buffer, 1000);
works most of the time. What happens if:
can’t accept 1000 bytes right now?
disk is full?
someone just deleted the file?
Let’s work this out on the board
I/O semantics
From this basic interface, there are three different dimensions to
how I/O is processed:
blocking vs. non-blocking
buffered vs. unbuffered
synchronous vs. asynchronous
The O/S tries to support as many of these dimensions
as possible for each device
The semantics are specified during the open() system
call
Blocking vs. Non-Blocking I/O
Blocking: process is suspended until all bytes in the count field
are read or written
E.g., for a network device, if the user wrote 1000 bytes, then the
operating system would write the bytes to the device one at a time
until the write() completed.
+ Easy to use and understand
- if the device just can’t perform the operation (e.g., you unplug the
cable), what to do? Give up and return the number of bytes successfully transferred.
Nonblocking: the OS only reads or writes as many bytes as is
possible without suspending the process
+ Returns quickly
- more work for the programmer (but necessary for robust programs)
Buffered vs. Unbuffered I/O
Sometimes we want the ease of programming of blocking I/O
without the long waits when the buffers on the device are small.
buffered I/O allows the kernel to make a copy of the data
write() side: allows the process to write() many bytes and continue
processing
read() side: As device signals data is ready, kernel places data in
the buffer. When the process calls read(), the kernel just copies
the buffer.
Why not use buffered I/O?
- Extra copy overhead
- Delays sending data
Synchronous vs. Asynchronous I/O
Synchronous I/O: the user process does not run at the
same time the I/O does --- it is suspended during I/O
So far, all the methods presented have been
synchronous.
Asynchronous I/O: The read() or write() call returns a
small object instead of a count.
separate set of methods in unix: aio_read(), aio_write()
The user can call methods on the returned object to
check “how much” of the I/O has completed
The user can also allow a signal handler to run when the
I/O has completed.
Handling Multiple I/O streams
If we use blocking I/O, how do we handle > 1 device at a time?
Example: a small program that reads from a serial line and outputs to a
tape, and is also reading from the tape and outputting to the serial line
Structure the code like this?
while (TRUE) {
read(tape); // block until tape is ready
write(serial line);// send data to serial line
read(serial line);
write(tape);
}
Could use non-blocking I/O, but huge waste of cycles if data is not
ready.
Solution: select system call
totalFds = select(nfds, readfds, writefds, errorfds, timeout);
nfds: check file descriptors in the range [0, nfds)
readfds: bit map of fileHandles. The user sets bit X to ask the
kernel to check whether fileHandle X is ready for reading. The kernel
returns a 1 if data can be read on that fileHandle.
writefds: bit map of fileHandles to check for writing. Operates the
same as the read bitmap.
errorfds: bit map to check for errors.
timeout: how long to wait before un-suspending the process
totalFds = number of set bits, negative number is an error
Three Device Types
Most operating systems have three device types:
Character devices
Used for serial-line types of devices (e.g. USB port)
Network devices
Used for network interfaces (E.g. Ethernet card)
Block devices:
Used for mass-storage (E.g. disks and CDROM)
What you can expect from the read/write methods
changes with each device type
Character Devices
Device is represented by the OS as an ordered stream
of bytes
bytes sent out the device by the write() system call
bytes read from the device by the read() system call
Byte stream has no “start”; just open it and start reading/writing
The user has no control over the read/write ratio: the sender
process might issue a single write() of 1000 bytes, while the
receiver may have to make 1000 read() calls, each receiving 1 byte.
Network Devices
Like character devices, but each write() call either sends the
entire block (packet), up to some maximum fixed size,
or none.
On the receiver, the read() call returns all the bytes in
the block, or none.
Block Devices
OS presents device as a large array of blocks
Each block has a fixed size (1KB - 8KB is typical)
User can read/write only in fixed sized blocks
Unlike other devices, block devices support random
access
We can read or write anywhere in the device without
having to ‘read all the bytes first’
But how to specify the nth block with current
interface?
The file pointer
The O/S adds a concept called the file pointer
A file pointer is associated with each open file, if the
device is a block device
the next read or write operates at the position in the
device pointed to by the file pointer
The file pointer is measured in bytes, not blocks
Seeking in a device
To set the file pointer:
absoluteOffset = lseek(fileHandle, offset, whence);
whence specifies if the offset is absolute, from byte 0,
or relative to the current file pointer position
the absolute offset is returned; negative numbers
signal error codes
For devices, the offset should be an integral multiple of the
block size.
How could you tell what the current position of the file
pointer is without changing it?
Block Device Example
You want to read the 10th block of a disk
each disk block is 2048 bytes long
fh = open("/dev/sda", O_RDONLY);
if (fh < 0) error;
pos = lseek(fh, 9 * blockSize, SEEK_SET); /* skip blocks 1-9 */
if (pos < 0) error;
bytesRead = read(fh, buf, blockSize);
if (bytesRead < 0) error;
…
Getting and setting device specific
information
Unix has an IO ConTroL system call:
ErrorCode = ioctl(fileHandle, int request, object);
request is a numeric command to the device
can also pass an optional, arbitrary object to a device
the meaning of the command and the type of the object
are device specific
Communication Between CPU and
I/O Devices
How does the CPU communicate with I/O devices?
Memory-mapped communication
Each I/O device assigned a portion of the physical address space
CPU → I/O device
CPU writes to locations in this area to "talk" to I/O device
I/O device → CPU
Polling: CPU repeatedly checks location(s) in portion of address space
assigned to device
Interrupt: Device sends an interrupt (on an interrupt line) to get the
attention of the CPU
Programmed I/O, Interrupt-Driven, Direct Memory Access
PIO and interrupt-driven I/O = word at a time
DMA = block at a time
Programmed I/O vs. DMA
Programmed I/O is ok for sending commands, receiving
status, and communication of a small amount of data
Inefficient for large amount of data
Keeps CPU busy during the transfer
Programmed I/O → memory operations → slow
Direct Memory Access
Device reads/writes directly from/to memory
Memory → device typically initiated from CPU
Device → memory can be initiated by either the device or the
CPU
Programmed I/O vs. DMA
[Diagram: three configurations, each with CPU, memory, and disk on an interconnect: programmed I/O (data moves through the CPU), DMA, and DMA with device → memory transfers initiated by the device. Problems?]
Direct Memory Access
Used to avoid programmed I/O for large data
movement
Requires DMA controller
Bypasses CPU to transfer data directly between I/O
device and memory
Six step process to perform DMA
transfer
Performance
I/O is a major factor in system performance
Demands that the CPU execute device driver and kernel I/O code
Context switches due to interrupts
Data copying
Network traffic especially stressful
Improving Performance
Reduce number of context switches
Reduce data copying
Reduce interrupts by using large transfers, smart
controllers, polling
Use DMA
Balance CPU, memory, bus, and I/O performance for
highest throughput
Device Driver
OS module controlling an I/O device
Hides the device specifics from the above layers in the kernel
Supporting a common API
UNIX: block or character device
Block: device communicates with the CPU/memory in fixed-size blocks
Character/Stream: stream of bytes
Translates logical I/O into device I/O
E.g., logical disk blocks into {head, track, sector}
Performs data buffering and scheduling of I/O operations
Structure
Several synchronous entry points: device initialization, queue I/O requests,
state control, read/write
An asynchronous entry point to handle interrupts
Some Common Entry Points for UNIX
Device Drivers
Attach: attach a new device to the system.
Close: note the device is not in use.
Halt: prepare for system shutdown.
Init: initialize driver globals at load or boot time.
Intr: handle device interrupt (not used).
Ioctl: implement control operations.
Mmap: implement memory-mapping (SVR4).
Open: connect a process to a device.
Read: character-mode input.
Size: return logical size of block device.
Start: initialize driver at load or boot time.
Write: character-mode output.
User to Driver Control Flow
[Diagram: read, write, and ioctl calls cross from user to kernel. An ordinary file goes through the file system and the buffer cache to a block device via driver_strategy; a special file goes to a character device through the character queue via driver_read/write.]
Buffer Cache
When an I/O request is made for a block, the buffer
cache is checked first
If block is missing from the cache, it is read into the
buffer cache from the device
Exploits locality of reference as any other cache
Replacement policies similar to those for VM
UNIX
Historically, UNIX has a buffer cache for the disk which does
not share buffers with character/stream devices
Adds overhead in a path that has become increasingly common:
disk → NIC
Disks
[Diagram: disk platter showing tracks and sectors.]
Seek time: time to move the
disk head to the desired
track
Rotational delay: time to
reach desired sector once
head is over the desired
track
Transfer rate: rate data
read/write to disk
Some typical parameters:
Seek: ~10-15ms
Rotational delay: ~4.15ms
for 7200 rpm
Transfer rate: 30 MB/s
Disk Scheduling
Disks are at least four orders of magnitude slower than
main memory
The performance of disk I/O is vital for the performance of
the computer system as a whole
Access time (seek time+ rotational delay) >> transfer time for a
sector
Therefore the order in which sectors are read matters a lot
Disk scheduling
Usually based on the position of the requested sector rather
than according to the process priority
Possibly reorder stream of read/write request to improve
performance
Disk Scheduling Policies
Shortest-service-time-first (SSTF): pick the request that
requires the least movement of the head
SCAN (back and forth over disk): good service distribution
C-SCAN (one way with fast return): lower service variability
Problem with SSTF, SCAN, and C-SCAN: arm may not move for
long time (due to rapid-fire accesses to same track)
N-step SCAN: scan of N records at a time by breaking the request
queue in segments of size at most N and cycling through them
FSCAN: uses two sub-queues, during a scan one queue is consumed
while the other one is produced
Disk Management
Low-level formatting, or physical formatting — Dividing a disk into
sectors that the disk controller can read and write.
To use a disk to hold files, the operating system still needs to
record its own data structures on the disk.
Partition the disk into one or more groups of cylinders.
Logical formatting or “making a file system”.
Boot block initializes system.
The bootstrap is stored in ROM.
Bootstrap loader program.
Methods such as sector sparing used to handle bad blocks.
Swap-Space Management
Swap-space — Virtual memory uses disk space as an
extension of main memory.
Swap-space can be carved out of the normal file
system, or, more commonly, it can be in a separate disk
partition.
Swap-space management
4.3BSD allocates swap space when process starts; holds text
segment (the program) and data segment.
Kernel uses swap maps to track swap-space use.
Solaris 2 allocates swap space only when a page is forced out
of physical memory, not when the virtual memory page is first
created.
Disk Reliability
Several improvements in disk-use techniques involve the
use of multiple disks working cooperatively.
RAID is one important technique currently in common
use.
RAID
Redundant Array of Inexpensive Disks (RAID)
A set of physical disk drives viewed by the OS as a single
logical drive
Replace large-capacity disks with multiple smaller-capacity
drives to improve the I/O performance (at lower price)
Data are distributed across physical drives in a way that
enables simultaneous access to data from multiple drives
Redundant disk capacity is used to compensate for the
increase in the probability of failure due to multiple drives
Improve availability because no single point of failure
Six levels of RAID representing different design
alternatives
RAID Level 0
Does not include redundancy
Data is striped across the available disks
Total storage space across all disks is divided into strips
Strips are mapped round-robin to consecutive disks
A set of consecutive strips that maps exactly one strip to each disk in the
array is called a stripe
Can you see how this improves the disk I/O bandwidth?
What access pattern gives the best performance?
[Diagram: four disks; strips 0-3 are mapped round-robin across them to form stripe 0, strips 4-7 form stripe 1, and so on.]
RAID Level 1
Redundancy achieved by duplicating all the data
Every disk has a mirror disk that stores exactly the same data
A read can be serviced by either of the two disks which contains the
requested data (improved performance over RAID 0 if reads dominate)
A write request must be done on both disks but can be done in parallel
Recovery is simple but cost is high
[Diagram: four disks arranged as two mirrored pairs; each strip is stored on one disk and duplicated on its mirror.]
RAID Levels 2 and 3
Parallel access: all disks participate in every I/O request
Small strips since size of each read/write = # of disks * strip size
RAID 2: error correcting code is calculated across corresponding bits on
each data disk and stored on log(# data disks) parity disks
Hamming code: can correct single-bit errors and detect double-bit errors
Less expensive than RAID 1 but still pretty high overhead – not really needed
in most reasonable environments
RAID 3: a single redundant disk that keeps parity bits
P(i) = X2(i) ⊕ X1(i) ⊕ X0(i)
In the event of a failure, data can be reconstructed
Can only tolerate a single failure at a time
[Diagram: bit-interleaved data disks b0, b1, b2 with a parity disk P(b).]
X2(i) = P(i) ⊕ X1(i) ⊕ X0(i)
RAID Levels 4 and 5
RAID 4
Large strips with a parity strip like RAID 3
Independent access - each disk operates independently, so multiple I/O
request can be satisfied in parallel
Independent access → small write = 2 reads + 2 writes
Example: if a write is performed only on strip 0:
P’(i) = X2(i) ⊕ X1(i) ⊕ X0’(i)
= X2(i) ⊕ X1(i) ⊕ X0’(i) ⊕ X0(i) ⊕ X0(i)
= P(i) ⊕ X0’(i) ⊕ X0(i)
Parity disk can become bottleneck
[Diagram: four disks holding strips 0-5, with parity strips P(0-2) and P(3-5) on a dedicated parity disk.]
RAID 5
Like RAID 4 but parity strips are distributed across all disks
File System
File system is an abstraction of the disk
File → Track/sector
To a user process
A file looks like a contiguous block of bytes (Unix)
A file system provides a coherent view of a group of files
A file system provides protection
API: create, open, delete, read, write files
Performance: throughput vs. response time
Reliability: minimize the potential for lost or destroyed data
E.g., RAID could be implemented in the OS as part of the disk device
driver
File API
To read or write, need to open
open() returns a handle to the opened file
OS associates a data structure with the handle
Data structure maintains current “cursor” position in the
stream of bytes in the file
Read and write takes place from the current position
Can specify a different location explicitly
When done, should close the file
Files vs. Disk
[Diagram: how do files map onto disk blocks?]
Files vs. Disk
[Diagram: contiguous layout; each file occupies a run of consecutive disk blocks.]
What’s the problem with this mapping function?
What’s the potential benefit of this mapping function?
Files vs. Disk
[Diagram: a non-contiguous file-to-disk mapping.]
What’s the problem with this mapping function?
UNIX File
[Diagram: i-nodes map each UNIX file to its disk blocks.]
Defragmentation
Want link-based organization of disk blocks of a file
for efficient random access
Want sequential layout of disk blocks for efficient
sequential access
How to reconcile?
Defragmentation (cont’d)
Base structure is link-based
Optimize for sequential access
Defragmentation: move the blocks around to simulate actual
sequential layout of files
Group allocation of blocks: group tracks together (cylinder).
Try to allocate all blocks of a file from a single cylinder so that
they are close together
Free Space Management
No policy issues here – just mechanism
Bitmap: one bit for each block on the disk
Good to find a contiguous group of free blocks
Files are often accessed sequentially
Small enough to be kept in memory
Chained free portions: pointer to the next one
Not so good for sequential access
Index: treats free space as a file
File System
OK, we have files
How can we name them?
How can we organize them?
Tree-Structured Directories
File Naming
Each file has an associated human-readable name
e.g., usr, bin, mid-term.pdf, design.pdf
File name must be globally unique
Otherwise how would the system know which file we are
referring to?
OS must maintain a mapping between file names and the
set of blocks belonging to that file
Mappings are kept in directories
Acyclic-Graph Directories
Have shared subdirectories and files
General Graph Directory
Unix File System
Ordinary files (uninterpreted)
Directories
File of files
Organized as a rooted tree
Pathnames (relative and absolute)
Contains links to parent, itself
Multiple links to files can exist
Link - hard OR symbolic
Unix File Systems (Cont’d)
Tree-structured file
hierarchies
Mounted on existing space by
using mount
No links between different file
systems
File Naming
Each file has a unique name
User visible (external) name must be symbolic
In a hierarchical file system, unique external names are given as
pathnames (path from the root to the file)
Internal names: i-node in UNIX - an index into an array of file
descriptors/headers for a volume
Directory: translation from external to internal name
May have more than one external name for a single internal name
Information about file is split between the directory and the file
descriptor: name, type, size, location on disk, owner, permissions,
date created, date last modified, date last access, link count
Name Space
In UNIX, “devices are files”
E.g., /dev/cdrom, /dev/tape
User process accesses devices by accessing corresponding file
[Diagram: name-space tree rooted at / with subdirectories such as usr, A, B, C, D.]
File System Buffer Cache
application: read/write files
OS: translate file to disk blocks; maintains ...buffer cache...; controls disk accesses: read/write blocks
hardware:
Any problems?
File System Buffer Cache
Disks are “stable” while memory is volatile
What happens if you buffer a write and the machine crashes
before the write has been saved to disk?
Can use write-through but write performance will suffer
In UNIX
Use un-buffered I/O when writing i-nodes or pointer blocks
Use buffered I/O for other writes and force sync every 30 seconds
What about replacement?
How can we further improve performance?
Application-controlled caching
application: read/write files; replacement policy
OS: translate file to disk blocks; maintains ...buffer cache...; controls disk accesses: read/write blocks
hardware:
Application-Controlled File Caching
Two-level block replacement: responsibility is split
between kernel and user level
A global allocation policy performed by the kernel which
decides which process will give up a block
A block replacement policy decided by the user:
Kernel provides the candidate block as a hint to the process
The process can overrule the kernel’s choice by suggesting an
alternative block
The suggested block is replaced by the kernel
Example of an alternative replacement policy: most-recently used (MRU)
Sound kernel-user cooperation
Oblivious processes should do no worse than under LRU
Foolish processes should not hurt other processes
Smart processes should perform better than LRU
whenever possible and they should never perform worse
If kernel selects block A and user chooses B instead, the
kernel swaps the position of A and B in the LRU list and places
B in a “placeholder” which points to A (kernel’s choice)
If the user process misses on B (i.e. it made a bad choice), and
B is found in the placeholder, then the block pointed to by the
placeholder is chosen (prevents hurting other processes)
File System Consistency
File system almost always uses a buffer/disk cache for
performance reasons
Two copies of a disk block (buffer cache, disk) → consistency
problem if the system crashes before all the modified blocks are
written back to disk
This problem is critical especially for the blocks that contain
control information: i-node, free-list, directory blocks
Utility programs for checking block and directory consistency
Write critical blocks from the buffer cache to disk immediately
Data blocks are written to disk periodically: sync
More on File System Consistency
To maintain file system consistency the ordering of
updates from buffer cache to disk is critical
Example: if the directory block (which contains a pointer to the
i-node) is written back before the i-node and the system
crashes, the directory structure will be inconsistent
Similarly, if the free list is updated before the i-node and
the system crashes, the free list will be incorrect
A more elaborate solution: use dependencies between
blocks containing control data in the buffer cache to
specify the ordering of updates
Protection Mechanisms
Files are OS objects: unique names and a finite set of operations
that processes can perform on them
Protection domain is a set of {object,rights} where right is the
permission to perform one of the operations
At every instant in time, each process runs in some protection
domain
In Unix, a protection domain is {uid, gid}
Protection domain in Unix is switched when running a program with
SETUID/SETGID set or when the process enters the kernel mode
by issuing a system call
How to store all the protection domains?
Protection Mechanisms (cont’d)
Access Control List (ACL): associate with each object a
list of all the protection domains that may access the
object and how
In Unix ACL is reduced to three protection domains: owner,
group and others
Capability List (C-list): associate with each process a
list of objects that may be accessed along with the
operations
C-list implementation issues: where/how to store them
(hardware, kernel, encrypted in user space) and how to revoke
them
Log-Structured File System (LFS)
As memory gets larger, buffer cache size increases →
increases the fraction of read requests which can be
satisfied from the buffer cache with no disk access
In the future, most disk accesses will be writes
but writes are usually done in small chunks in most file
systems (control data, for instance) which makes the
file system highly inefficient
LFS idea: structure the entire disk as a log
Periodically, or when required, all the pending writes
being buffered in memory are collected and written as
a single contiguous segment at the end of the log
LFS segment
Contain i-nodes, directory blocks and data blocks, all
mixed together
Each segment starts with a segment summary
Segment size: 512 KB - 1MB
Two key issues:
How to retrieve information from the log?
How to manage the free space on disk?
File Location in LFS
The i-node contains the disk addresses of the file block
as in standard UNIX
But there is no fixed location for the i-node
An i-node map is used to maintain the current location
of each i-node
i-node map blocks can also be scattered but a fixed
checkpoint region on the disk identifies the location of
all the i-node map blocks
Usually i-node map blocks are cached in main memory
most of the time, thus disk accesses for them are rare
Segment Cleaning in LFS
LFS disk is divided into segments that are written sequentially
Live data must be copied out of a segment before the segment can
be re-written
The process of copying data out of a segment: cleaning
A separate cleaner thread moves along the log, removes old
segments from the end and puts live data into memory for
rewriting in the next segment
As a result, an LFS disk appears like a big circular buffer with the
writer thread adding new segments to the front and the cleaner
thread removing old segments from the end
Bookkeeping is not trivial: i-node must be updated when blocks are
moved to the current segment
LFS Performance
LFS Performance (Cont’d)
Elephant File System
How many times have you deleted something and then
wished you had not?
What about wishing you could find an old version?
Disks are getting big enough and cheap enough to do
something about these problems
Elephant maintains files’ histories
Nice clean idea
Names can be extended with time
Look for version close to that time
What Histories Are Used For
Undo
Ability to undo a recent change
Complete history is retained, but only for a limited period of time
(e.g., an hour, day, or week)
Long term history
Users may want to recover old versions
Typically think in terms of landmark versions à la source control
Group changes so don’t require as much space
Retention Policies
Keep One
Keep All
Keep Safe
Basically undo for some period of time
Keep Landmarks
Retain only landmark versions
User can specify any version as a landmark
Heuristic for grouping changes into landmark versions
Past a certain time, all changes made within some amount of time
(e.g., several minutes) are grouped into one landmark version
Keep track of when versions are merged
Implementation
Implementation fairly straightforward
One interesting point is cleaner needs to lock inode log
while cleaning
This prevents writes while the cleaner is cleaning a file
Must be fast (while holding the lock, anyway)
Different than LFS cleaner since does not need to coalesce
free space
Supports application guided retention
Performs virtually the same as FFS
Better for large copies
File System Profile
Keep One: 33.6% of files – 56.3% of bytes
Keep Safe: 3.9% of files – 28.5% of bytes
Keep Landmarks: 62.4% of files – 15.2% of bytes