UNIX Internals
(Focusing on the LINUX Kernel)
Contents
Part I. UNIX Operating System
1. Introduction
2. Process Management
3. Memory Management
4. File System
5. Synchronization & IPC
6. I/O System (Device Driver)
Part II. Detailed Study: LINUX Kernel Internals
1. Where is everything?


System call Implementation
Device Driver using Module Programming
2. Linux internals
References

U. Vahalia, “Unix Internals: The New Frontiers”, Prentice Hall, 1996.
H. M. Deitel, “Operating Systems”, 2nd edition, Addison-Wesley, 1990.
A. Silberschatz and P. Galvin, “Operating System Concepts”, 5th edition, Addison-Wesley, 1998.
M. Singhal and N. G. Shivaratri, “Advanced Concepts in Operating Systems”, McGraw-Hill, 1994.
M. J. Bach, “The Design of the UNIX Operating System”, Prentice Hall, 1986.
M. Beck et al., “Linux Kernel Internals”, 2nd edition, Addison-Wesley, 1997.
M. K. McKusick, K. Bostic, M. Karels and J. Quarterman, “The Design and Implementation of the 4.4BSD Operating System”, Addison-Wesley, 1996.
B. Goodheart and J. Cox, “The Magic Garden Explained”, Prentice Hall, 1994.
I. Introduction




What is UNIX Operating System?
Brief History
Kernel Architecture
Features of UNIX Operating System
4
What is UNIX Operating System?
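[Figure: onion-like layered view of UNIX. The hardware sits at the core and is wrapped by the kernel, which is wrapped in turn by utilities and applications: csh, vi, du, who, wc, telnet, ps, grep, sort, gcc, ls, a.out, X window, RDBMS, network and admin packages.]
 What is the similarity between an onion and UNIX?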
What is UNIX Operating System? (Cont`)
[Figure: block diagram of the UNIX kernel.
User level: user programs and libraries, which enter the kernel through a trap.
Kernel level: system call interface; file system management with the buffer cache; process management (IPC, context, memory management); device drivers; hardware control (interrupt handling, etc.).
HW level: hardware.]
(Source : The Design of the UNIX OS)
What is UNIX Operating System? (Cont`)

UNIX Operating System is a Resource Manager
 Physical resources: CPU, memory, disk, network, …
 Abstract resources: process, thread, page, file, inode, message, security, …

UNIX Operating System is a Computing Environment
 provides resource services to users
 through system calls and APIs
 each abstraction is just a set of data structures at kernel level
Brief History

Before UNIX
 Multics: 1965, AT&T (Bell Labs), General Electric, MIT

Epoch
 1969, Ken Thompson, “Space Travel” on PDP-7
 Dennis Ritchie
 s5fs, ed, shell (ancestor of the Bourne shell)
 1974, “The UNIX Time-Sharing System” in CACM

BSD
 Bill Joy, Chuck Haley (graduate students)
 ex, csh, paging-based virtual memory system, TCP/IP, ffs, socket
 1993, 4.4BSD (final version; afterwards BSDI, a company)

AT&T System V
 Version 1, 2, …, 7, System III, System V, …, SVR4.2/ESMP
 region-based virtual memory, IPC, remote file sharing, STREAMS
Brief History (Cont`)

Commercial UNIX
 XENIX (MS, SCO), SCO UNIX (SCO), AIX (IBM, journaling FS), HP-UX (HP), ULTRIX (DEC, the first MP support), OSF/1 (Digital), …
 SunOS (Sun Microsystems, VFS, NFS), Solaris, UnixWare (Novell)

Mach
 the first microkernel
 Chorus, Exokernel, SPIN, L4, …
 http://ssrnet.snu.ac.kr/~choijm/current_os.html

Standards
 SVID (System V Interface Definition), POSIX (IEEE), X/Open (Inc.)
 UI (Sun, AT&T : Solaris), OSF (OSF/1)

Linux
 Performance oriented
 Philosophy of COPYLEFT
Kernel Architecture

Monolithic Kernel
 traditional UNIX, SVR4, Solaris, Linux, ….
process
process
process
System Call
OS Functionality
Integrated Kernel
OS Personality
Hardware
10
Kernel Architecture (Cont`)

Monolithic Kernel
process
process
read()
fork()
System Call
sys_read()
sys_fork()
File System
bread()
Buffer Cache
Process Management
copy_mm()
OS Personality
hd_request()
Disk Device Driver
Memory Manager
copy_thread()
do_hd_io()
Hardware
11
CPU
Kernel Architecture (Cont`)

Micro-Kernel
 Mach, Chorus, L3/L4, SPIN, QNX, Windows NT, …
process
Server
Server
System Call
Microkernel
Hardware
12
Server
OS Functionality
Kernel Architecture (Cont`)

Micro-Kernel
process
read()
File System
Server
Process
Server
System Call
sys_read()
hd_request()
Microkernel
Hardware
 What is the advantage of a microkernel?
Windows-NT Architecture

Windows-NT
Applications
OS/2
Client
Logon
Process
NT Executive
POSIX
Server
Win32
Server
Security
Server
Object
Manager
POSIX
Client
OS/2
Server
Message
Protected
Subsystem
(Servers)
Win32
Client
Security
Ref. Monitor
Trap
User mode
System Services
I/O Manager
Kernel mode
Process
Manager
File System
Cache Manager
Device Drivers
LPC
Facility
VM
Mgt.
Network
Drivers
Kernel
Hardware Abstraction Layer(HAL)
HW Control
Hardware
(Source : Inside Windows NT)
14
Features

What is Good about UNIX
 Open system

free
 Small is beautiful philosophy

file: just stream of bytes
 Simple and Coherent

data, device, pipe, socket, memory, process, … can be treated as a single
abstraction (file)
 Portability


high-level language
new paradigm: OO, client-server model, clustering, PDA, MM Server
 True Parallelism

Multitasking (Time Sharing), Multiprogramming, Multiprocessor, MPP
15
Features (Cont`)

What is Wrong with UNIX
 Too many variants: a dumping ground
 Not small and simple any more: uncontrolled growth
 Building-block approach: inappropriate for beginners
 Lack of GUI: no longer the case
 In Ritchie’s words, “It takes a genius to understand and appreciate UNIX’s simplicity”
II. Process Management
17
Overview





What is process?
process state transition
context
scheduling
kernel entry point
 interrupt, trap, system call

signal
18
What is Process?

Definition
 an instance of a running program (runnable program)
 an execution environment of a program
 scheduling entity
 a control flow and address space
 PCB (Process Control Block) : proc. table and U area

Manipulation of Process
 create, destroy
 context
 state transition



dispatch (context switch)
sleep, wakeup
swap
19
Process State Transition
[Figure: process state transition diagram. States: initial (idle), ready to run, kernel running, user running, asleep, zombie, and the swapped-out states suspended ready and suspended asleep. Transition labels: fork, swtch, syscall or interrupt, return from syscall or interrupt, sleep/lock, wakeup/unlock, swap, exit, wait.]
(Source : UNIX Internals)
Process State Transition (Cont`)

Flow of execution : execution mode (cf: address space)
Kernel execution
process A execution
Kernel execution
process B creation
Interrupt or Trap
cause change of
execution modes
process C execution
Kernel execution
process B execution
Kernel execution
(Source : Magic Garden)
21
Context

context : system context, address (memory) context, H/W context
memory
proc table
file table
segment table
page table
fd
Registers (TSS)
eip
sp
eflags
eax
swap
cs
disk
….
U area
….
22
Context : system context

System context
 proc. Table







identification: pid, process group id, …
family relation
state
sleep channel: sleep queue
scheduling information : p_cpu, p_pri, p_nice, ..
signal handling information
address (memory) information
 U area
stores hardware context when the process is not running currently
 UID, GID
 arguments, return values, and error status for system call
 signal catch function
 file descriptor
 usage statistics
 The details may differ depending on the version and variant of UNIX

23
Context : address context

fork example
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

int  glob = 6;                        /* external variable in initialized data */
char buf[] = "a write to stdout\n";

int main(void)
{
    int   var;                        /* automatic variable on the stack */
    pid_t pid;

    var = 88;
    write(STDOUT_FILENO, buf, sizeof(buf)-1);
    printf("before fork\n");
    if ((pid = fork()) == 0) {        /* child */
        glob++; var++;
    } else
        sleep(2);                     /* parent */
    printf("pid = %d, glob = %d, var = %d\n", (int)getpid(), glob, var);
    exit(0);
}
(Source : Adv. programming in the UNIX Env., pgm 8.1)

 Guess what output this program produces.
Context : address context (Cont`)

fork internal : compile results
gcc
test.c
header
text
0xffffffff
0xbfffffff
…
movl %eax, [glob]
addl %eax, 1
movl [glob], %eax
...
glob, buf
data
kernel
bss
stack
var, pid
stack
0x0
data
text
a.out : ELF format
Executable and Linking Format
user’s perspective (virtual address)
25
Context : address context (Cont`)

fork internal : before fork (after run a.out)
memory
proc T.
segment T.
text
var, pid
pid = 11
stack
glob, buf
data
cf) we assume that there is no paging mechanism in this figure.
26
Context : address context (Cont`)

fork internal : after fork
proc T.
memory glob, buf
segment T.
data
pid = 11
text
var, pid
stack
proc T.
segment T.
glob, buf
pid = 12
data
stack
 address space : basic protection barrier
27
var, pid
Context : address context (Cont`)

fork internal : with COW (Copy on Write) mechanism
after “glob++” operation
after fork with COW
memory
proc T.
segment T.
pid = 11
proc T.
text
segment T.
pid = 11
text
stack
proc T.
data
stack
segment T.
proc T.
pid = 12
segment T.
pid = 12
data
data
28
Context : address context (Cont`)

execve internal
memory
proc T.
segment T.
data
pid = 11
text
a.out
stack
text
data
stack
29
header
text
data
bss
stack
Context : hardware context

time sharing (multitasking)
Where am I ??
time quantum
process 1
…
process 2
process 3
30
Context : hardware context (Cont`)

brief reminder of the 80x86 architecture
ALU
Control Unit
IN
OUT
Registers
• eip, eflags
• eax, ebx, ecx, edx, esi, edi, …
• cs, ds, ss, es, ...
• cr0, cr1, cr2, cr3, GDTR, TR, ...
31
Context : hardware context (Cont`)

context swtch
save
context
Proc T.
TSS
eip
sp
eflags
eax
CPU
Proc T.
cs
U area
restore
context
TSS
eip
sp
eflags
eax
cs
U area
32
Context : hardware context (Cont`)

context swtch : pseudo-code in UNIX
…
/* need context swtch */
if (save_context())
{
/* pick another process to run from ready queue */
….
restore_context(new process)
/* The control does not arrive here, NEVER !!! */
}
/* resuming process executes from here !!! */
…...
(Source : The Design of the UNIX OS)

trick : the return value of save_context() is passed in a register (e.g., eax on the 80x86 CPU)
 Think about the difference between context switch and system call.
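The “returns twice” behaviour of the pseudo-code can be imitated in user space with the standard C setjmp()/longjmp() pair. This is only an analogy under stated assumptions, not kernel code: setjmp() returns 0 when the context is saved and non-zero when it is resumed, which is the opposite polarity of save_context() in the pseudo-code, and the function name run_once() is hypothetical.

```c
#include <setjmp.h>
#include <stdio.h>

static jmp_buf saved_context;   /* plays the role of the saved hardware context */

/* Saves a context, "switches away", and counts how often the resume point runs. */
int run_once(void)
{
    volatile int resumed = 0;

    if (setjmp(saved_context) == 0) {
        /* First return: registers, stack pointer and pc were just saved. */
        printf("context saved, switching away\n");
        longjmp(saved_context, 1);   /* analogous to restore_context() */
        /* Control never arrives here. */
    }
    /* Second return from setjmp(): the resumed flow continues from here. */
    resumed++;
    printf("resumed after restore\n");
    return resumed;
}
```

Calling run_once() prints both messages and returns 1: the code after the if block runs only once, on the resumed path. A real context switch differs in that the restored context belongs to a different process.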
33
Process Scheduling

Process scheduling
 allocates the CPU among the competing processes
 criteria : fairness, efficiency (response time vs. throughput)

types of processes
 Interactive
 Batch (computation-intensive)
 Real-time (e.g., video, hospital systems)

types of scheduling
 Preemptive scheduling: other processes can take the CPU away from the currently running process
 Non-preemptive scheduling (e.g., Win16 applications under Windows 98): other processes cannot take the CPU away from the currently running process
Scheduling Criteria

CPU utilization
Throughput
 completed processes per unit time

Turnaround time
 from process start to completion

Waiting time
 total time spent in the ready queue

Response time
 time from job submission until the first response begins
Process Scheduling (Cont`)

Existing Policies
 FCFS (First Come First Served): e.g., a queue at a bank
 RR (Round-Robin): time quantum of 10-100 milliseconds
 SJF (Shortest Job First)
 Multilevel Feedback Queue: multiple queues
 EDF (Earliest Deadline First)
 RM (Rate Monotonic)
 Fair Queuing
 Gang Scheduling
 Causality Scheduling
 Process migration
Process Scheduling (Cont`)

UNIX : Round Robin with multilevel Feedback Queue
 Round-Robin
Ready Queue
P3
P2
P1
37
CPU
Process Scheduling (Cont`)
 Multilevel Feedback Queue
Ready Queue 1
P8
P7
P6
CPU
P4
CPU
Ready Queue 2
P5
•higher priority
•less time quantum
…….
Ready Queue n
P3
P2
CPU
P1
38
Process Scheduling (Cont`)

Round-Robin : real implementation
 scheduling information in proc. table : p_pri, p_cpu, p_nice
every clock tick : increment p_cpu of the currently running process
every second : p_cpu = p_cpu * decay factor (generally 1/2)
p_pri = PUSER + p_cpu/2 + p_nice
 Example of System III
3 processes, PUSER = 50, p_nice = 0, 60 clock ticks per second

              P1             P2             P3
second   p_pri  p_cpu   p_pri  p_cpu   p_pri  p_cpu
  0        50     0       50     0       50     0
  1        65    30       50     0       50     0
  2        57    15       65    30       50     0
  3        53     7       57    15       65    30
  4        66    33       53     7       57    15
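The System III numbers can be checked with a small simulation. The sketch below assumes that the selected process runs for a whole second (60 ticks), that a numerically lower p_pri means higher priority, and that ties go to the lowest index; struct proc and simulate() are names chosen for this illustration, not kernel code.

```c
#define NPROC  3
#define PUSER  50
#define HZ     60   /* clock ticks per second, as in the example */

struct proc {
    int p_pri;   /* smaller value = higher priority */
    int p_cpu;   /* recent CPU usage */
    int p_nice;
};

/* Pick the process with the numerically lowest p_pri (ties: lowest index). */
static int pick_next(const struct proc *p)
{
    int best = 0;
    for (int i = 1; i < NPROC; i++)
        if (p[i].p_pri < p[best].p_pri)
            best = i;
    return best;
}

/* Simulate `seconds` seconds; the chosen process runs for the full second. */
void simulate(struct proc *p, int seconds)
{
    for (int s = 0; s < seconds; s++) {
        int cur = pick_next(p);
        p[cur].p_cpu += HZ;                 /* one increment per clock tick */
        for (int i = 0; i < NPROC; i++) {   /* once-per-second recomputation */
            p[i].p_cpu /= 2;                /* decay factor 1/2 */
            p[i].p_pri = PUSER + p[i].p_cpu / 2 + p[i].p_nice;
        }
    }
}
```

Starting from three processes with p_pri = 50 and p_cpu = 0, four simulated seconds give (p_pri, p_cpu) = (66, 33), (53, 7), (57, 15), matching the last row of the example.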
Process Scheduling (Cont`)
 Example of BSD




decay factor : (2*load_average) / (2*load_average + 1)
p_pri = PUSER + (p_cpu/4) + (2*p_nice)
clock tick is 10msec
time quantum is 10 clock ticks
 Example of Mach


decay factor : 5/8
p_usrpri = PUSER + (3.8*(max(1,M/P) ) * p_cpu )/T + 0.5 * p_nice
 Example of SVR4


support REAL-TIME class process
class independent scheduler / class dependent scheduler
 Example of LINUX



support REAL-TIME process
select a process that has the highest value of “priority + counter”
“counter” of the current process decreases at each clock tick.
40
Process Scheduling (Cont`)

Range of Process Priorities
Kernel Mode
Priority
Swapper
P
Waiting for Disk I/O
P
P
P
Waiting for Buffer
Waiting for Inode
P
Waiting for TTY IO
User Mode
Priority
Waiting for Child Exit
P
P
User Level 0 (50)
P
P
User Level 1
P
……
P
User Level n
41
P
P
(Source : The Design of the UNIX OS)
Kernel Entry Point



Interrupt
Trap
system call
device
kernel
process
MM
HWM
PM
FS
42
DD
Interrupt Handling

Interrupt
 a mechanism by which peripheral devices notify the UNIX operating system of an asynchronous event
Real time Clock
Kernel
CPU
IVT
disk
PIC
tty
network
interrupt handlers
0
clock()
clock()
1
nmi()
disk_intr()
2
tty_intr()
3
disk_intr()
4
net_intr()
cdrom
….
 What is the difference between polling and interrupts? (analogy: the announcement of exam results)
Interrupt Handling (Cont`)

interrupt handling mechanism
 similar to receiving a letter while on the telephone
 steps
if in user mode, switch to kernel mode
save the context of the current process (make a new context layer)
determine the interrupt source
find the interrupt vector and call the interrupt handler
…. interrupt handling …..
restore the saved context
 What if another interrupt is triggered while an interrupt is being handled?
Interrupt Handling (Cont`)
clock interrupt handler ( timer_interrupt() in Linux )

clock()
{
    restart clock                    /* will interrupt again */
    if (callout table not empty)     /* e.g., timer_list in LINUX */
        adjust time and schedule callout function if necessary
    if (profiling on)
        count program counter at time of interrupt
    gather statistics per process and system
    update CPU usage for the current running process
    if (one second elapsed) {
        alarm handling
        calculate the p_pri for all processes
        reschedule if necessary
        wake up swapper or page daemon if necessary
    }
}
(Source : The Design of the UNIX OS)
45
Trap Handling

trap : a synchronous event caused by the instruction currently being executed (exception or software interrupt)
IVT
0
div_by_zero()
1
invalid_opcode()
2
overflow()
3
segment_fault ()
4
page_fault ()
….
20
clock()
21
nmi()
22
tty_intr()
23
disk_intr()
24
net_intr()
….
80
system_call()
….
46
System Call Handling

system call : an example of trap
Kernel
trap
sys_call_table (sysent[])
IVT
0
div_by_zero()
0
sys_no_syscall()
1
invalid_opcode()
1
sys_exit()
2
overflow()
2
sys_fork()
3
segment_fault ()
3
sys_read ()
4
page_fault ()
4
system_call()
sys_write ()
….
80
system_call()
47
….
….
sys_getpid()
….
255
sys_no_syscall()
47
sys_fork()
sys_read()
System Call Handling (Cont`)

invoke system call
Kernel
process
main()
{
….
fork()
}
libc.a
….
fork()
{
….
movl $2, eax
trap $80
….
}
….
read()
{
…
}
IVT
sys_call_table (sysent[])
0 div_by_zero()
0 sys_no_sys()
1 in_opcode()
1 sys_exit()
2 overflow()
2 sys_fork()
sys_fork()
3 seg_fault ()
4 page_fault ()
3 sys_read ()
4 sys_write ()
sys_read()
….
….
80 system_call()
….
47 sys_getpid()
….
255 sys_no_sys()
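The two-level lookup in the figure (the trap vector selects system_call(), which then indexes sys_call_table[] with the number passed in a register) can be sketched in user space as a table of function pointers. Everything below is hypothetical and for illustration only: the table size, the slot numbers and the two demo handlers are not the real kernel entries.

```c
#include <stddef.h>

#define NR_SYSCALLS 8                 /* hypothetical table size for the sketch */

typedef long (*syscall_fn)(long arg);

static long sys_no_syscall(long arg)  { (void)arg; return -1; }  /* unused slot */
static long sys_getpid_demo(long arg) { (void)arg; return 11; }  /* pretend pid */
static long sys_double_demo(long arg) { return 2 * arg; }        /* toy service */

/* Slots that are not listed stay NULL and are treated as "no such call". */
static syscall_fn sys_call_table[NR_SYSCALLS] = {
    [0] = sys_no_syscall,
    [2] = sys_double_demo,
    [5] = sys_getpid_demo,
};

/* Common entry point: validate the number, then dispatch through the table. */
long system_call(long nr, long arg)
{
    if (nr < 0 || nr >= NR_SYSCALLS || sys_call_table[nr] == NULL)
        return sys_no_syscall(arg);
    return sys_call_table[nr](arg);
}
```

With this table, system_call(2, 21) returns 42, while an out-of-range or empty slot such as system_call(7, 0) returns -1. In the real kernel the number arrives in a register (eax in the figure) rather than as a C argument.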
48
System Call Handling (Cont`)

how to make a new system call
 write the new system call function in kernel space
 allocate a syscall number (an empty slot in sys_call_table[]) and register the function
 rebuild the kernel
 rebuild the library (ar, ranlib)
 write your program using the new system call
Signal

a mechanism for notifying a process of an asynchronous event
 types of signal : SIGKILL, SIGINT, SIGBUS, SIGUSR1, ….
 action : abort, exit, ignore, stop, user level catch function
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

void sig_handler(int signo)
{
    signal(SIGUSR1, sig_handler);           /* reinstall */
    printf("received signal %d\n", signo);  /* handle the signal */
    …..
}
int main(void)
{
    signal(SIGUSR1, sig_handler);           /* install the handler */
    ….
    for ( ; ; )
        pause();
}
 What is the difference among an interrupt, a trap, and a signal?
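The handler above reinstalls itself because older signal() implementations reset the disposition each time a signal is delivered. A minimal alternative sketch uses POSIX sigaction(), which keeps the handler installed unless SA_RESETHAND is requested. The helper names install_handler() and last_signal() are hypothetical, and the handler only records the signal number because printf() is not async-signal-safe.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t got_signal = 0;

/* Handler: do only async-signal-safe work, here a single flag store. */
static void on_sigusr1(int signo)
{
    got_signal = signo;
}

/* Install the handler for SIGUSR1; returns 0 on success, -1 on error. */
int install_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);   /* block no additional signals in the handler */
    sa.sa_flags = 0;            /* handler stays installed after delivery */
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Number of the last signal seen by the handler, or 0 if none yet. */
int last_signal(void)
{
    return got_signal;
}
```

After install_handler() succeeds, raise(SIGUSR1) or kill(getpid(), SIGUSR1) runs the handler, and last_signal() then returns SIGUSR1.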
50
Signal (Cont`)
 register signal handler (signal catch function )
 send signal
 signal detection : state transition from kernel running to user running
 call signal handler
 variables for signal in task structure in LINUX



int sigpending : is signal received or not?
struct signal_struct *sig
sigset_t signal, blocked
typedef struct {
unsigned long sig[_NSIG_WORDS];
} sigset_t; /* asm-i386/signal.h */
struct sigaction /* asm-i386/signal.h */
sa_handler
sa_flags
sa_restorer
sa_mask
struct signal_struct /* sched.h */
count
action[_NSIG]
siglock
III. Memory Management
52
Memory Hierarchy

hierarchy
register
CPU cache
Main Memory
• larger capacity
• lower speed
• lower cost
Secondary Storage
Server (or INTERNET)
 caching is more and more important (how to keep consistency?)
53
Memory Management Strategy

Three strategies
 Fetch strategy: when is a process (page) brought into memory?
demand fetch
prefetch (agent in Web)
 Placement strategy: where in memory is a process (page) placed?
first fit, best fit, worst fit
 Replacement strategy: which process (page) is evicted from memory?
LRU, LFU, MRU, …
History of Memory Management System

single user system (stone age of memory management)
 overlay

fixed partition multiprogramming system
 absolute assembler, relocating assembler

variable partition multiprogramming system
 coalescing , compaction

virtual memory system
 paging
 segmentation (segment, region, vm_object)
 paging/segmentation
55
Overlay

for a process larger than the memory allocated to it
e.g., a two-pass assembler
symbol table (20K)
common routines (30K)
overlay driver (10K)
pass 1 (70K)
pass 2 (80K)
History (Cont`)

variable partition multiprogramming system
Scenario
• fork P1 (40K)
• fork P2 (20K)
• fork P3 (10K)
• fork P4 (20K)
• fork P5 (40K)
• fork P6 (20K)
• fork P7 (70K)
• exit P1
• exit P3
• exit P4
• exit P6
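memory and kernel internals
memory layout (addresses in K): kernel 0-100, P1 100-140, P2 140-160, P3 160-170, P4 170-190, P5 190-230, P6 230-250, P7 250-320, unused 320-400
free memory map after the exits (start, end, size)
100  140  40
160  190  30
230  250  20
320  400  80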
Memory Management Strategy : Placement
memory and kernel internals
[Figure: the same memory layout and free memory map (start, end, size): (100, 140, 40), (160, 190, 30), (230, 250, 20), (320, 400, 80).]
Scenario
• fork P1 (40K)
• fork P2 (20K)
• fork P3 (10K)
• fork P4 (20K)
• fork P5 (40K)
• fork P6 (20K)
• fork P7 (70K)
• exit P1
• exit P3
• exit P4
• exit P6
• fork P8 (25K): where to go??
Memory Management Strategy : Placement
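[Figure: the same memory layout and free memory map (start, end, size): (100, 140, 40), (160, 190, 30), (230, 250, 20), (320, 400, 80).]
Scenario: fork P8 (25K)
first fit : the first hole that is large enough, at 100 (40K)
best fit : the smallest hole that is large enough, at 160 (30K)
worst fit : the largest hole, at 320 (80K)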
 issue : fragmentation
 employed at swap management, KMA (kernel memory allocator)
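The three placement policies can be sketched as one search over a free memory map. The code below is an illustration under assumptions: struct hole, enum fit and place() are hypothetical names, sizes are in K, and the map holds the four free holes from the scenario, (100, 40), (160, 30), (230, 20) and (320, 80) as (start, size).

```c
#include <stddef.h>

struct hole { int start; int size; };   /* one entry of the free memory map (K) */

enum fit { FIRST_FIT, BEST_FIT, WORST_FIT };

/* Return the start address of the chosen hole, or -1 if none is large enough. */
int place(const struct hole *map, size_t n, int request, enum fit policy)
{
    int chosen = -1;

    for (size_t i = 0; i < n; i++) {
        if (map[i].size < request)
            continue;                         /* hole too small */
        if (chosen < 0) {
            chosen = (int)i;                  /* first hole that fits */
            if (policy == FIRST_FIT)
                break;
        } else if (policy == BEST_FIT && map[i].size < map[chosen].size) {
            chosen = (int)i;                  /* tighter fit found */
        } else if (policy == WORST_FIT && map[i].size > map[chosen].size) {
            chosen = (int)i;                  /* larger hole found */
        }
    }
    return chosen < 0 ? -1 : map[chosen].start;
}
```

For the 25K request of P8, first fit returns 100, best fit returns 160 and worst fit returns 320. A real allocator would also split the chosen hole and update the map, which this sketch leaves out.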
59
Virtual Memory

virtual memory : separate virtual address and physical address
virtual address
kernel stack
0xffffffff
kernel
kernel bss
stack
kernel data
kernel text
bss
data
text
page
0x0
60
Virtual Memory (Cont`)

virtual address : Linux case
0xffffffff
kernel
0xc0000000 env_end
arg_end
arg_start
start_stack
stack
shared memory
bss
data
text
bss
end_bss
end_data
data
text
end_code
start_code
brk
shared C library
bss
end_data
data
end_code
0x0
other shared library
program
text
start_code
(Source : Linux Internals)
61
Virtual Memory (Cont`)

physical memory
 consists of kernel and a set of processes
physical memory
0x4ffffff
P4
P3
P2
P1
kernel
0x0
62
Virtual Memory (Cont`)

physical memory
 a collection of page frames (4K or 8K each)
physical memory
P1
page frame n
page frame n-1
….
P2
page frame 5
page frame 4
page frame 3
page frame 2
page frame 1
63
P3
Virtual Memory (Cont`)

address translation
segment table
origin register
virtual address
v = (s, p, d)
offset
segment page
number
number
p
d
s
b
+
s'
+
segment table
p'
page frame
number
p'
page table
64
offset
d
physical address
Virtual Memory (Cont`)

address translation : table structure
V segment start address (s’) L R W E A
segment table
V page frame number (p’)
D R U W COW
page table
cf) disk block descriptor per each page table entry
swap (fs) number block number type (fill 0, demand fill)
65
Virtual Memory (Cont`)

execve (final)
memory
nK
n-1 K
proc T.
segment T.
1
1
0
1
0
0
0
0
0
0
1
0
4K
28 K
20 K
12 K
32 K
28 K
24 K
20 K
16 K
12 K
8K
4K
0K
page T.
66
T2
a.out
0K
text
D1
12 K
S1
T1
header
48 K
data
stack
Virtual Memory (Cont`)

anonymous
pages of segment
SVR 4.0 virtual memory structure
struct proc
p_as
struct as
seg_list
hint
struct hat
struct seg
as_ptr
private
s_ops
base
size
struct
segvn_data
private
data
as_ptr
private
s_ops
base
size
anon_map
vnode
resident
pages of file
virtual address space
as_ptr
private
s_ops
base
size
text
data
stack
as_ptr
private
s_ops
base
size
u area
67
Virtual Memory (Cont`)

BSD (Mach) virtual memory structure
struct task
vm_map
struct vm_map
first hint last
struct vm_map_entry
struct vm_object
struct vm_page
resident
page list
68
struct pmap
Virtual Memory (Cont`)

Linux virtual memory structure
task_struct
mm
mm_struct
count
pgd
mmap
vm_area_struct
vm_end
vm_start
vm_flag
vm_inode
vm_end
vm_area_struct
vm_end
vm_start
vm_flag
vm_inode
vm_end
69
Data
Code
Virtual Memory (Cont`)

advantage of virtual memory
 large address space
 no need of placement strategy
 flexible memory object sharing among the processes
P1
segment T.
1
1
0
1
0
4K
28 K
memory
20 K
page T.
P2

segment T.
1
1
0
1
8K
28 K
40 K
page T.
no free lunch : disadvantage of virtual memory
 address translation
70
Virtual Memory (Cont`)

address translation with TLB (Translation Lookaside Buffer)
segment table
origin register
virtual address
v = (s, p, d)
offset
segment page
number
number
p
d
s
b
+
s p
p'
s'
TLB (associative memory)
+
segment table
p'
page frame
number
p'
page table
71
offset
d
physical address
Virtual Memory (Cont`)

HAT (Hardware Address Translation)
 isolate all hardware dependent code

HAT in SVR4, pmap in BSD, pgd in Linux, ...
 responsible for all address translation, transparently
 case study : 80*86 CPU
segment descriptor
table (GDT, LDT)
virtual address
16bit
segment descriptor
32bit
offset
segment translation
32bit
linear address
72
cf) 80x86 reminder
GDT - available for all tasks
- segment for OS code data
- descriptor for LDT, TSS
LDT - for a specific task
IDT - interrupt service routine
Virtual Memory (Cont`)

HAT (Hardware Address Translation):Paging
 case study : 80*86 CPU
31
linear address
22 21
12 11
0
DIR PAGE offset
31
11 0
31
11 0
31
PFN
11
PFN
PFN
0
offset
physical address
page directory
page table
CR3
control register:
Page Directory Base Register

page table entry (32 bits)
bits 31-12 : PFN (page frame number)
•D (bit 6): Dirty
•R (bit 5): Referenced
•U (bit 2): User/Supervisor
•W (bit 1): Read/Write
•P (bit 0): Present (valid)
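The two-level lookup can be made concrete by splitting a 32-bit linear address into its DIR (bits 31-22), PAGE (bits 21-12) and offset (bits 11-0) fields. The sketch below only does the bit arithmetic on plain integers; the function names are hypothetical, and translate() takes an already fetched page table entry instead of walking real tables.

```c
#include <stdint.h>

#define PAGE_SHIFT   12
#define DIR_SHIFT    22
#define INDEX_MASK   0x3FFu   /* 10-bit index into directory or page table */
#define OFFSET_MASK  0xFFFu   /* 12-bit offset within a 4K page */
#define PTE_PRESENT  0x1u     /* P bit (bit 0) of a page table entry */

uint32_t dir_index(uint32_t linear)   { return linear >> DIR_SHIFT; }
uint32_t page_index(uint32_t linear)  { return (linear >> PAGE_SHIFT) & INDEX_MASK; }
uint32_t page_offset(uint32_t linear) { return linear & OFFSET_MASK; }

/* Combine the PFN of a page table entry with the offset of the address.
 * Returns 0 on success, -1 if the page is not present (a page fault). */
int translate(uint32_t pte, uint32_t linear, uint32_t *physical)
{
    if (!(pte & PTE_PRESENT))
        return -1;
    *physical = (pte & ~OFFSET_MASK) | page_offset(linear);
    return 0;
}
```

For the linear address 0xC0101234 the fields are DIR = 0x300, PAGE = 0x101 and offset = 0x234; with a present entry whose frame starts at 0x00ABC000 the physical address is 0x00ABC234.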
Replacement Strategy

Which page can be evicted from memory ?
memory
replacement policy
p2
p4
p1
p3
p7
page fault for p8
p8
disk
 goal : reduce the number of page faults and avoid thrashing
74
Replacement Strategy (Cont`)

basic principle of replacement : locality
 temporal locality : stack, tree traverse, counting variable
 spatial locality : array, sequential code, file reference

replacement policy
 FIFO (First In First Out)
 LRU (Least Recently Used)
 LFU (Least Frequently Used)
 NUR (Not Used Recently)
 MRU (Most Recently Used)
 Working Set
 Second Chance(FIFO+reference bit)
75
Replacement Strategy (Cont`)

example : FIFO, LRU, LFU
scenario : page reference order
system internals
p1, p2, p3, p1, p4, p2, p1, p3, p4, p7, p8
memory
p2
p4
p1
p3
p7
disk
p8
 guess which page will be evicted from memory under the LRU policy?
 which policy is the best policy?
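The scenario can be replayed with a small simulator. This is a sketch under assumptions: memory has 5 frames (as the figure suggests), pages are plain integers, ties are broken by the lowest frame index, and the names simulate(), struct frame and enum policy are chosen for this illustration.

```c
enum policy { FIFO, LRU, LFU };

#define MAX_FRAMES 16

struct frame {
    int page;       /* resident page number, -1 if the frame is empty */
    int loaded_at;  /* time the page was brought in (used by FIFO) */
    int last_used;  /* time of the most recent reference (used by LRU) */
    int count;      /* references since the page was loaded (used by LFU) */
};

/* Replay refs[0..n-1] on nframes empty frames. Returns the number of page
 * faults, or -1 for an invalid nframes. *victim receives the page evicted
 * by the most recent eviction, or -1 if nothing was evicted. */
int simulate(const int *refs, int n, int nframes, enum policy pol, int *victim)
{
    struct frame f[MAX_FRAMES];
    int faults = 0;

    *victim = -1;
    if (nframes < 1 || nframes > MAX_FRAMES)
        return -1;
    for (int i = 0; i < nframes; i++)
        f[i].page = -1;

    for (int t = 0; t < n; t++) {
        int hit = -1, slot = -1;
        for (int i = 0; i < nframes; i++) {
            if (f[i].page == refs[t])
                hit = i;
            else if (f[i].page == -1 && slot < 0)
                slot = i;                     /* first empty frame */
        }
        if (hit >= 0) {                       /* page is resident */
            f[hit].last_used = t;
            f[hit].count++;
            continue;
        }
        faults++;
        if (slot < 0) {                       /* memory full: choose a victim */
            slot = 0;
            for (int i = 1; i < nframes; i++) {
                int better =
                    pol == FIFO ? f[i].loaded_at < f[slot].loaded_at :
                    pol == LRU  ? f[i].last_used < f[slot].last_used :
                                  f[i].count < f[slot].count;
                if (better)
                    slot = i;
            }
            *victim = f[slot].page;
        }
        f[slot] = (struct frame){ refs[t], t, t, 1 };
    }
    return faults;
}
```

For the reference order p1, p2, p3, p1, p4, p2, p1, p3, p4, p7, p8 with 5 frames, every policy takes 6 faults, but the page evicted for p8 differs: FIFO evicts p1, LRU evicts p2 and LFU evicts p7. Which policy is best depends on the reference pattern, so a single trace cannot settle the question.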
76
Replacement Strategy (Cont`)

Project I : program a simulator for FIFO, LRU, and LFU policy and
compare their performance.
 assume
- memory consists of 20 page frames
- page numbers range from 0 to 49
- the number of references is 300
 program the 3 policies
- use a linked list for FIFO and LRU
- use a priority tree for LFU if possible
- use a hash table to find a page quickly
 compare the performance and discuss it
77
Replacement Strategy (Cont`)

Example of real implementation in UNIX : buffer cache
head
lru list header
hash queue header
tail
(page_no % 5 ) = 0
10
45
(page_no % 5 ) = 1
21
26
(page_no % 5 ) = 2
2
(page_no % 5 ) = 3
33
28
(page_no % 5 ) = 4
24
19
30
3
43
(Source : The Design of the UNIX OS)
78
Replacement Strategy (Cont`)

example : NUR
 used by pagedaemon (two-handed clock algorithm)
V page frame number (p’)
possible
combination
D R U W COW
0
0
1
1
0
1
1
0
79
replace page having (0,0)
combination first
Swapper vs. PageDaemon

swapping and paging
 replace some object from memory when memory is almost full.

swapping
 object : process
 swap in/ swap out
 swap space management


similar to variable partition multiprogramming
paging
 object : page
 page fault handling
80
IV. File System
81
Overview of File System
process 1
….
process 2
process n
User mode
System mode
Virtual File System
ffs
nfs
ext2fs
ntfs
buffer cache
….
mmfs
procfs
File System
device driver
82
User Interface

System call
 open
 read/write
 close
 dup
 link
 pipe, mkfifo
 mkdir, readdir
 mknod
 stat
 mount
 sync, fsck
83
User Interface (Cont`)

file descriptor, file table, inode (vnode)
proc table
fd
segment table
file table
vnode
inode
TSS
U area
84
User Interface (Cont`)

fork vs open
fork
proc table
open same file
fd
vnode
proc table
file table
vnode
file table
parent
proc table
fd
parent
fd
file table
child
how about dup?
85
Disk system

physical view
 platter, arm, head
 cylinder, track, sector
 seek time, rotational latency, transmission time

logical view (a viewpoint of UNIX)
 disk is a collection of disk blocks
 the disk block size is usually equal to the page frame size
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
….
86
Structure of File

disk block allocation
 want to create a file with size of 14 K
 assume - disk block size is 4 K.
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
 sequential allocation
 non sequential allocation

block chain, indexed block, FAT
87
..
Structure of File (Cont`)

non sequential allocation
 block chain
new file name
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
88
..
Structure of File (Cont`)

non sequential allocation
 index block
new file name
…...
index block
 what if the index block is full ?
89
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
..
Structure of File (Cont`)

non sequential allocation
 FAT (File Allocation Table)
FAT
new file name
4
5
NIL
12
11
6
9
21
34
NIL
UN
NIL
7
UN
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
..
 What are the advantages and disadvantages of block chain, index block, and FAT?
90
Structure of File (Cont`)

sequential allocation
new file name
start
size
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
..
 What are the advantages and disadvantages of sequential versus non-sequential allocation?
91
Structure of File (Cont`)

inode in Unix File System
inode
type (4bit) u g s r w x r w x r w x
i_inode_number
i_mode
i_nlink
i_uid, gid
i_rdev
i_atime, ctime, mtime
S_IFSOCK
S_IFLNK
S_IFREG
S_IFBLK
S_IFDIR
S_IFCHR
S_IFIFO
direct
….
indirect
92
Structure of File (Cont`)

inode in Unix File System: find block
 assume the size of disk block is 4K
 which disk block holds the byte at f_offset 10000 (or 47000)?
file table
inode
4
7
12
18
f_offset
24
direct
33
….
41
indirect
165
93
169
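The lookup for a given f_offset is integer arithmetic on the block size. The sketch below assumes 4K blocks as stated above and, as an additional assumption not taken from the figure, 10 direct entries as in the classic System V inode; only the single indirect block is handled, and struct blkpos and locate() are hypothetical names.

```c
#define BLOCK_SIZE 4096L   /* 4K disk blocks, as assumed above */
#define NDIRECT    10      /* assumption: number of direct entries in the inode */

struct blkpos {
    long logical;   /* logical block number within the file */
    int  indirect;  /* 0: use direct[index]; 1: use entry index of the indirect block */
    long index;     /* index into the direct array or the single indirect block */
    long byte;      /* byte offset inside that block */
};

/* Map a file offset to its position in the inode's block list. */
struct blkpos locate(long f_offset)
{
    struct blkpos pos;

    pos.logical  = f_offset / BLOCK_SIZE;
    pos.byte     = f_offset % BLOCK_SIZE;
    pos.indirect = pos.logical >= NDIRECT;
    pos.index    = pos.indirect ? pos.logical - NDIRECT : pos.logical;
    return pos;
}
```

Offset 10000 falls in logical block 2 at byte 1808, so it is reached through the third direct entry. Offset 47000 falls in logical block 11 at byte 1944, which under the 10-direct-entry assumption is entry 1 of the single indirect block.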
Structure of Directory

connect file name to disk block(s)
directory entry in UNIX FS
inode number
file name
directory entry in DOS
file name extension attributes time first block number

provide hierarchical structure for file system
inode 1 disk block 1
inode 3
i_mode
time
….
1
i_mode
time
….
7
1
1
3
4
5
6
7
9
..
.
usr
dev
etc
vmunix
var
mnt
disk block 7
1 ..
3 .
12 src
16 include
17 lib
20 bin
23 member
25 local
94
inode 23 disk block 39
i_mode
time
….
39
3 ..
23 .
32 jim
33 tom
37 mark
41 sooni
42 mjc
Structure of Directory (Cont`)

hierarchical view
/
usr
src
dev
include
etc
lib
jim
var
bin
member
tom
mark
95
mnt
vmunix
local
sooni
mjc
Structure of Directory (Cont`)

open example
 open("/usr/member/sooni/test.c", O_RDONLY)


find inode using directory structure (namei())
allocate fd, file table and initialize
proc table
fd
file table
inode
f_offset
….
96
Structure of File System

file system: boot, super, inode, data block
/dev/hda
/dev/hdb
system
/dev/hda1
/dev/hda3
/dev/hda2
boot
super
i-node
disk blocks
97
Structure of File System (Cont`)

super block : manage information for file system
 (cf: inode for file)
struct superblock
s_type
s_flag
s_dev
s_blocksize
s_magic
s_name
….
s_free_inode []
s_free_disk block []
free inode list (map)
...
free disk block list (map)
...
 iget, iput
 balloc, bfree
98
Structure of File System (Cont`)

super block
struct superblock
s_type
s_flag
s_dev
s_blocksize
s_magic
s_name
….
s_free_inode []
s_free_disk block []
29 27 26 24 21 20 19
61 57 56 54 51 50 48 46 45 43 42 41 39 38 37 34
disk block 29
disk block 61
……
99
Structure of File System (Cont`)

mount
vfsmntlist
“mount /dev/hda3 /mnt”
super block
for /dev/hda3
inode for /mnt
inode for root on FS of /dev/hda3
 open("/mnt/test.c", O_RDONLY)
100
s_dev
s_blocksize
mounted point
root inode
...
vfsmount
mnt_sb
vfsmount
Inode for special file

inode structure for special file
 pipe
no indirect block (unnamed pipe)
readers, writers, read pointer, write pointer
 special device file
no direct or indirect block
device number : major number + minor number
major number : identifies the device type; used as an index into the device switch table
minor number : identifies the device unit; passed as an argument to the device driver
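How the two halves of a device number are used can be sketched with a small device switch table. The 8-bit minor field, the macro definitions, the table contents and the demo open routines are all assumptions made for this illustration; real kernels choose their own encoding and entry points.

```c
#include <stddef.h>

#define MINOR_BITS 8   /* assumption: low 8 bits hold the minor number */

#define MAJOR(dev)     ((dev) >> MINOR_BITS)
#define MINOR(dev)     ((dev) & ((1u << MINOR_BITS) - 1))
#define MKDEV(ma, mi)  (((ma) << MINOR_BITS) | (mi))

/* One row of the device switch table: the driver's entry points. */
struct devsw {
    int (*d_open)(unsigned int minor);
};

/* Hypothetical drivers: the return value encodes which unit was opened. */
static int tty_open(unsigned int minor)  { return 100 + (int)minor; }
static int disk_open(unsigned int minor) { return 200 + (int)minor; }

static const struct devsw dev_switch[] = {
    { NULL },        /* major 0: no driver configured */
    { tty_open },    /* major 1: terminals */
    { disk_open },   /* major 2: disks */
};

/* The major number selects the driver; the minor number is its argument. */
int dev_open(unsigned int dev)
{
    unsigned int major = MAJOR(dev);

    if (major >= sizeof dev_switch / sizeof dev_switch[0] ||
        dev_switch[major].d_open == NULL)
        return -1;
    return dev_switch[major].d_open(MINOR(dev));
}
```

dev_open(MKDEV(2u, 3u)) dispatches to disk_open with unit 3 and returns 203, while a major number without a driver returns -1.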
101
Existing File System

S5FS
 first and conventional UNIX file system

FFS
 support 255 characters file name
 cylinder groups
 fragments

LFS
 optimized for small writes
 suitable for RAID storage system

directory entry for ffs
i_no size file_name
fast file system structure
boot block
super block
cylinder group 1
(inode, disk blocks)
cylinder group 2
VxFS (Journaling File System)
 fast recovery using internal logging
…...
102
Existing File System

ext2 File System
 Linux default file system
 similar to Berkeley’s FFS
 inode : 12 direct block
 used bitmap for free block and inode management
 fault-tolerant features
Ext2 file system structure
super block
boot block
Group descriptor
Block group 0
Block bitmap
Block group 1
Inode bitmap
……
Inode table
Block group n
Data Blocks
103
Existing File System

NFS
 stateless protocol
 XDR (External Data Representation)

AFS, Coda File System
 disconnected operation

Sprite File System
VFS

application
nfsd
VFS
 to support various file system

nfs server
system call
 strong consistency

nfs client
mfs
procfs
VFS
NFS
RPC stub
104
NFS
RPC stub
XDR
UFS
swap space management

swap space management
P1
stack
swap space
0
P1
P2
P3
P4
P5
P6
data
text
P2
stack
data
text
64M
105
swap space management

swap used map
Scenario
• swap out P1 (3M)
• swap out P2 (3M)
• swap out P3 (2M)
• swap out P4 (1M)
• swap out P5 (3M)
• swap out P6 (4M)
• swap in P2
• swap in P4
• swap in P5
swap used map
3
6
3
P1
8
12
4
P2
16 64 48
swap space
0
P3
P4
P5
P6
64M
Why does UNIX manage swap space differently from the file system?
106
V. Inter-Process Communication
107
Inter-Process Communication (IPC)





synchronization
pipes
communication via files
signal
System V IPC
 message queue
 shared memory
 semaphore

IPC with sockets
108
synchronization

parallelism
 multiprocessor (true parallelism) or time sharing (quasi-parallelism)
 race condition : more than one process wants to access the same resource
 shared resource

mutual exclusion
 only one process can exclusively access a shared resource at a time
 critical section : a portion of a program that accesses a shared resource
 representative mechanism: ipl, lock, semaphore, test&set

deadlock
109
synchronization (Cont’)
example of race condition I

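#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

static void charatatime(char *str);

int main(void)
{
    pid_t pid;
    if ((pid = fork()) == 0) {             /* child */
        charatatime("output from child\n");
    } else {                               /* parent */
        charatatime("output from parent\n");
    }
    exit(0);
}
static void charatatime(char *str)
{
    char *ptr; int c;
    setbuf(stdout, NULL);                  /* unbuffered: one write per character */
    for (ptr = str; (c = *ptr++) != 0; )
        putc(c, stdout);
}
(Source : Adv. programming in the UNIX Env. pgm 8.7)

guess what the results are?
sample output (the two processes interleave):
outpuot utfprut froom chmild
parent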
synchronization (Cont`)

system internals
task structure
fd
file structure
inode
f_pos
shared resource
fd
111
synchronization (Cont`)

example of race condition II
 scenario



process P1 is currently being dispatched (being removed from the ready queue)
disk interrupt occurs
disk interrupt handler wakes up process P2 and wants to insert it into the ready queue
RQ
P2
RQ
RQ
P4
P1
P4
P1
P4
P1
P3
P3
P3
112
synchronization (Cont`)
ipl (interrupt priority level)

BSD            SVR4             Purpose
spl0           spl0             enable all interrupts
splsoftclock   spltimeout       disable functions scheduled by timers
splnet         -                disable network protocol processing
-              splstr           disable STREAMS interrupts
spltty         spltty           disable terminal interrupts
splbio         spldisk          disable disk interrupts
splclock       -                disable hardware clock interrupt
splhigh        spl7 or splhi    disable all interrupts
splx           splx             restore ipl to previously saved value
113
synchronization (Cont`)

lock
 associate lock variable to each shared resource
 lock before (unlock after) the critical section
 spin_lock primitive
void spin_lock(spinlock_t *s) {
while (test_and_set (s) != 0)
;
}
void spin_unlock (spinlock_t *s) {
*s = 0;
}
(Source : UNIX internals)
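The spin_lock primitive above depends on an atomic test_and_set. A minimal user-space sketch, assuming a C11 compiler and using atomic_flag as the test-and-set primitive, could look like this. The names uspinlock_t, uspin_init, uspin_trylock, uspin_lock and uspin_unlock are hypothetical and are not taken from the slides or from any kernel.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the slide's spinlock_t. */
typedef struct {
    atomic_flag locked;
} uspinlock_t;

void uspin_init(uspinlock_t *s)
{
    atomic_flag_clear(&s->locked);
}

/* One test-and-set attempt: returns true if the lock was acquired. */
bool uspin_trylock(uspinlock_t *s)
{
    /* atomic_flag_test_and_set returns the PREVIOUS value of the flag */
    return !atomic_flag_test_and_set_explicit(&s->locked,
                                              memory_order_acquire);
}

/* Busy-wait until the flag was clear at the moment we set it. */
void uspin_lock(uspinlock_t *s)
{
    while (!uspin_trylock(s))
        ;
}

void uspin_unlock(uspinlock_t *s)
{
    atomic_flag_clear_explicit(&s->locked, memory_order_release);
}
```

Spinning only pays off when the lock is held for a very short time, which is one reason the next slides contrast it with the sleep lock.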
114
synchronization (Cont`)
 sleep_lock
process wants resource
lock the resource
No
is it locked?
Yes
use resource
sleep on resource
unlock resource
awakened by any process
Yes
wake up all waiting processes
does anyone want it?
No
continue other processing

spin lock or sleep lock, lock granularity, rw_lock (try_lock)
115
synchronization (Cont`)

semaphore
 an object that can be accessed only through the P and V (and sem_initialize) operations
 semaphore primitive
void initsem (semaphore_t *sem, int val) {
*sem = val;
}
void P (semaphore_t *sem) {
*sem -= 1;
while (*sem < 0)
sleep;
}
void V (semaphore_t *sem) {
*sem += 1;
if (processes slept on sem queue)
wake up the processes slept on sem;
}
(Source : UNIX internals)
116
synchronization (Cont`)

semaphore : example
client
server
shared memory
remove an item from
shared memory
produce an item
put the item into
shared memory
consume the item
117
synchronization (Cont`)

semaphore : example
client
server
sem1, sem2
shared memory
produce an item
initsem(sem1, 5)
initsem(sem2, 0)
P(sem1)
P(sem2)
put the item into
shared memory
remove an item from
shared memory
V(sem1)
V(sem2)
consume the item
118
synchronization (Cont`)

semaphore in the linux kernel
 widely used to ‘wait until a condition is met’ (e.g. reading disk blocks)
 semaphore /* include/asm-i386/semaphore.h, kernel/sched.c */

declare semaphore for each shared resource
struct semaphore {
atomic_t count;
struct wait_queue *wait;
}
void down (struct semaphore *sem) {
while (sem->count <= 0)
sleep_on (&sem->wait);
sem->count--;
}
void up (struct semaphore *sem) {
sem->count++;
wake_up (&sem->wait);
}
119
down(x)
critical section
up(x)
down(x)
critical section
up(x)
process 1
process 2
shared resource
struct semaphore *x
synchronization (Cont`)

semaphore in the linux kernel
 sleep, wakeup /* include/linux/wait.h kernel/sched.c */
struct wait_queue {
    struct task_struct *task;
    struct wait_queue *next;
};
void sleep_on (struct wait_queue **queue) {
    struct wait_queue entry = {current, NULL};
    current->state = TASK_UNINTERRUPTIBLE;
    add_wait_queue (queue, &entry);
    schedule();
    remove_wait_queue(queue, &entry);
}

/* simplified: assumes a non-empty circular wait queue */
void wake_up (struct wait_queue **queue) {
    struct wait_queue *p = *queue;
    do {
        p->task->state = TASK_RUNNING;
        add_to_runqueue(p->task);
        p = p->next;
    } while (p != *queue);
}
interruptible_sleep_on(), wake_up_interruptible()
120
synchronization (Cont`)

Deadlock
 a system state in which processes wait for events that never occur.
process 1
resource 1
process 2
resource 2
process 3
resource 3
resource 4
process 4
121
synchronization (Cont`)

Deadlock
 deadlock prevention
 deadlock avoidance
 deadlock detection and correction
reduction of resource allocation graph
R1
R1
R1
P2
P1
P2
P1
P3
R2
R1
P2
P2
P1
P1
P3
P3
R2
122
P3
R2
R2
pipe

named pipe, unnamed pipe
 pipe(fd[]), mkfifo(path, mode), mknod(path, mode, dev_t)
process 1
process 2
write fd
write fd
read fd
pipe
kernel
 no indirect blocks in inode
 rd_pointer, wr_pointer, number of readers, number of writers
123
S_IFREG
S_IFCHR
S_IFBLK
S_IFIFO
pipe

pipe(unnamed pipe)
 limit




cannot broadcast
no object boundaries
cannot direct data to a specific reader
FIFO(named pipe)




FIFO file
must be explicitly deleted(unlink)
named
less secure than pipe
124
pipe (Cont`)
 example of pipe : “% ls -l | more”
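for (;;) {
    read_command();
    parsing_command();
    pipe(fd);
    if (fork() == 0) {          /* child */
        close(stdin);
        dup(fd[0]);             /* stdin <- read end of the pipe */
        if (fork() == 0) {      /* grandchild */
            close(stdout);
            dup(fd[1]);         /* stdout -> write end of the pipe */
            exec("ls", …);
        }
        exec("more", …);
    }
    wait();
}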
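The shell loop above is pseudocode. As a runnable illustration of the same pipe() and fork() mechanism, the sketch below has a child write a string into an unnamed pipe and the parent read it back. The helper name pipe_roundtrip is hypothetical, and dup() and exec() are left out to keep the example short.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child writes msg into the pipe, parent reads it into out.
 * Returns the number of bytes read, or -1 on error. */
ssize_t pipe_roundtrip(const char *msg, char *out, size_t cap)
{
    int fd[2];
    pid_t pid;
    size_t total = 0;
    ssize_t n;

    if (cap == 0 || pipe(fd) == -1)
        return -1;
    pid = fork();
    if (pid == -1) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }
    if (pid == 0) {                 /* child: the writer */
        close(fd[0]);               /* it does not read */
        if (write(fd[1], msg, strlen(msg)) < 0)
            _exit(1);
        close(fd[1]);
        _exit(0);
    }
    close(fd[1]);                   /* parent must close its write end,
                                       otherwise read() never sees EOF */
    while (total < cap - 1 &&
           (n = read(fd[0], out + total, cap - 1 - total)) > 0)
        total += (size_t)n;
    out[total] = '\0';
    close(fd[0]);
    waitpid(pid, NULL, 0);
    return (ssize_t)total;
}
```

The read loop stops at end-of-file, which is only reported once every write descriptor for the pipe has been closed. The kernel tracks this through the "number of writers" counter mentioned on the earlier pipe slide.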
125
Communication via files

the oldest way of exchanging data among processes
P
P
file

a race condition may occur
 one process reads data before the other has finished modifying it
 mandatory or advisory locking
 lockf, flock, fcntl

fcntl(fd, cmd, arg)
flock structure
l_type
l_whence
l_start
l_len
l_pid
F_GETLK, F_SETLK, …...
126
F_RDLCK, F_WRLCK,
F_UNLCK,
F_SHLCK, F_EXLCK
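The flock fields and the F_GETLK / F_SETLK commands listed above fit together roughly as in the following sketch of advisory locking. The helper names lock_whole_file and probe_write_lock are hypothetical; only fcntl() and struct flock come from the standard interface.

```c
#include <fcntl.h>
#include <unistd.h>

/* Apply an advisory lock of the given type (F_RDLCK, F_WRLCK or F_UNLCK)
 * to the whole file. Returns 0 on success, -1 on error (errno is set). */
int lock_whole_file(int fd, short type)
{
    struct flock fl = {0};

    fl.l_type = type;
    fl.l_whence = SEEK_SET;     /* l_start is relative to the file start */
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "up to end of file" */
    return fcntl(fd, F_SETLK, &fl);     /* F_SETLK does not block */
}

/* Ask whether a write lock on the whole file would be blocked by another
 * process. Returns F_UNLCK if nothing conflicts, otherwise the type of
 * the conflicting lock, or -1 on error. */
int probe_write_lock(int fd)
{
    struct flock fl = {0};

    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    if (fcntl(fd, F_GETLK, &fl) == -1)
        return -1;
    return fl.l_type;
}
```

Locks held by the calling process itself never count as conflicts, so probe_write_lock reports F_UNLCK even while the same process holds a write lock on the file.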
Communication via files (Cont`)
 A deadlock scenario with file locking
file
P

P
In Linux, fcntl() returns the error EDEADLK (also spelled EDEADLOCK on i386)
127
Signal
 register signal handler (signal catch function )
 send signal
 signal detection : state transition from kernel running to user running
 call signal handler
 variables for signal in task structure



int sigpending : is signal received or not?
struct signal_struct *sig
sigset_t signal, blocked
typedef struct {
unsigned long sig[_NSIG_WORDS];
} sigset_t; /* asm-i386/signal.h */
struct sigaction /* asm-i386/signal.h */
struct signal_struct /* sched.h */
count
action[_NSIG]
siglock
128
sa_handler
sa_flags
sa_restorer
sa_mask
System V IPC

Message, Shared Memory, and Semaphore

Common properties
 Key => id (cf: file name => fd)
 In kernel, ***id_ds for System V IPC (eg: msqid_ds)
 ipc_perm: key, uid, cuid, access mode, …
 ipcs, ipcrm

Difference
 message : suitable for the Object-Oriented concept
 shared memory : fast
 semaphore : for user level synchronization
129
System V IPC (Cont`)

message queue
 msqid = sys_msgget (key, flag)
 sys_msgsnd (msqid, msgp, msgsz, flag)
 sys_msgrcv (msqid, msgp, msgsz, msgtype, flag)
 sys_msgctl(msqid, cmd, msqid_ds)
senders
struct
msqid_ds
P
/* create */
/* send */
/* receive */
/* control */
receivers
P
P
msg
msg
msg
P
P
130
System V IPC (Cont`)
 struct msqid_ds
P
P
P
msg_perm
msg_first
msg_last
msg_stime
msg_rtime
msg_ctime
wwait_queue
rwait_queue
msg_cbytes
msg_qnum
msg_qbytes
msg_lspid
msg_lrpid
msg_next
msg_type
msg_spot
msg_ts
msg_next
msg_type
msg_spot
msg_ts
msgtype in sys_msgrcv()
=0 : receive the first msg in the queue
>0 : receive the first msg of the given type
<0 : receive the first msg of the lowest type that is <= |msgtype|
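The three msgtype cases can be tried from user level with the msgget / msgsnd / msgrcv library calls. The sketch below assumes a Linux system with System V IPC enabled. The message layout demo_msg and the mq_* helper names are hypothetical.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct demo_msg {               /* hypothetical message layout */
    long mtype;                 /* message type, must be > 0 */
    char mtext[32];
};

int mq_create(void)
{
    return msgget(IPC_PRIVATE, IPC_CREAT | 0600);
}

int mq_send(int id, long type, const char *text)
{
    struct demo_msg m;

    m.mtype = type;
    strncpy(m.mtext, text, sizeof m.mtext - 1);
    m.mtext[sizeof m.mtext - 1] = '\0';
    /* the size argument excludes the leading mtype field */
    return msgsnd(id, &m, sizeof m.mtext, IPC_NOWAIT);
}

/* Receive one message selected by msgtyp (=0, >0 or <0 as listed above).
 * Returns the type of the received message, or -1 if none matches. */
long mq_recv(int id, long msgtyp, char *out, size_t cap)
{
    struct demo_msg m;

    if (cap == 0 ||
        msgrcv(id, &m, sizeof m.mtext, msgtyp, IPC_NOWAIT) == -1)
        return -1;
    strncpy(out, m.mtext, cap - 1);
    out[cap - 1] = '\0';
    return m.mtype;
}

int mq_destroy(int id)
{
    return msgctl(id, IPC_RMID, NULL);
}
```

With messages of type 3, 1 and 2 queued in that order, mq_recv(id, 2, ...) takes the type 2 message, mq_recv(id, -3, ...) then takes the type 1 message, and mq_recv(id, 0, ...) takes what is left at the head of the queue.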
131
System V IPC (Cont`)

shared memory
 shmid = sys_shmget (key, size, flag)
 sys_shmat (shmid, shmaddr, shmflag, raddr)
 sys_shmdt (shmaddr)
 sys_shmctl(shmid, cmd, shmid_ds)
struct shmid_ds
shm_perm
shm_segsz
shm_atime
shm_dtime
shm_ctime
shm_cpid
shm_lpid
shm_nattach
shm_npage
shm_pages /* for page table entries */
attaches
/* struct vm_area_struct */
132
System V IPC (Cont`)
 using shared memory
vm area of
process A
vm area of
process B
kernel
stack
kernel
stack
0xa27e8000
0x77ed000
0xa27e0000
heap
data
text
heap
data
shared memory
region
133
text
0x77e5000
System V IPC (Cont`)

semaphore
 semid = sys_semget (key, nsems, flag)
 semop (semid, sops, nsops)
 semctl(semid, semnum, cmd, *arg)
struct sembuf sops;
struct sembuf {
unsigned short sem_num;
short sem_op;
short sem_flg;
}
if (sem_op > 0)
V() operation
else if (sem_op < 0)
P() operation
134
struct semid_ds
sem_perm
sem_otime
sem_ctime
sem_base
sem_pending
……
sem_nsems
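The semget / semop / semctl calls can be wrapped into P and V operations as in the following sketch. It assumes Linux, where the caller has to supply the union passed to semctl. The names demo_semun, sv_create, sv_try_P, sv_V and sv_remove are hypothetical, and IPC_NOWAIT is used so that P fails instead of sleeping.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Linux does not define the semctl() argument union in its headers;
 * this is a local, hypothetical equivalent of "union semun". */
union demo_semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Create a set with one semaphore and give it an initial value. */
int sv_create(int initial)
{
    union demo_semun arg;
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

    if (id == -1)
        return -1;
    arg.val = initial;
    if (semctl(id, 0, SETVAL, arg) == -1) {
        semctl(id, 0, IPC_RMID);
        return -1;
    }
    return id;
}

static int sv_change(int id, short delta)
{
    struct sembuf op;

    op.sem_num = 0;             /* first (and only) semaphore in the set */
    op.sem_op = delta;          /* > 0 is V(), < 0 is P() */
    op.sem_flg = IPC_NOWAIT;    /* fail with EAGAIN instead of sleeping */
    return semop(id, &op, 1);
}

int sv_try_P(int id) { return sv_change(id, -1); }
int sv_V(int id)     { return sv_change(id, +1); }
int sv_remove(int id) { return semctl(id, 0, IPC_RMID); }
```

Dropping IPC_NOWAIT turns sv_try_P into a blocking P(): the caller then sleeps until another process performs V() on the same semaphore.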
socket

socket
 common interface for IPC and networking
 Protocol family: UNIX, INET, AX25, IPX, Appletalk

layer structure of a network
BSD socket
INET
TCP
UDP
IP
PLIP
SLIP
parallel
port
serial
port
ETHERNET
Ethernet
card
135
ARP
socket (Cont`)

information for communication
 5-tuple {protocol, local-addr, local-process, foreign-addr, foreign-process}

C library routines
 socket() : protocol, make socket structure
 bind() : assign local-addr and local-process
 connect() : foreign-addr, foreign-process
 listen() : waiting in server
 accept() : make connection to a client
 read(), write()
 send(), sendto(), recv(), recvfrom()
cf) system call: sys_socketcall
/* net/socket.c */
136
socket (Cont`)

socket structure
file
….
f_dentry
….
f_pos
f_op
/* net/socket.c */
sock_lseek
sock_read
sock_write
NULL
sock_poll
sock_ioctl
NULL
sock_no_open
….
/* include/linux/net.h */
struct socket {
state
type
flags
ops /* proto_ops */
sk /* struct sock */
files, inodes
next, wait, ….
}
137
/* include/net/sock.h */
struct sock {
...
}
/* include/linux/net.h */
struct proto_ops {
family
dup, release,
bind, connect,
accept, listen,
...
getsockopt
setsockopt
sendmsg
recvmsg
}
/* for INET operation */
socket (Cont`)

connection oriented protocol
server
socket()
bind()
listen()
client
accept()
socket()
blocks until connection from a client
connection established
connect()
write()
read()
data (request)
processing request
write()
data (reply)
138
read()
socket (Cont`)

connectionless protocol
server
socket()
client
bind()
socket()
recvfrom()
bind()
blocks until data received from a client
sendto()
data (request)
processing request
sendto()
data (reply)
139
recvfrom()
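The connectionless exchange above can be reproduced inside a single process over the loopback interface. The sketch below assumes IPv4 and UDP. The function name udp_loopback_once is hypothetical, and the client relies on the implicit bind performed by sendto() instead of calling bind() itself.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send msg from a "client" socket to a "server" socket on 127.0.0.1 and
 * copy what the server received into out.
 * Returns the number of bytes received, or -1 on error. */
ssize_t udp_loopback_once(const char *msg, char *out, size_t cap)
{
    struct sockaddr_in addr, from;
    socklen_t len = sizeof addr, fromlen = sizeof from;
    ssize_t n = -1;
    int srv = socket(AF_INET, SOCK_DGRAM, 0);   /* server: socket() */
    int cli = socket(AF_INET, SOCK_DGRAM, 0);   /* client: socket() */

    if (srv == -1 || cli == -1 || cap == 0)
        goto done;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                          /* kernel picks a port */
    if (bind(srv, (struct sockaddr *)&addr, sizeof addr) == -1)
        goto done;                              /* server: bind() */
    if (getsockname(srv, (struct sockaddr *)&addr, &len) == -1)
        goto done;                              /* learn the chosen port */

    if (sendto(cli, msg, strlen(msg), 0,        /* client: sendto() */
               (struct sockaddr *)&addr, sizeof addr) == -1)
        goto done;

    n = recvfrom(srv, out, cap - 1, 0,          /* server: recvfrom() */
                 (struct sockaddr *)&from, &fromlen);
    if (n >= 0)
        out[n] = '\0';
done:
    if (srv != -1)
        close(srv);
    if (cli != -1)
        close(cli);
    return n;
}
```

The from address filled in by recvfrom() is what a real server would pass to its own sendto() to deliver the reply.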
TLI

connection oriented protocol
server
client
t_open()
t_open()
t_bind()
t_bind()
t_listen()
t_connect()
wait for connection
connection request
t_accept()
t_rcv()
data (request)
t_snd()
data (reply)
t_rcv()
processing request
t_snd()
140
VI. I/O System (Device Driver)
141
Role of a device driver

handle data movement between memory and peripheral devices
 usually written by a third-party
P
P
P
P
system call interface
kernel
file system
device driver interface (through devsw table)
tty
driver
disk
driver
142
network
driver
Peripheral Device: General Structure

H/W configuration
 extremely hardware dependent
 controller


CSR (Control and Status Register)
- driver writes to the CSRs to issue commands to the device and
reads CSRs to obtain completion status or error condition
- memory mapped I/O, special in/out instructions (e.g. the 80x86 in/out instructions)
- programmed I/O (tty, modem, printer), DMA (disk)
internal buffer
 device itself
143
Disk Driver

Disk I/O handling
 convert logical disk block number into physical sector(s)
 handle read/write requests, handle interrupt

disk scheduling
 FCFS
 SSTF (Shortest Seek Time First)
 SCAN
 C-SCAN

…..
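To compare scheduling policies, the total head movement for a request queue can be computed directly. The sketch below covers FCFS and SSTF only. The function names fcfs_seek and sstf_seek are hypothetical, and the limit of 64 requests is an arbitrary choice made to avoid dynamic allocation.

```c
#include <stdbool.h>
#include <stdlib.h>

/* FCFS: serve requests in arrival order; returns total cylinders moved. */
long fcfs_seek(int head, const int *req, size_t n)
{
    long total = 0;

    for (size_t i = 0; i < n; i++) {
        total += labs((long)req[i] - head);
        head = req[i];
    }
    return total;
}

/* SSTF: always serve the pending request closest to the current head
 * position. Returns total cylinders moved, or -1 if n exceeds 64. */
long sstf_seek(int head, const int *req, size_t n)
{
    bool served[64] = { false };
    long total = 0;

    if (n > 64)
        return -1;
    for (size_t done = 0; done < n; done++) {
        size_t best = n;        /* n means "no candidate chosen yet" */

        for (size_t i = 0; i < n; i++) {
            if (served[i])
                continue;
            if (best == n ||
                labs((long)req[i] - head) < labs((long)req[best] - head))
                best = i;
        }
        served[best] = true;
        total += labs((long)req[best] - head);
        head = req[best];
    }
    return total;
}
```

For the queue 98, 183, 37, 122, 14, 124, 65, 67 with the head at cylinder 53, FCFS moves the head 640 cylinders and SSTF 236.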

DMA (channel)
RAID
144
Terminal Driver

interactive : line discipline
 canonical mode, raw mode (stty)
cblock
process
raw queue (clists)
tty_read
canon queue
tty_write
out queue
tty driver
interrupt
xbuf
rbuf
145
CSR
in/out
General structure of Device Driver


well defined entry point
top half, bottom half
character device driver
block device driver
open
open
close
close
read
in/out
strategy
write
ioctl
in/out
size
intr
intr
mmap
 what’s the difference between character and block device driver?
146
Device Switch Table

devsw: table for registering the entry points of device drivers
struct cdevsw {
int (*d_open) ();
int (*d_close) ();
int (*d_read) ();
int (*d_write) ();
int (*d_ioctl) ();
int (*d_mmap) ();
int (*d_segmap) ();
int (*d_xpoll) ();
int (*d_xhalt) ();
struct streamtab *d_str;
struct ttytab *d_tty;
….
} cdevsw[];
struct bdevsw {
int (*d_open) ();
int (*d_close) ();
int (*d_strategy) ();
int (*d_size) ();
int (*d_xhalt) ();
….
} bdevsw[]
(Source : UNIX Internals)
147
Device Switch Table (Cont`)

Example of switch table
bdevsw
cdevsw
hd_open
hd_close
hd_strategy
con_open con_close con_read con_write con_ioctl
ht_open
ht_close
ht_strategy
tty_open
tty_close
tty_read
tty_write tty_ioctl
cd_open
cd_close
cd_strategy
ed_open
ed_close
ed_read
ed_write ed_ioctl
nulldev
nulldev
mm_read mm_write nulldev
hd_open
hd_close
hd_read hd_write nulldev
dev file
#ls -l /dev/
brw-r--r-- 0 1
brw-r--r-- 0 2
….
brw-r--r-- 0 11
brw-r--r-- 1 0
….
crw-r--r-- 1 0
crw-r--r-- 1 1
….
crw-r--r-- 5 0

hda1
hda2
hdb1
tape
tty0
tty1
rhda1
why do we access disks through character interface?
148
Device Switch Table (Cont`)

example : open
 open("/dev/tty0", O_RDONLY)
proc table
fd
file table
inode
i_dev : c, 1,0
cdevsw
con_open con_close con_read con_write con_ioctl

tty_open
tty_close
tty_read
tty_write tty_ioctl
ed_open
ed_close
ed_read
ed_write ed_ioctl
nulldev
nulldev
mm_read mm_write nulldev
gd_open
gd_close
gd_read gd_write nulldev
(*cdevsw[getmajor(dev)].d_open) (dev, …)
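The dispatch expression above is an indirect call through a table indexed by the major number. A user-space sketch of the idea follows. Everything prefixed demo_ or DEMO_ is hypothetical, including the device number layout (major in the high byte, minor in the low byte) and the two toy drivers.

```c
#include <stddef.h>

#define NCHRDEV 4                       /* hypothetical table size */

typedef unsigned int demo_dev_t;
#define DEMO_MAJOR(dev)    (((dev) >> 8) & 0xff)
#define DEMO_MKDEV(ma, mi) ((demo_dev_t)(((ma) << 8) | (mi)))

struct demo_cdevsw {
    int (*d_open)(demo_dev_t dev);
    int (*d_read)(demo_dev_t dev, char *buf, int n);
};

/* Toy drivers: one always reports end-of-file, one fills with zeros. */
static int nulldev_open(demo_dev_t dev) { (void)dev; return 0; }

static int null_read(demo_dev_t dev, char *buf, int n)
{
    (void)dev; (void)buf; (void)n;
    return 0;
}

static int zero_read(demo_dev_t dev, char *buf, int n)
{
    (void)dev;
    for (int i = 0; i < n; i++)
        buf[i] = 0;
    return n;
}

/* The switch table: the major number selects the row. */
static const struct demo_cdevsw demo_cdevsw[NCHRDEV] = {
    [1] = { nulldev_open, null_read },
    [2] = { nulldev_open, zero_read },
};

int demo_open(demo_dev_t dev)
{
    unsigned int ma = DEMO_MAJOR(dev);

    if (ma >= NCHRDEV || demo_cdevsw[ma].d_open == NULL)
        return -1;                      /* no driver registered */
    return (*demo_cdevsw[ma].d_open)(dev);
}

int demo_read(demo_dev_t dev, char *buf, int n)
{
    unsigned int ma = DEMO_MAJOR(dev);

    if (ma >= NCHRDEV || demo_cdevsw[ma].d_read == NULL)
        return -1;
    return (*demo_cdevsw[ma].d_read)(dev, buf, n);
}
```

The caller never names a driver function; it only supplies a device number, which is the property that lets new drivers be added by filling in another row of the table.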
149
Device Switch Table (Cont`)

install new device driver
 write the new device driver and link it into the kernel

my_open(), my_read(), my_write(), my_close(), ….
 register devsw table
 make special file
# mknod /dev/mydrv [b|c] major_number minor_number
150
Device Switch Table (Cont`)

control flow
user mode
read()
kernel
queue
devsw table
wakeup
sleep
interrupt
handler
driver
IVT
device

where does the requesting process sleep?
151
STREAM

full-duplex data transfer and processing path
 consists of a pair of queues
user application
STREAM head
user
kernel
W
R
W
R
STREAM module
W
R
W
R
STREAM driver
hardware
152
STREAM (Cont`)
user
user
STREAM head
STREAM head
TCP
UDP
IP
IP
token ring
ethernet
user
user
user
STREAM head STREAM head STREAM head
TCP
UDP
IP
ATM
Reusable Module
Multiplexing
153
DQDB
STREAM (Cont`)

STREAM features
 transparency among the queues
 reusable
 multiplexing
 message based communication
 virtual copying
 STREAM scheduler : priority bands
154
Part II. Detailed Study:
Linux Kernel Internals
155
Contents






why Linux?
where is everything (kernel source code) ?
kernel configure and compile
system call implementation
module programming
some important kernel data structures
156
References









M. Beck, H. Bohme, M. Dziadzka, U. Kunitz, R. Magnus, D. Verworner, “Linux Kernel Internals, 2nd Ed”, Addison-Wesley, 1997
Fred Butzen, Christopher Hilton, “The LINUX Network”, The M&T Books Slackware Series, 1998
Remy Card, etc, “The LINUX KERNEL Book”, John Wiley & Sons, 1998
A. Rubini, “LINUX Device Drivers”, O’REILLY, 1998
Anonymous, “Maximum Linux Security (A Hacker’s Guide To Protecting Your Linux Server and WS)”, SAMS Publishing, 1999
http://www.linux.org/
http://www.kernel.org/
http://kldp.org/
/usr/src/linux
157
Why Linux?

freely available
 Linus Torvalds, Copyleft
 1991 version 0.01 (November 1999, version 2.2.13)
 Redhat, Debian, Slackware, Alzza
 supported by many companies

Main characteristics
 multi-tasking
 multi-user access
 multi-processor
 support various architecture (80*86, sparc, mips, alpha, smp, ..)
 demand load executables
 paging
 dynamic cache for hard disk
158
Why Linux? (Cont`)

main characteristics (cont`)
 shared library
 support for POSIX 1003.1
 various formats for executable files
 true 386 protected mode
 emulating maths co-processor
 support for national keyboards and fonts
 support diverse file system (ext2, ..)
 TCP/IP, SLIP, PPP
 BSD sockets
 System V IPC
 Virtual Console
159
Why Linux? (Cont`)

drawbacks
 monolithic kernel (micro-kernelizing it is an active research topic)
 not for beginners (for system programmers)
 not well structured (performance-oriented)

Key attraction
 ‘experimenting’ with the system (handle the kernel by yourself)
 supported by many companies
 free: solution business & add on features
 thanks to the INTERNET & GNU (special thanks to Anti-MS feeling)
160
Where is everything?

Linux Operating System Structure
user level
application
System Calls Interface
Central kernel
File System
ext2fs xiafs
minix nfs
iso9660
kernel level
proc
msdos
Buffer Cache
task management
scheduler
signals
memory management
loadable modules
Peripheral Manager
block
hd
network
Network Manager
ipv4
ethernet
…….
character
cdrom isdn
scsi
pci
Machine Interface
Machine
H/W level
(Source : the LINUX KERNEL book)
161
Where is everything? (Cont`)

source structure
 based on version 2.2.5
 under development : the contents described below may be changed
ipc
kernel
lib
mm
scripts
Doc
cdrom
/usr/src/linux
driver
arch
alpha
fs
init
block
include
arm
char
net
net
802
pci
m68k
coda
asm-alpha
appletalk
pnp
mips
ext2
asm-arm
decnet
sbus
ethernet
ppc
sparc
i386
boot
kernel
lib
math-emu
mm
msdos
asm-i386
ipv6
scsi
sound
nfs
linux
unix
video
ntfs
net
sunrpc
ufs
scsi
x25
hpfs
video
162
Where is everything? (Cont`)

main subdirectory
 arch/


architecture dependent codes : arch/i386, arch/alpha, ….
arch/i386/boot/
– bootstrapping
– configure devices, memory

arch/i386/kernel/
– kernel entry point handling (trap/interrupt handling)
– context switch

arch/i386/mm/
– machine dependent memory management code
 init/



all the functions needed to start the kernel
hand-made process 0 (init_task or task[0])
fork process 1, 2, 3, ...
163
Where is everything? (Cont`)

main subdirectory
 kernel/ (arch/i386/kernel)





central section of the kernel
main system call implementation (fork, exit, etc.)
time management
scheduler
signal handling
 mm/


virtual memory interface
paging, kernel memory management
 fs/


virtual file system interface
implementations of the various file systems (ext2, nfs,...)
164
Where is everything? (Cont`)

main subdirectory
 drivers/








drivers for hardware components
drivers/block/ : block-oriented driver(hard disks)
drivers/cdrom/ : proprietary CD-ROM drives
drivers/char/ : character-oriented driver (serial ports, tty, modem, ..)
drivers/net : network cards
drivers/pci/ : PCI bus access and control
drivers/scsi/ : SCSI interface
drivers/sound/ : sound card drivers
 ipc/


classical inter-process communication
semaphores, shared memory, message queues
165
Where is everything? (Cont`)

main subdirectory
 net/


various network protocol implementations : TCP/IP, ARP, ...
code for sockets to the UNIX and Internet domains
 lib/

some standard kernel library functions (printk)
 modules/


kernel module files
modules can be added to the kernel later (insmod, rmmod)
 include/



commonly included kernel-specific header files
include/asm-i386/ : architecture-dependent header files for Intel CPU
include/linux/ : Linux kernel internal structure (task, inode)
166
Kernel Configuration and Compile

new kernel is generated in three steps
1. configure (Documentation/Configuration.help, see chapter 3 of “The
LINUX Network”)


make config (menuconfig, xconfig)
make oldconfig
2. depend

make dep (make clean:optional)
3. compile

make zImage
cf) - make zdisk (#dd bs=8192 if=$(BOOTIMAGE) of=/dev/fd0)
- make zlilo (#cat $(BOOTIMAGE) > $(INSTALL_PATH)/vmlinuz)
/etc/lilo.conf
- #mkbootdisk --device /dev/fd0 zImage
167
Add New System Call

System Call : Control flow in Linux
Kernel
user process
sys_call_table /* arch/i386/kernel/entry.S */
do system call
real system call function
libc.a
idt_table /* arch/i386/kernel/traps.c*/
push args
save system call number
make trap
system call handler
system_call () /*arch/i386/kernel/entry.S */
catch trap through IDT
call real handler function
using sys_call_table
168
Add New System Call (Cont`)

IDT (Interrupt Descriptor Table)
 define : include/asm-i386/desc.h, arch/i386/kernel/traps.c, irq.h
 constructed while kernel initialization /*arch/i386/kernel/traps.c, irq.c*/
idt_table
0x0 divide_error()
debug()
nmi()
….
segment_not_present()
….
page_fault ()
….
0x20 timer_interrupt()
common trap handler for 80*86
FIRST_EXTERNAL_VECTOR
device interrupt handler (IRQ)
hd_interrupt()
….
SYSCALL_VECTOR
0x80 system_call()
0xff
….
169
Add New System Call (Cont`)

sys_call_table
sys_call_table
 syscall number : include/asm-i386/unistd.h
#define __NR_exit 1
#define __NR_fork 2
#define __NR_read 3
….
#define __NR_vfork 190
 sys_call_table : arch/i386/kernel/entry.S
ENTRY(sys_call_table)
.long SYMBOL_NAME(sys_ni_syscall) /* 0 */
.long SYMBOL_NAME(sys_exit) /* 1 */
.long SYMBOL_NAME(sys_fork) /* 2 */
.long SYMBOL_NAME(sys_read) /* 3 */
….
.long SYMBOL_NAME(sys_vfork) /* 190 */
.rept NR_syscalls-190
0 sys_ni_syscall()
1 sys_exit()
2 sys_fork()
3 sys_read()
4 sys_write()
…..
190 sys_vfork()
….
255
170
Add New System Call (Cont`)

put them altogether : example of fork
Kernel
user process
main()
{
….
fork()
}
IVT
0x0 divide_error()
debug()
libc.a
….
fork()
{
….
movl $2, %eax
int $0x80
….
}
….
ENTRY(system_call)
/* entry.S */
SAVE_ALL
….
call *SYMBOL_NAME(sys_call_table)(,%eax,4)
….
nmi()
sys_call_table
….
1 sys_exit()
0x80 system_call()
….
2 sys_fork()
sys_fork()
3 sys_read ()
4 sys_write ()
/* arch/i386/kernel/process.c */
….
171
/* kernel/fork.c */
Add New System Call (Cont`)

Syntax of real system call handler in Linux
asmlinkage int sys_fork(regs) /* arch/i386/kernel/process.c */
{
return do_fork(..);
}
int do_fork(..)
/* kernel/fork.c */
{
….
/* create new process */
}
asmlinkage int sys_read(fd, buf, count)
{
…..
/* read data */
}
172
/* fs/read_write.c */
Add New System Call (Cont`)

Example: add new system call1 (too simple example)
1. kernel modification
1-1. allocate syscall number : include/asm-i386/unistd.h
#define __NR_exit 1
….
#define __NR_vfork 190
#define __NR_mysyscall 191
1-2. register sys_call_table : arch/i386/kernel/entry.S
ENTRY(sys_call_table)
…..
.long SYMBOL_NAME(sys_mysyscall)
.rept NR_syscalls-191
173
/* 191 */
Add New System Call (Cont`)
1-3. coding new system call handler
asmlinkage int sys_mysyscall(void)
{
    printk("Hello Linux, I'm in Kernel\n");
    return 0;
}
1-4. kernel rebuild

if you make a new file, you should let it know to make utility
eg) kernel/test.c
modify the following field in Makefile on kernel directory
O_OBJS = sched.o dma.o fork.o ….
… capability.o test.o
174
Add New System Call (Cont`)
2. make user program with new system call
2-1. make user program
#define _syscall0(type, name) \
type name(void) \
{ \
    long __res; \
    __asm__ volatile ("int $0x80" \
        : "=a" (__res) \
        : "0" (__NR_##name)); \
    __syscall_return(type, __res); \
}
/* include/asm-i386/unistd.h */
#include <errno.h>
#include <linux/unistd.h>
_syscall0(int, mysyscall);
int main(void) {
    int i;
    i = mysyscall();
    return 0;
}
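On current Linux systems the _syscallN macros are generally no longer available to user programs, and the C library's syscall() function is the usual way to invoke a system call by number. The sketch below assumes glibc on Linux and uses the existing getpid call, because a private call such as mysyscall only exists on a kernel rebuilt as described above. The name raw_getpid is hypothetical.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke getpid by system call number, bypassing the libc wrapper.
 * syscall() loads the number and the arguments into registers and
 * traps into the kernel, much like the _syscall0 expansion does. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

For a newly added call, the same pattern would pass the number allocated in unistd.h (191 in the example above) instead of SYS_getpid.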
2-2. make library if possible
#ar, ranlib

Just Do It (seeing it a hundred times is no match for typing it once)
175
Add New System Call (Cont`)

add new system call2 : arguments passing
1. kernel modification
1-1 #define __NR_show_mult 192
1-2 .long SYMBOL_NAME(sys_show_mult)
/* 192 */
.rept NR_syscalls-192
1-3 asmlinkage int sys_show_mult(int x, int y, int *res) {
    int error, compute;
    /* verify_area(): include/asm-i386/uaccess.h */
    if ((error = verify_area(VERIFY_WRITE, res, sizeof(*res))))
        return error;
    compute = x*y;
    put_user(compute, res); /* include/asm-i386/uaccess.h */
    return (0);
}
cf) copy_to_user(), copy_from_user() /* include/asm-i386/uaccess.h */
176
Add New System Call (Cont`)

add new system call2 : arguments passing
2-1. make user program
#include <stdio.h>
#include <errno.h>
#include <linux/unistd.h>
_syscall3(int, show_mult, int, x, int, y, int *, result);
int main(void) {
    int ret = 0;
    show_mult(2, 5, &ret);
    printf("Result : %d * %d = %d\n", 2, 5, ret);
    return 0;
}
/* simplified expansion of the _syscall3 line above (include/asm-i386/unistd.h) */
int show_mult (int x, int y, int *result) {
    long __res;
    __asm__ volatile ("int $0x80"
        : "=a" (__res)
        : "0" (__NR_show_mult),
          "b" ((long) (x)), "c" ((long) (y)),
          "d" ((long) (result)));
    if (__res < 0) {
        errno = -__res;
        __res = -1;
    }
    return __res;
}
177
Add New System Call (Cont`)

add new system call3 : some general system calls
 getpid
asmlinkage int sys_getpid() {
    return current->pid;
}
NR_TASKS: number of total concurrent tasks
all tasks connected using double linked list (next_task, next_run)
global variable: init_task, current
task[0]: init_task, task[1]: init process
 nice
asmlinkage int sys_nice(new_priority) {
….
current->priority = new_priority;
}
 pause
asmlinkage int sys_pause() {
current->state = TASK_INTERRUPTIBLE;
schedule();
}
178
Add New System Call (Cont`)
 fork
/* arch/i386/kernel/process.c */
sys_fork()
/* kernel/fork.c */
do_fork()
/* arch/i386/kernel/process.c */
- p = alloc_task_struct()
- task structure initialize
- copy_mm()….
- copy_thread()
- wake_up_process(p)
- return (p->pid)
copy_thread()
….
- p->tss.eax = 0;
- p->tss.eip = ret_from_fork;
/* kernel/sched.c */
/* arch/i386/kernel/entry.S */
ret_from_sys_call()
wake_up_process()
- add_to_runqueue(p);
- current->need_resched = 1
/* kernel/sched.c */
schedule()
if (schedule parent)
else (schedule child)
179
Add New System Call (Cont`)
 exit
/* kernel/exit.c */
sys_exit()
/* kernel/exit.c */
do_exit()
- sem_exit()
- exit_mmap()
- free_page_tables()
- exit_files()
- exit_thread()
….
….
- handling each child process
- current->state=TASK_ZOMBIE
- schedule()
/* kernel/signal.c */
notify_parent()
180
Add New System Call (Cont`)

Project II: add new system call
 get kernel information: want to know about process id, state, process
execution time (system time and user time separately), the number of
page faults, the number of open files, and so on
1. kernel modification
asmlinkage int sys_process_statistics(….) {
….
current->pid, min_flt, maj_flt, times.tms_utime, times.tms_stime
….
}
2. user program
181
Motivation of Module in LINUX

why do we use modules?
 Linux is a monolithic kernel



trivial modifications require kernel to be recompiled
kernel is increasing in size by adding new features
many modules occupy permanent space in memory though they are used
rarely
 module: steps toward micro-kernelized Linux




small and compact kernel
clean kernel
rapid kernel
solution business: components-based Linux
• e.g.: backup tape driver
182
What can be Modules ?

what can be modules?
 possibly anything
 current version
file system
block device driver
character device driver
network device driver
exec domain
binary format
register_filesystem, unregister_filesystem
read_super, put_super
register_blkdev, unregister_blkdev
open, release
register_chrdev, unregister_chrdev
open, release
register_netdev, unregister_netdev
open, close
register_exec_domain, unregister_exec_domain
load_binary, personality
register_binfmt, unregister_binfmt
load_binary
….
cf: /lib/modules/x.x.x/*.o
183
How to manipulate modules?

how to manipulate modules?
 compilation
# gcc -D__KERNEL__ -D_LINUX -DMODULE -c new_module.c
Enable loadable module support (CONFIG_MODULES) [Y/n/?]
…
MSDOS fs support (CONFIG_MSDOS_FS) [M/n/y/?]
 insmod, lsmod, rmmod
#insmod fat
#lsmod
Module: #pages : Used by
fat
6
0
#rmmod fat
 kerneld: for on-demand loading

eg: mount -t msdos /dev/fd0 /mnt => transparent load fat & msdos modules
184
How to implement modules?

Module
 basic two interfaces


init_module()
cleanup_module()
kernel
register_filesystem()
module
insmod
init_module()
register_blkdev()
cleanup_module()
rmmod
register_netdev()
sock_register()
185
How to implement modules? (Cont`)

example1 : Hello world!!
/* hello.c */
#include <linux/kernel.h>
#include <linux/module.h>
int init_module() {
    printk("Hello world!! - I'm in kernel\n");
    return 0;
}
void cleanup_module () {
    printk("Bye world - I'm in kernel\n");
}
# gcc -D__KERNEL__ -D_LINUX -DMODULE -c hello.c
#insmod hello.o
#rmmod hello
186
How to implement modules? (Cont`)

example2 : simple device driver
/* time.c */
#include <linux/kernel.h>
#include <linux/module.h>
#define HOUR_MAJOR 60
#define HOUR_MINOR 0
int time_open(..) {
    ….
}
int time_read(fd, buf, size) {
    …
    copy_to_user(buf, &CURRENT_TIME, ...); /* argument order: to, from, n */
}
/* the handlers are defined first so that the table can refer to them */
struct file_operations time_fops = {
    NULL,
    time_read,
    NULL, NULL, NULL, NULL,
    NULL, time_open, NULL, NULL
};
int time_init() {
    register_chrdev(HOUR_MAJOR, "time", &time_fops);
    printk("time module loaded (major=%d)\n", HOUR_MAJOR);
    return 0;
}
int init_module () {
    return time_init();
}
void cleanup_module () {
    unregister_chrdev(HOUR_MAJOR, "time");
    printk("time module unloaded \n");
}
187
How to implement modules? (Cont`)

example2 : simple device driver
#gcc -D__KERNEL__ -D_LINUX -DMODULE -c time.c
#mknod /dev/time c 60 0
#insmod time
#lsmod
Module: #pages: Used by:
time 1
#cat /dev/time /* print current time */
#rmmod time
 how can the “cat” command invoke the time_read() function ?
188
How to implement modules? (Cont`)

example2 : simple device driver
 register_chrdev()
init_module
/* include/linux/major.h */
time_init()
register_chrdev(HOUR_MAJOR, “time”, &time_fops);
register_chrdev()
- chrdevs[major].name = “time”
- chrdevs[major].fops = time_fops
189
How to implement modules? (Cont`)

example2 : simple device driver
 open
sys_open()
- get_unused_fd()
- fd_install(fd, f)
filp_open()
/* fs/namei.c */
open_namei()
- struct file initialize
- f->f_op->open()
/* fs/device.c */
time_open()
chrdev_open()
pipe_open()
blkdev_open()
socket_open()
nfs_open()
190
- filp->f_op = get_chrfops(MAJOR
(inode->i_rdev));
/* filp->f_op = chrdevs[major].fops */
- filp->f_op->open;
How to implement modules? (Cont`)

example2 : simple device driver
 read
/* fs/read_write.c */
sys_read()
- f->f_op->read
nfs_read()
pipe_read()
time_read()
tty_read()
/* fs/block_dev.c */
block_read()
191
How to implement modules? (Cont`)

example3 : system call wrapper
#include <linux/kernel.h>
#include <linux/module.h>
#include <sys/syscall.h>
#include <linux/sched.h>
#include <asm/uaccess.h>
extern void *sys_call_table[];
int uid;
asmlinkage int (*original_call) (const char *, int, int);
asmlinkage int (*getuid_call) ( );
asmlinkage int our_sys_open(const char *, int, int);
int init_module ( ) {
    original_call = sys_call_table[__NR_open];
    sys_call_table[__NR_open] = our_sys_open;
    printk("Spying on UID: %d\n", uid);
    getuid_call = sys_call_table[__NR_getuid];
    return 0;
}
void cleanup_module ( ){
    /* warn if somebody else has replaced open() in the meantime */
    if (sys_call_table[__NR_open] != our_sys_open)
        printk("open system call was changed by someone else\n");
    sys_call_table[__NR_open] = original_call;
}
192
How to implement modules? (Cont`)

example3 : system call wrapper
asmlinkage int our_sys_open(const char *fname, int flags, int mode) {
    int i=0;
    char ch;
    if (uid == getuid_call()) {
        printk("opened file by %d: ", uid);
        do {
            get_user(ch, fname+i);
            i++;
            printk("%c", ch);
        } while (ch != 0);
        printk("\n");
    }
    return original_call(fname, flags, mode);
}
193
How to implement modules? (Cont`)

example4 : new file system
 design super block
 program file operations, program inode operations
 registering : register_filesystem()
#ifdef CONFIG_MINIX_FS
register_filesystem(&(struct file_system_type)
{minix_read_super, “minix”, 1, NULL});
#endif
 mount
struct file_system_type {
struct super_block *(*read_super) ();
char *name;
int requires_dev;
struct file_system_type *next;
} *file_system;
194
How to implement modules? (Cont`)

Project III
 implement your own modules make file operations



make module interface
make driver
mknod (use pseudo device such as memory)
init_module()
cleanup_module()
mydrv_init()
mydrv
mydrv_open()
mydrv_interrupt()
mydrv_release()
mydrv_out()
mydrv_read()
mydrv_write()
mydrv_ioctl()
195
How to implement modules? (Cont`)

system call for modules
 create_module


memory allocation for module (return load address)
a new element for module_list
 init_module



physical loading of requesting module (module functions become an
integral part of kernel)
relocating module functions and solving references of kernel symbols
call module specific init_module function
 delete_module
 get_kernel_syms

to get kernel symbols
196
How to implement modules? (Cont`)
 Kernel data structure for create_module()
module_list
module
module
next
ref
symtab
name
...
next
ref
symtab
name
...
size
size
references
symbol table
for this module
197
references
symbol table
for this module
Control flow of FS system call

file access under Linux /* include/linux/sched.h, fs.h */
inode
fs_struct
task structure
…
fs
files
...
count
umask
*root
*pwd
inode
file
f_mode
f_pos
f_flag
f_count
f_owner
f_inode
f_op
f_version
file_struct
count
close_on_exec
fd[0]
fd[1]
…
fd[255]
why do we need the file data structure ?
198
inode
file operation
routines
Control flow of FS system call (Cont`)

Why do we need file data structure
=> to support various type of files with single coherent interface
 open
/* fs/open.c */
sys_open()
- get_unused_fd()
- fd_install(fd, f)
/* fs/open.c */
filp_open()
/* fs/namei.c */
open_namei()
- struct file initialize
- f->f_op->open()
/* to support various file */
199
Control flow of FS system call (Cont`)
 struct file /* include/linux/fs.h */
f_next, f_prev
f_dentry
f_op
f_mode
f_pos
f_count
f_flags
f_reada, f_ramax
...
/* to access inode */
/* access type */
/* file offset */
/* reference count */
 file operation example
fs/ext2/file.c
ext2_file_lseek,
generic_file_read,
ext2_file_write
NULL, NULL,
ext2_file_ioctl
generic_file_mmap
NULL, …….
fs/ufs/file.c
ufs_file_lseek,
generic_file_read,
ufs_file_write
NULL, NULL,
NULL,
generic_file_mmap
NULL, …….
fs/nfs/file.c
NULL,
nfs_file_read,
nfs_file_write
NULL, NULL,
NULL,
nfs_file_mmap
nfs_file_open, ……
 where is create()?
200
include/linux/fs.h
lseek()
read()
write()
readdir()
poll()
ioctl()
mmap()
open()
flush()
release()
fsync()
fasync()
…..
fs/pipe.c
pipe_lseek, pipe_read,
pipe_write
NULL, pipe_poll,
pipe_ioctl,
NULL,
pipe_rdwr_open, ...
/* net/socket.c */
sock_lseek
sock_read
sock_write
NULL
sock_poll
sock_ioctl
NULL
sock_no_open
….
fs/devices.c
NULL,
NULL,
NULL,
NULL, NULL,
NULL,
NULL
blkdev_open, …….
Control flow of FS system call (Cont`)
 open
/* fs/open.c */
System call layer
sys_open()
- get_unused_fd()
- fd_install(fd, f)
/* fs/open.c */
filp_open()
- struct file initialize
- f->f_op->open()
/* fs/namei.c */
open_namei()
VFS layer
Specific File layer
iget(), bread()
pipe_rdwr_open()
sock_no_open()
nfs_file_open()
blkdev_open()
chrdev_open()
201
Control flow of FS system call (Cont`)
 read
System call handling
layer
/* fs/read_write.c */
sys_read()
- f->f_op->read
sock_read()
block_read()
pipe_read()
nfs_file_read()
VFS layer
/* mm/filemap.c */
generic_file_read()
tty_read()
Specific File layer
- try to find page in page cache, if (hit) OK.
- get_free_page()
- inode->i_op->readpage()
202
Control flow of FS system call (Cont`)

inode structure in Linux /* include/linux/fs.h, ext2_fs_i.h */
inode
task
….
fd[]
….
file
….
f_dentry
….
f_pos
f_op
dentry
d_inode
inode operation
routines
File specific information
….
i_ino
i_dev
i_count
i_mode
i_nlink
i_uid, gid
……
i_atime, ...
i_rdev
i_op
i_data[15]
i_flags
i_….
203
device driver
Control flow of FS system call (Cont`)
 inode operation example
...
i_op
...
fs/ext2/file.c
ext2_file_operations,
NULL, NULL,
NULL, NULL,
...
generic_readpage
NULL
ext2_bmap,
…….
include/linux/fs.h
def_file_operation
create(), lookup()
link(), unlink(), symlink()
mkdir(), rmdir()
mknod(), rename(),
readlink(), followlink()
readpage(), writepage()
bmap(), truncate(),
…….
fs/ufs/file.c
fs/nfs/file.c
ufs_file_operations,
NULL, NULL,
NULL, NULL,
...
generic_readpage
NULL
ufs_bmap,
…….
nfs_file_operations,
NULL, NULL,
NULL, NULL,
...
nfs_readpage
nfs_writepage
NULL
…….
204
fs/dos/files.c
dos_file_operations,
NULL, NULL,
NULL, NULL,
…
dos_readpage,
dos_writepage,
NULL,
…….
fs/pipe.c
rdwr_pipe_fops,
NULL, NULL,
NULL, NULL,
...
fs/devices.c
def_blk_fops,
NULL, NULL,
NULL, NULL,
...
Control flow of FS system call (Cont`)
 read
System call handling
layer
/* fs/read_write.c */
sys_read()
- f->f_op->read
sock_read()
pipe_read()
VFS layer
block_read()
/* mm/filemap.c */
generic_file_read()
tty_read()
Specific File layer
- try to find page in cache, if (hit) OK.
- inode->i_op->readpage()
nfs_readpage()
/* fs/buffer.c */
/* fs/ext2/inode.c */
ext2_bmap()
/* fs/ufs/inode.c */
ufs_bmap()
generic_readpage()
dos_readpage()
Specific FS layer
coda_readpage()
/* driver/block/ll_rw_blk.c */
ll_rw_block()
/* driver/block/hd.c */
hd_request
205
Device Driver layer
Device Driver Implementation in Linux

data structure
 blkdevs, chrdevs for devsw
 blk_dev_struct for block driver only
file_operations
/* fs/devices.c */
lseek
read, write, readdir
poll, ioctl, mmap,
open, flush, release
fsync, fasync
…..
struct device_struct {
name;
fops;
} chrdevs[], blkdevs[];
/* include/linux/blkdev.h */
struct blk_dev_struct {
request_fn;
queue;
request;
...
} blk_dev[];
206
Driver Implementation in Linux (Cont`)

buffer_head
b_dev
b_blocknr
b_state
b_count
b_size
...
b_next
b_data
data structure (cont`)
chrdevs[]
name
fops
file_operations
blkdev
request
rq_status
rq_dev
cmd
…
sem
bh
tail
next
request_fn
current_request
207
request
rq_status
rq_dev
cmd
…
sem
bh
tail
next
request
Driver Implementation in Linux (Cont`)

Example of structure of driver: IDE disks
hd_init()
hd_open()
hd_interrupt()
hd_release()
hd_out()
driver/block/hd.c
hd_request()
check_status()
hd_ioctl()
NULL,
block_read,
block_write
NULL, NULL,
hd_ioctl,
NULL,
hd_open,
NULL
hd_release,
block_fsync
struct file_operations hd_ops
208
Driver Implementation in Linux (Cont`)
 major number
Major
0
1
2
3
4
5
6
7
8
9
………
23
….
/* include/linux/major.h */
Character devices
Block devices
mem
RAM disk
floppy (fd*)
IDE hard disk (hd* )
terminal
terminal & AUX
Parallel Interface
virtual console (vcs*)
SCSI hard disk (sd*)
SCSI tapes (st*)
Mitsumi CD-ROM (mcd*)
209
Driver Implementation in Linux (Cont`)

initialization of disk driver
 register_blkdev()
init_module
init process
/* driver/block/hd.c */
hd_init()
/* include/linux/major.h */
- register_blkdev(HD_MAJOR, "hd", &hd_fops);
- blk_dev[HD_MAJOR].request_fn = hd_request
/* fs/devices.c */
register_blkdev()
- blkdevs[major].name = device name
- blkdevs[major].fops = fops
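The registration step above amounts to filling one slot of a table indexed by major number. The following is a user-space sketch, not the kernel routine: MAX_BLKDEV, the error codes -1 and -2, the one-member struct file_operations and the model_ prefix are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLKDEV 128            /* assumed table size for this sketch */

/* Cut-down stand-in: only one operation is kept. */
struct file_operations {
    int (*open)(void);
};

struct device_struct {
    const char *name;
    struct file_operations *fops;
};

static struct device_struct blkdevs[MAX_BLKDEV];

/* Model of register_blkdev(): claim a major number for a driver.
 * Returns 0 on success, -1 for an invalid major, -2 if the major is taken. */
int model_register_blkdev(unsigned int major, const char *name,
                          struct file_operations *fops)
{
    assert(name != NULL);
    if (major == 0 || major >= MAX_BLKDEV)
        return -1;                /* dynamic major allocation is not modelled */
    if (blkdevs[major].fops != NULL && blkdevs[major].fops != fops)
        return -2;                /* major already owned by another driver */
    blkdevs[major].name = name;
    blkdevs[major].fops = fops;
    return 0;
}

/* Model of get_blkfops(): the lookup blkdev_open() relies on. */
struct file_operations *model_get_blkfops(unsigned int major)
{
    if (major >= MAX_BLKDEV)
        return NULL;
    return blkdevs[major].fops;
}
```

After model_register_blkdev(3, "hd", &hd_fops), a later open of a device node with major 3 can find the driver through model_get_blkfops(3).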
210
Driver Implementation in Linux (Cont`)

disk driver open
/* fs/open.c */
sys_open()
- get_unused_fd()
- fd_install(fd, f)
/* fs/open.c */
filp_open()
/* fs/namei.c */
open_namei()
- struct file initialize
- f->f_op->open()
/* driver/block/hd.c */
/* fs/devices.c */
hd_open()
blkdev_open()
pipe_open()
chrdev_open()
socket_open()
nfs_open()
211
- filp->f_op = get_blkfops(MAJOR
(inode->i_rdev));
/* filp->f_op = blkdevs[major].fops */
- filp->f_op->open; /* hd_open */
Driver Implementation in Linux (Cont`)

disk driver read
/* fs/read_write.c */
sys_read()
- f->f_op->read
/* mm/filemap.c */
nfs_read()
pipe_read()
generic_file_read()
tty_read()
/* fs/block_dev.c */
block_read()
- getblk(); /* buffer header */
/* driver/block/ll_rw_blk.c */
ll_rw_block()
make_request()
- request structure initialize
add_request()
- call blk_dev[major].request_fn
/* driver/block/hd.c */
hd_request()
212
- hd_out()
Driver Implementation in Linux (Cont`)

queue and requests (similar to message queue)
 requests are sorted by sector number
 inb, outb
/* include/linux/blkdev.h */
struct blk_dev_struct {
request_fn;
queue;
request;
...
} blk_dev[];
bread
block_read
struct request {
rq_status
rq_dev
cmd /* R/W */
error
sector, nr_sector
buffer, bh
sem
next
...
}
request_fn
hd_request
queue
buffer
cache
req
req
ll_rw_block
make_request
213
req
block
device
driver
do I/O
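Keeping requests sorted by sector number can be sketched as an ordered insert into a singly linked list. This is a simplified user-space model: struct request keeps only three fields, model_add_request is a hypothetical name, and details such as merging adjacent requests or protecting the request currently being serviced are left out.

```c
#include <assert.h>
#include <stddef.h>

struct request {
    unsigned long sector;      /* first sector of the transfer */
    unsigned long nr_sectors;  /* length of the transfer */
    struct request *next;
};

/* Insert req so that the list stays ordered by ascending sector number.
 * Returns the (possibly new) head of the queue. */
struct request *model_add_request(struct request *head, struct request *req)
{
    struct request **pp = &head;

    assert(req != NULL);
    /* Walk to the first entry whose sector is larger than the new one. */
    while (*pp != NULL && (*pp)->sector <= req->sector)
        pp = &(*pp)->next;
    req->next = *pp;
    *pp = req;
    return head;
}
```

With an ordered queue the driver's request function can sweep the disk head in one direction instead of seeking back and forth in arrival order.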
Driver Implementation in Linux (Cont`)

various disks and partitions
 gendisk
gendisk_head
gendisk
gendisk
8
major
“sd”
name
minor_shift
max_p
part
….
real_devices
next
214
3
major
“ide0”
name
minor_shift
hd_struct
max_p
part
start_sect
….
nr_sects
real_devices
...
next
...
start_sect
nr_sects
Driver Implementation in Linux (Cont`)

tty driver
 register_chrdev()
init_module
init process
driver/char/tty_io.c
tty_lseek,
tty_read,
tty_write
NULL,
tty_poll
tty_ioctl,
NULL,
tty_open,
NULL
tty_release,
NULL
tty_fasync
/* driver/char/tty_io.c */
tty_init()
/* include/linux/major.h */
- register_chrdev(TTY_MAJOR, "tty", &tty_fops);
/* fs/devices.c */
register_chrdev()
- chrdevs[major].name = device name
- chrdevs[major].fops = fops
215
Driver Implementation in Linux (Cont`)

Example of network driver : 3c509
 different from disk and tty driver

not directly interface with VFS
/* driver/net/3c509.c */
/* driver/net/3c509.c */
el3_init()
ip_output()
ip_rcv()
el3_open()
el3_start_xmit()
el3_out()
el3_stop()
el3_interrupt()
el3_release()
216
Driver Implementation in Linux (Cont`)

Example of network driver : 3c509
/* include/linux/netdevices.h */
struct device {
name
mem_end, mem_start
base addr /* port number */
…
init, destructor
….
device_addr
qdisc /* sk_buff */
….
open, stop
hard_start_xmit, hard_header
…
irq
}
init_module() in 3c509
/* driver/net/3c509.c*/
/* register_netdev() */
init port, irq, …
make dev structure
dev->init=el3_init
dev->open=el3_open
dev->hard_start_xmit =
el3_start_xmit
...
el3_open()
….
request_irq(dev->irq, el3_interrupt
217
Task Scheduling

LINUX scheduling
 clock tick is 10msec, time quantum is 10 clock ticks
 support REAL-TIME task
 variables for scheduling in task structure

p_policy : task type /* include/linux/sched.h */
– SCHED_FIFO, SCHED_RR, SCHED_OTHER

p_priority
– set to DEF_PRIORITY (20) /* include/linux/sched.h */
– can be changed using sys_nice() or sys_setpriority();

p_counter
– decremented on each clock tick
– reset to priority (counter = priority) when the counters of all tasks reach zero


need_resched : need re-scheduling when return from syscall or interrupt
rt_priority
– set using sched_setscheduler(pid, policy, sched_param) system call
– used to set real time tasks (static priority)
218
Task Scheduling (Cont`)
 schedule() function /* kernel/sched.c */
need_resched
sleep_on
schedule
- schedule real time task first (rt_priority)
- select a task which has highest values of
counter + priority (using goodness function)
give advantage to the task which run this_cpu
give slight advantage to the task which has mm object
- if (p_counter == 0) for all task
p_counter = p_priority
- context switch : switch_to (current, next) /* arch/i386/kernel/process.c */
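The selection rule in schedule() can be sketched as a small user-space model. It follows the steps listed above and nothing more: struct task is cut down to four fields, rt_priority > 0 is used as the marker for a real-time task, and the bonuses for this_cpu and for sharing the mm object are omitted. The names model_goodness and model_schedule are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct task {
    int runnable;      /* 1 if the task is on the run queue */
    int counter;       /* remaining ticks in this epoch (p_counter) */
    int priority;      /* static priority (p_priority) */
    int rt_priority;   /* > 0 marks a real-time task in this sketch */
};

/* Weight of a task: real-time tasks always win, a task with an
 * exhausted time slice gets 0, otherwise counter + priority. */
int model_goodness(const struct task *p)
{
    if (p->rt_priority > 0)
        return 1000 + p->rt_priority;
    if (p->counter == 0)
        return 0;
    return p->counter + p->priority;
}

/* Pick the index of the next task to run, or -1 if none can run. */
int model_schedule(struct task *tasks, size_t n)
{
    int pass;

    assert(tasks != NULL || n == 0);
    for (pass = 0; pass < 2; pass++) {
        int best = -1, best_w = 0;
        size_t i;

        for (i = 0; i < n; i++) {
            int w;
            if (!tasks[i].runnable)
                continue;
            w = model_goodness(&tasks[i]);
            if (w > best_w) {
                best_w = w;
                best = (int)i;
            }
        }
        if (best >= 0)
            return best;
        /* Every runnable task used up its slice: start a new epoch. */
        for (i = 0; i < n; i++)
            tasks[i].counter = tasks[i].priority;
    }
    return -1;
}
```

When two tasks have the same weight the one found first is kept, so ties are broken by position in the array; that choice belongs to this sketch only.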
219
Task Scheduling (Cont`)
 Example of scheduling

3 tasks
millisecond   T1                 T2                 T3
              p_pri  p_count.    p_pri  p_count.    p_pri  p_count.
0             20     20          20     20          20     20
10            20     10          20     20          20     20
20            20     10          20     10          20     20
30            20     10          20     10          20     10
40            20     0           20     10          20     10
              20     0           20     0           20     10
              20     20          20     20          20     20
220
Signal

a mechanism for notifying a process of an asynchronous event
 types of signal : SIGKILL, SIGINT, SIGBUS, SIGUSR1, ….
 action : abort, exit, ignore, stop, user level catch function
#include <stdio.h>
#include <signal.h>
#include <unistd.h>

void sig_handler(int signo)
{
    signal(SIGUSR1, sig_handler);             /* reinstall */
    printf("received signal %d\n", signo);    /* handle the signal */
    /* ... */
}

int main(void)
{
    signal(SIGUSR1, sig_handler);             /* install the handler */
    /* ... */
    for ( ; ; )
        pause();                              /* sleep until a signal arrives */
}
 what’s the difference among interrupt, trap, and signal?
221
Signal (Cont`)
 register signal handler (signal catch function )
 send signal
 signal detection : state transition from kernel running to user running
 call signal handler
 variables for signal in task structure



int sigpending : is signal received or not?
struct signal_struct *sig
sigset_t signal, blocked
typedef struct {
unsigned long sig[_NSIG_WORDS];
} sigset_t; /* asm-i386/signal.h */
struct sigaction /* asm-i386/signal.h */
struct signal_struct /* sched.h */
count
action[_NSIG]
siglock
222
sa_handler
sa_flags
sa_restorer
sa_mask
Signal (Cont`)
 register signal catch function
task
….
sig
signal, blocked
sigpending
….
signal_struct
count
action[_NSIG]
siglock
sigset_t
….
63
sigaction
sa_handler
sa_flags
sa_restorer
sa_mask
sigset_t
….
0
/* kernel/signal.c */
sys_signal(sig, handler)
do_sigaction(sig, new_sa, old_sa)
223
63
0
Signal (Cont`)
 send signal
task
….
sig
signal, blocked
sigpending
….
signal_struct
count
action[_NSIG]
siglock
sigset_t
….
63
sigaction
sa_handler
sa_flags
sa_restorer
sa_mask
sigset_t
….
0
63
0
/* kernel/signal.c */
sys_kill(pid,sig)
kill_proc_info(sig, info, pid)
send_sig_info(sig, info, *t)
sigaddset(&t->signal, sig);
t->sigpending = 1;
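Sending and later dequeuing a signal is essentially bit manipulation on two sigset_t bitmaps. The sketch below is a user-space model: every model_ name is hypothetical, MODEL_NSIG is fixed at 64 to match the 0..63 range drawn on the slides, and one behaviour is an assumption of the sketch, namely that a blocked signal is recorded but does not raise sigpending.

```c
#include <assert.h>
#include <limits.h>

#define MODEL_NSIG 64
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_NSIG_WORDS ((MODEL_NSIG + BITS_PER_WORD - 1) / BITS_PER_WORD)

typedef struct {
    unsigned long sig[MODEL_NSIG_WORDS];
} model_sigset_t;

struct model_task {
    int sigpending;            /* is there a deliverable signal? */
    model_sigset_t signal;     /* signals received but not yet handled */
    model_sigset_t blocked;    /* signals the task has blocked */
};

/* Signal numbers are 1-based; bit (sig - 1) represents signal sig. */
void model_sigaddset(model_sigset_t *set, int sig)
{
    assert(sig >= 1 && sig <= MODEL_NSIG);
    set->sig[(sig - 1) / BITS_PER_WORD] |= 1UL << ((sig - 1) % BITS_PER_WORD);
}

void model_sigdelset(model_sigset_t *set, int sig)
{
    assert(sig >= 1 && sig <= MODEL_NSIG);
    set->sig[(sig - 1) / BITS_PER_WORD] &= ~(1UL << ((sig - 1) % BITS_PER_WORD));
}

int model_sigismember(const model_sigset_t *set, int sig)
{
    assert(sig >= 1 && sig <= MODEL_NSIG);
    return (int)((set->sig[(sig - 1) / BITS_PER_WORD]
                  >> ((sig - 1) % BITS_PER_WORD)) & 1UL);
}

/* Model of send_sig_info(): record the signal and flag the task. */
void model_send_sig(struct model_task *t, int sig)
{
    model_sigaddset(&t->signal, sig);
    if (!model_sigismember(&t->blocked, sig))
        t->sigpending = 1;     /* assumption: blocked signals do not set the flag */
}

/* Model of dequeue_signal(): return the lowest-numbered pending,
 * unblocked signal and clear it, or return 0 if there is none. */
int model_dequeue_signal(struct model_task *t)
{
    int sig;
    for (sig = 1; sig <= MODEL_NSIG; sig++) {
        if (model_sigismember(&t->signal, sig) &&
            !model_sigismember(&t->blocked, sig)) {
            model_sigdelset(&t->signal, sig);
            return sig;
        }
    }
    t->sigpending = 0;
    return 0;
}
```

The sender only sets a bit and a flag; the signal is acted on later, when the receiving task checks sigpending on its way back to user mode.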
224
Signal (Cont`)
 signal handling
task
….
sig
signal, blocked
sigpending
….
signal_struct
count
action[_NSIG]
siglock
sigaction
sa_handler
sa_flags
sa_restorer
sa_mask
/* arch/i386/kernel/entry.S */
if (current->sigpending)
do_signal();
/* arch/i386/kernel/signal.c */
do_signal(regs, oldset)
signr = dequeue_signal()
handle SIG_IGN
or SIG_DFL
sigset_t
….
63
0
handle_signal()
sigset_t
….
63
setup stack frame
for signal handler
0
225
Signal (Cont`)
 signal handling: state of stack for handling signal
memory
stack
memory
stack
- return address
- arguments
- return address
- arguments
- return address
to kernel
- return address
to sighandler
- arguments
226
Thread

Motivation (golf course)
 Possibility of parallel processing
 process is too heavy
process model
address space
P
P
P
CPU
P
P
process
time
(Source : UNIX internals)
227
Thread (Cont`)

thread model
address space
thread model
CPU
thread
time
(Source : UNIX internals)


task : a set of threads and a collection of resources (passive)
thread : hardware context, stack, thread information (id, scheduling, ..)
228
Thread (Cont`)

types of threads
 kernel thread
 LWP (lightweight process) : a kernel supported user thread
 user thread : C-thread, P-thread
U
user level scheduler
U
U
U
L
L
K
K
U
U
process (or task)
L
K
K
K
thread scheduler
CPU
CPU
229
Thread (Cont`)

threads in Linux
 struct thread: currently only one in task structure
 sys_clone()


fully share the address context such as page directory
under development
 can use user level thread (P thread)




/usr/include/pthread.h
pthread_create()
pthread_join()
pthread_mutex_init()
230
Thread (Cont`)

Example of thread programming
/* gcc -lpthread */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define L 9
double x[L], y[L];
int cpu;                            /* number of threads, also the loop stride */

typedef struct {
    double volatile *p_s;           /* shared result */
    pthread_mutex_t *p_s_lock;      /* lock protecting *p_s */
    int n;                          /* start offset */
} DATA;

void *SMP_scalprod(void *arg)
{
    register double localsum;
    long i;
    DATA D = *(DATA *)arg;

    localsum = 0.0;
    for (i = D.n; i < L; i += cpu)  /* each thread takes every cpu-th element */
        localsum += x[i] * y[i];
    pthread_mutex_lock(D.p_s_lock);
    *(D.p_s) += localsum;           /* add the partial sum under the lock */
    pthread_mutex_unlock(D.p_s_lock);
    return (NULL);
}

int main(int argc, char *argv[]) {
    pthread_t *thread;
    void *retval;
    int i;
    DATA *A;
    volatile double s = 0;
    pthread_mutex_t s_lock;

    if (argc != 2) {
        printf("USAGE: %s <CPU number>\n", argv[0]);
        exit(1);
    }
    cpu = atoi(argv[1]);
    thread = (pthread_t *)calloc(cpu, sizeof(pthread_t));
    A = (DATA *)calloc(cpu, sizeof(DATA));

    for (i = 0; i < L; i++)
        x[i] = y[i] = i;
    pthread_mutex_init(&s_lock, NULL);

    for (i = 0; i < cpu; i++) {
        A[i].n = i;                 /* start offset */
        A[i].p_s = &s;
        A[i].p_s_lock = &s_lock;
        pthread_create(&thread[i], NULL, SMP_scalprod, &A[i]);
    }
    for (i = 0; i < cpu; i++)
        pthread_join(thread[i], &retval);
    printf("results = %f\n", s);
    return 0;
}
232
232
Data Structure for Virtual Memory

Linux virtual memory structure for each task
 global view /* include/linux/sched.h, mm.h, include/asm-i386/page.h */
task_struct
mm
mm_struct
vm_area_struct
map_count
pgd
vm_end
vm_start
vm_flags
…..
mmap
31
11 0
PFN
page directory
vm_file
vm_offset
vm_ops
vm_next
vm area
(data or parts of data)
vm_area_struct
vm_end
vm_start
vm_flags
…..
vm_file
vm_offset
vm_ops
vm_next
233
vm_area
(text)
Data Structure for Virtual Memory (Cont`)
 struct mm_struct
include/linux/sched.h
struct mm_struct {
struct vm_area_struct *mmap;
struct vm_area_struct *mmap_avl, *mmap_cache;
pgd_t *pgd;
atomic_t count; int map_count;
struct semaphore mmap_sem;
unsigned long context;
unsigned long start_code, end_code, start_data;
unsigned long end_data, start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
unsigned long rss, total_vm, locked_vm, def_flags;
unsigned long swap_cnt, swap_address;
void *segment;
}
include/asm-i386/page.h
typedef struct {unsigned long pgd;} pgd_t;
234
kernel
env_end
arg_end
arg_start
start_stack
stack
brk
end_data
end_code
start_code
bss
data
text
Data Structure for Virtual Memory (Cont`)
 pgd_t
task_struct
mm_struct
mm
map_count
pgd
mmap
31
22 21 12 11
0
DIR PAGE offset
31
11 0
31
11 0
PFN
CR3
11
PFN
PFN
page directory
31
page table
235
0
offset
physical address
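The two-level translation in the figure (DIR in bits 31..22, PAGE in bits 21..12, offset in bits 11..0) can be worked through in ordinary C. This is a user-space sketch: the page directory and page table are plain arrays rather than hardware structures, bit 0 of an entry is taken as the present flag, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT      12
#define PGDIR_SHIFT     22
#define PTRS_PER_TABLE  1024u
#define OFFSET_MASK     0xFFFu

/* Bits 31..22: index into the page directory. */
unsigned pgd_index_of(uint32_t vaddr)
{
    return vaddr >> PGDIR_SHIFT;
}

/* Bits 21..12: index into the page table. */
unsigned pte_index_of(uint32_t vaddr)
{
    return (vaddr >> PAGE_SHIFT) & (PTRS_PER_TABLE - 1);
}

/* Bits 11..0: offset inside the page. */
unsigned page_offset_of(uint32_t vaddr)
{
    return vaddr & OFFSET_MASK;
}

/* Walk a two-level table held in ordinary arrays.
 * pgd[i] points to a page table or is NULL; a page-table entry holds the
 * page frame number in bits 31..12 and a present flag in bit 0.
 * Returns 0 and stores the physical address on success, -1 on a fault. */
int model_translate(uint32_t *const pgd[], uint32_t vaddr, uint32_t *paddr)
{
    const uint32_t *pt;
    uint32_t pte;

    assert(pgd != NULL && paddr != NULL);
    pt = pgd[pgd_index_of(vaddr)];
    if (pt == NULL)
        return -1;                       /* no page table mapped here */
    pte = pt[pte_index_of(vaddr)];
    if (!(pte & 1u))
        return -1;                       /* page not present */
    *paddr = (pte & ~OFFSET_MASK) | page_offset_of(vaddr);
    return 0;
}
```

For the virtual address 0x08048123 the three fields are 32, 72 and 0x123, so the walk reads directory slot 32, then entry 72 of that page table, and finally adds 0x123 to the frame address.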
Data Structure for Virtual Memory (Cont`)
 struct vm_area_struct

need to handle segments (or parts of segment) differently: text/data, share/private
include/linux/mm.h
Virtual Memory Area
struct vm_area_struct {
struct mm_struct *vm_mm;
unsigned long vm_start, vm_end;
struct vm_area_struct *vm_next
pgprot_t vm_page_prot;
unsigned short vm_flags;
short vm_avl_height;
struct vm_area_struct *vm_avl_left;
struct vm_area_struct *vm_avl_right;
struct vm_area_struct *vm_next_share;
PAGE_SHARED (COPY,
READONLY, KERNEL)
struct vm_operations_struct *vm_ops;
unsigned long vm_offset;
struct file *vm_file;
unsigned long vm_pte; /* for SVR4 SM */
}
236
•open(vm_area)
•close(vm_area)
•do_mmap(file, addr, len,
prot, flags, off)
•unmap()
•protect()
•nopage()
•wppage()
•swapout()
•swapin()
Data Structure for Virtual Memory
 execve (final) : usually demand paging under Linux
task_struct
mm
mm_struct
vm_area_struct
map_count
pgd
vm_end
vm_start
vm_flags
…..
vm_file
vm_offset
vm_ops
vm_next
a.out (ELF format)
p_type
p_offset
p_vaddr
p_filesz
p_memsz
p_flags
e_ident
…
e_phnum
mmap
program header 1
program header 2
……
code
data
…….
vm area
vm_area_struct
vm_end
vm_start
vm_flags
…..
open(vm_area),
close(vm_area)
do_mmap(file, addr, len,
prot, flags, off)
unmap()
protect()
nopage(), wppage()
…..
237
vm_file
vm_offset
vm_ops
vm_next
vm_area
Data Structure for Virtual Memory (Cont`)
 struct vm_area_struct: AVL (Adelson-Velskii and Landis) tree
vm_area_struct
40007000
0804b000
0804a000
40087000
40009000
40005000
08053000
40008200
c0000000
400b9000
(Source : the LINUX KERNEL book)
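The reason for keeping the areas in a search tree is the lookup done on every page fault: find the area that covers a faulting address. The sketch below shows that lookup on a plain binary search tree ordered by address. It is a user-space model; struct vm_area and model_find_vma are hypothetical names, and the AVL rebalancing itself is not shown.

```c
#include <assert.h>
#include <stddef.h>

struct vm_area {
    unsigned long vm_start;          /* first address inside the area */
    unsigned long vm_end;            /* first address past the area */
    struct vm_area *left, *right;    /* search tree links, ordered by address */
};

/* Return the lowest area whose vm_end lies above addr, or NULL.
 * The caller still has to check vm_start <= addr to know whether
 * addr is really inside the area or in the gap below it. */
struct vm_area *model_find_vma(struct vm_area *root, unsigned long addr)
{
    struct vm_area *result = NULL;

    while (root != NULL) {
        assert(root->vm_start < root->vm_end);
        if (root->vm_end > addr) {
            result = root;           /* candidate; a lower one may exist */
            if (root->vm_start <= addr)
                break;               /* addr is inside this area */
            root = root->left;
        } else {
            root = root->right;      /* area lies entirely below addr */
        }
    }
    return result;
}
```

On a balanced tree the search visits a number of nodes proportional to the tree height rather than to the number of areas, which matters for tasks with many mappings.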
238
Polling & Interrupt

polling mode
#define LP_B(minor) lp_table[(minor)].base /* IO address */
#define LP_S(minor) inb_p(LP_B((minor)) + 1) /* status port */
#define LP_CHAR(minor) lp_table[(minor)].chars
/* busy timeout */
static int lp_char_polled(lpchar, minor)
{
int status = 0;
int count = 0;
….
status=LP_S(minor);
while ((status & LP_PBUSY) && count < LP_CHAR(minor)) {
count++;
if (need_resched)
schedule();
status=LP_S(minor);
};
….
do timeout error handling if necessary (off-line, out of paper, …)
outb_p(lpchar, LP_B(minor));
…
}
239
Polling & Interrupt (Cont`)

interrupt mode
lp_init()
{
….
request_irq(LP_IRQ, lp_interrupt, 0, "PRINTER");
….
}
static int lp_char(lpchar, minor) {
…
if(…)
outb_p(lpchar, LP_B(minor));
else
interruptible_sleep_on(&lp->lp_wait_q);
...
}
lp_interrupt(int irq, struct pt_regs *regs)
{
….
wake_up_interruptible(&lp->lp_wait_q);
….
}
240
Polling & Interrupt (Cont`)
 Interrupt handling under Linux /* arch/i386/kernel/irq.h, irq.c */
Interrupt_descriptor[]
0
1
status
handler
action
depth
2
status
handler
action
depth
irqaction
handler
flags
name
dev_id
….
next
irqaction
handler
flags
name
dev_id
….
next
241
irqaction
handler
flags
name
dev_id
….
next
Polling & Interrupt (Cont`)
 default IRQ of ISA PC /* arch/i386/kernel/irq.h, irq.c */
IRQ   Assignment
0     System timer
1     Keyboard controller
2     Second IRQ controller
3     Serial port 2 (COM2)
4     Serial port 1 (COM1)
5     Line printer 2 (LPT2)
6     Floppy-disk controller (controls two disks)
7     Line printer 1 (LPT1)
8     Real-time clock
9     Redirected IRQ2
10    Unused
11    Unused
12    Motherboard (PS/2) mouse port
13    Mathematics coprocessor
14    Hard-disk (IDE) controller 1 (controls two disks)
15    Hard-disk (IDE) controller 2 (controls two disks)
242
Bottom Half Handling

What is bottom half
 to defer long-running work so that the interrupt handler itself stays short
 top half : request_irq
 bottom half : mark_bh(), init_bh() with bh_base data structure
bh_mask_count[32];
struct bh_struct {
void (*routine)();
void *data;
} bh_base[32];
enum {
TIMER_BH,
CONSOLE_BH,
…
KEYBOARD_BH,
…
}
243
Bottom Half Handling (Cont`)

example of bottom half
kbd_init()
{
….
request_irq(KEYBOARD_IRQ, kbd_interrupt, 0, "KBD");
bh_base[KEYBOARD_BH].routine = kbd_bh;
….
}
kbd_interrupt(int irq, struct pt_regs *regs)
{
….
mark_bh(KEYBOARD_BH);
….
}
kbd_bh() /* called from ret_from_syscall */
{
do KBD interrupt handling
}
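The split between marking a bottom half and running it later can be modelled in a few lines. This is a user-space sketch, not the kernel code: bh_active as a pending bitmask, the model_ function names, and the sample routine sample_kbd_bh with its kbd_events counter are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NR_BH 32

static void (*bh_base[NR_BH])(void);   /* registered bottom-half routines */
static unsigned long bh_active;        /* bit n set = bottom half n is pending */

/* Hypothetical sample routine: counts how often the deferred work ran. */
static int kbd_events;
static void sample_kbd_bh(void)
{
    kbd_events++;
}

/* Register the routine for bottom half number nr. */
void model_init_bh(int nr, void (*routine)(void))
{
    assert(nr >= 0 && nr < NR_BH);
    bh_base[nr] = routine;
}

/* Called from the top half: cheap, it only sets a bit. */
void model_mark_bh(int nr)
{
    assert(nr >= 0 && nr < NR_BH);
    bh_active |= 1UL << nr;
}

/* Run every pending bottom half once; returns how many routines ran. */
int model_do_bottom_half(void)
{
    unsigned long active = bh_active;
    int nr, ran = 0;

    bh_active = 0;                     /* new marks are kept for the next round */
    for (nr = 0; nr < NR_BH; nr++) {
        if ((active & (1UL << nr)) && bh_base[nr] != NULL) {
            bh_base[nr]();
            ran++;
        }
    }
    return ran;
}
```

Marking the same bottom half twice before it runs still results in a single call, which is why the routine has to process everything that accumulated since its last run.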
244
Bottom Half Handling (Cont`)

timer handling
 to handle jobs that must be invoked at a specific time
struct timer_struct {
unsigned long expires;
void (*fn)(void);
} timer_table[];
init_timer()
add_timer()
del_timer()
245
Network in Linux

Network implementation
 one of the basic demands of an operating system
 applications

ftp, telnet, rlogin, NFS, e-mail, News
 protocol

TCP/IP, OSI, IPX (developed by Novell), SNA, appletalk, X.25
 devices

Ethernet(eth0, eth1), SLIP(sl0), PLIP (plip0)
246
Socket interface

Socket interface /* net/socket.c */
 virtual interface
 to support various protocol families

UNIX, INET, X25, IPX, APPLETALK, …
 to support various socket types

Stream, Datagram, Raw, Reliable Delivered Message, ...
 socket(), bind(), connect(), listen(), accept()
 read(), write()
 send(), sendto(), recv(), recvfrom()
247
Layer model

layer structure of a network
BSD socket
INET socket
TCP
UDP
IP
PLIP
SLIP
parallel
port
serial
port
ETHERNET
Ethernet
card
248
ARP
Layer model (Cont`)

Encapsulation
data
TFTP
data
header
TFTP message
UDP
header
Ethernet
header
TFTP
data
header
UDP message
IP
header
UDP
TFTP
header
header
IP packet
data
IP
header
UDP
header
data
TFTP
header
Ethernet
trailer
Ethernet frame
 Details of each structure can be found in “The LINUX NETWORK” and
“UNIX network programming”
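Encapsulation can be done without copying the payload if the buffer reserves room in front of the data and each layer prepends its header there. The sketch below models that idea in user space; struct model_skb and the model_skb_ functions are hypothetical and much simpler than the kernel's struct sk_buff, and the short strings stand in for real protocol headers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 256

struct model_skb {
    unsigned char buf[BUF_SIZE];
    unsigned char *data;   /* start of the valid bytes */
    unsigned char *tail;   /* one past the last valid byte */
};

/* Start with an empty buffer and leave headroom for later headers. */
void model_skb_init(struct model_skb *skb, size_t headroom)
{
    assert(skb != NULL && headroom <= BUF_SIZE);
    skb->data = skb->tail = skb->buf + headroom;
}

/* Append payload at the tail; returns 0 on success, -1 if it does not fit. */
int model_skb_put(struct model_skb *skb, const void *src, size_t len)
{
    if ((size_t)(skb->buf + BUF_SIZE - skb->tail) < len)
        return -1;
    memcpy(skb->tail, src, len);
    skb->tail += len;
    return 0;
}

/* Prepend a header in front of the current data; the payload stays in place. */
int model_skb_push(struct model_skb *skb, const void *hdr, size_t len)
{
    if ((size_t)(skb->data - skb->buf) < len)
        return -1;                 /* not enough headroom left */
    skb->data -= len;
    memcpy(skb->data, hdr, len);
    return 0;
}

/* Number of valid bytes currently in the buffer. */
size_t model_skb_len(const struct model_skb *skb)
{
    return (size_t)(skb->tail - skb->data);
}
```

Putting "data" into a buffer with 8 bytes of headroom and then pushing "UDP:" and "IP:" leaves the 11 bytes "IP:UDP:data" in the buffer, with the payload never moved.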
249
Layer model (Cont`)

Details of TCP/IP protocol
Ethernet frame
Destination ethernet
address
Source ethernet
address
Protocol
Data
Checksum
IP packet
Length
Protocol
Checksum
Source IP
address
Destination IP
address
Data
TCP message
Source TCP
address
Destination
TCP address
250
SEQ
ACK
Data
Important data structure

important data structure
VFS layer
struct file_operations
BSD socket layer
struct net_proto_family /* include/linux/net.h */
struct socket /* include/linux/net.h */
/* include/linux/fs.h */
inet layer
struct sock /* include/net/sock.h */
struct proto_ops /* include/linux/net.h */
transport layer
struct tcp_opt /* include/net/sock.h */
struct proto /* include/net/sock.h */
network layer
struct tcp_func /* include/net/tcp.h */
struct packet_type /* include/linux/netdevice.h */
device layer
struct device /* include/net/netdevice.h */
251
struct sk_buff /* include/linux/skbuff.h */
Important data structure (cont`)

socket data structure
task
….
fd[]
….
/* include/linux/net.h */
file
….
f_dentry
….
f_pos
f_op
dentry
d_inode
struct socket {
state
type
flags
ops /* proto_ops */
sk /* struct sock */
files, inodes
next, wait, ….
}
INET, UNIX, IPX, X25, ..
252
/* include/net/sock.h */
struct sock {
...
}
/* include/linux/net.h */
struct proto_ops {
family
dup, release,
bind, connect,
accept, listen,
...
getsockopt
setsockopt
sendmsg
recvmsg
}
/* for INET operation */
Important data structure (cont`)

sock data structure
/* include/net/sock.h */
struct tcp_opt {
tcp_header_leng
rcv_next, snd_next,
/* sequence, error
handling information */
….
tcp_func
...
}
/* include/net/tcp.h */
struct tcp_func {
queue_xmit
send_check
….
}
/* for IP operation */
/* include/net/sock.h */
struct sock {
next, prev
daddr, dport
rcv_saddr, sport
...
rmem_alloc
receive_queue /* sk_buff */
wmem_alloc
send_queue
...
pair /* struct sock */
proto /* struct proto */
tp_pinfo
dst_cache /* struct dst_entry */
...
}
253
/* include/net/sock.h */
struct proto {
next, prev
close, bind, retransmit
connect, accept
…
sendmsg, recvmsg
…
name
}
/* for TCP or UDP operations */
/* include/net/dst.h */
struct dst_entry {
next
….
struct device *dev;
struct hh_cache *hh;
(*input)
(*output)
…
}
/* for device operation */
Important data structure (cont`)

network device data structure
/* include/net/sock.h */
struct sock {
...
dst_cache
...
}
/* include/linux/netdevices.h */
struct hh_cache {
hh_refcnt
hh_type
hh_output
…
}
/* for abstract device
operation */
/* include/net/dst.h */
struct dst_entry {
….
*dev;
*hh;
(*input)
(*output)
…
}
254
/* include/linux/netdevices.h */
struct device {
name
mem_end, mem_start
base addr /* port number */
irq
…
init, destructor
….
device_addr
Qdisc /* sk_buff */
….
open, stop
hard_start_xmit, hard_header
...
}
/* for actual network device operation */
Important data structure (cont`)

sk_buff data structure
 for virtual copy
struct sock
/* include/linux/skbuff.h */
struct sk_buff {
next, prev
struct sock *sk;
….
dev
/* TP layer header */
union { th, uh, icmph, …} h;
/* Network layer header */
union { iph, ipv6h, arph, ..} nh
/* Data Link header */
union { ethernet, raw} mac;
struct dst_entry *dst;
…
data, head, tail, len
…
}
sk_buff
headers
data
...
sk_buff
headers
data
...
struct device
sk_buff
headers
data
...
255
Socket Create

socket create
/* include/linux/socket.h */
AF_UNIX, AF_INET, AF_IPX, ...
/* net/socket.c */
sys_socket(family, type, protocol)
SOCK_STREAM, SOCK_DGRAM, SOCK_RAW, ...
sock_create()
sock_alloc()
net_families[family]->create()
256
/* include/linux/net.h */
struct socket {
state
type
flags
ops /* proto_ops */
sk /* struct sock */
files, inodes
next, wait, ….
}
Socket Create (cont`)

protocol family registration
 family
/* include/linux/socket.h */
AF_UNIX, AF_INET, AF_IPX, ...
 registration
/* net/ipv4/af_inet.c */
struct net_proto_family inet_family_ops = {
PF_INET,
inet_create
}
/* include/linux/net.h */
struct net_proto_family {
family
create()
authentication
encryption, encrypt_net
}
struct net_proto_family *net_families[];
/* net/socket.c */
sock_register(net_proto_family *ops)
{
...
net_families[ops->family] = ops;
}
inet_proto_init()
{
…
sock_register(&inet_family_ops)
...
}
/* net/unix/af_unix.c */
/* net/ipx/af_ipx.c */
257
Socket Create (cont`)

/* include/linux/socket.h */
AF_UNIX, AF_INET, AF_IPX, ...
socket create
/* net/socket.c */
sys_socket(family, type, protocol)
SOCK_STREAM, SOCK_DGRAM, SOCK_RAW, ...
sock_create()
sock_alloc()
net_families[family]->create()
unix_create()
/* include/net/sock.h */
struct sock {
...
prot
net_pinfo
tp_pinfo
socket
sk_buff
….
}
/* include/linux/net.h */
struct socket {
state
type
flags
ops /* proto_ops */
sk /* struct sock */
files, inodes
next, wait, ….
}
inet_create()
sk_alloc()
switch (type)
sock->ops=&inet_stream_ops
or sock->ops=&inet_dgram_ops
…
sk->prot = &tcp_prot
258
Socket Create (cont`)

socket create
/* net/socket.c */
sys_socket(family, type, protocol)
sock_create()
get_fd()
get_empty_filp()
file->f_op=&socket_file_ops
associate d_inode with socket structure
/* net/socket.c */
struct file_operations socket_file_ops = {
sock_lseek
sock_read
sock_write
NULL /* readdir */
sock_poll
sock_ioctl
NULL /* mmap */
sock_no_open
NULL /* flush */
sock_close
NULL /* fsync */
sock_fasync
}
259
Socket Create (cont`)

after socket creation
task
….
fd[]
….
file
….
f_dentry
….
f_pos
f_op
/* include/linux/net.h */
struct socket {
state
type
flags
ops /* proto_ops */
sk /* struct sock */
files, inodes
next, wait, ….
}
dentry
VFS layer
d_inode
INET layer
/* include/net/sock.h */
struct sock {
next, prev
daddr, dport
rcv_saddr, sport
...
rmem_alloc
receive_queue /* sk_buff */
wmem_alloc
send_queue
...
pair /* struct sock */
prot /* struct proto */
tp_pinfo
dst_cache /* struct dst_entry */
...
}
TCP layer
IP layer
260
Driver
layer
Send Data

sending data through socket
 compare with the FS control flow, which is a piece of cake by comparison
/* net/ipv4/af_inet.c */
struct proto_ops inet_stream_ops = {
PF_INET
sock_no_dup
inet_release
inet_bind
inet_stream_connect
sock_no_socketpair
inet_accept
inet_getname
inet_poll
inet_ioctl
inet_listen
inet_shutdown
inet_getsockopt
inet_setsockopt
sock_no_fcntl
inet_sendmsg
inet_recvmsg
}
/* fs/read_write.c */
sys_write()
f->f_op->write
/* net/socket.c */
sock_write()
socki_lookup(d_inode)
make msg
sock_sendmsg()
sock->ops->sendmsg
/* net/ipv4/af_inet.c */
inet_sendmsg()
sk->prot->sendmsg
261
Send Data (cont`)

sending data through socket
/* net/ipv4/af_inet.c */
inet_sendmsg()
sk->prot->sendmsg
/* net/ipv4/tcp.c */
tcp_v4_sendmsg()
tcp_do_sendmsg()
copy data from user to sk_buff
/* net/ipv4/tcp_output.c */
tcp_send_skb()
tcp_transmit_skb()
make tcp header
sk->tp_pinfo.af_tcp.af_specific
->queue_xmit(skb)
262
/* net/ipv4/tcp_ipv4.c */
struct proto tcp_prot = {
next, prev
tcp_close
tcp_v4_connect
tcp_accept
NULL /* retransmit */
tcp_write_wakeup
tcp_read_wakeup
tcp_poll
tcp_ioctl
tcp_v4_init_sock
tcp_v4_destroy_sock
tcp_shutdown
tcp_getsockopt
tcp_setsockopt
tcp_v4_sendmsg
tcp_recvmsg
…
“TCP”
...
}
Send Data (cont`)

sending data through socket
/* net/ipv4/tcp_output.c */
tcp_transmit_skb()
sk->tp_pinfo.af_tcp.af_specific
->queue_xmit(skb)
/* net/ipv4/ip_output.c */
ip_queue_xmit()
build IP header
fragment handling
call ip_route_output()
/* dst_cache.output =
ip_output in ip_route_output */
sk->dst_cache->output()
/* net/ipv4/ip_output.c */
ip_output()
ip_finish_output(skb)
263
/* net/ipv4/tcp_ipv4.c */
struct tcp_func ipv4_specific = {
ip_queue_xmit
tcp_v4_send_check
tcp_v4_rebuild_header
tcp_v4_conn_request
tcp_v4_sync_recv_sock
tcp_v4_get_sock
sizeof(struct iphdr)
ip_setsockopt
ip_getsockopt
v4_addr2sockaddr
sizeof(struct sockaddr_in)
}
sk_alloc() => tcp_v4_init_sock()
tcp_v4_init_sock() {
…
sk->tp_pinfo.af_tcp.af_specific=&ipv4_specific
..
}
Send Data (cont`)

/* include/linux/netdevices.h */
struct hh_cache {
hh_refcnt
hh_type
hh_output
…
}
sending data through socket
/* include/net/ip.h */
ip_finish_output()
hh->hh_output(skb)
/* net/core/dev.c */
dev_queue_xmit()
hh->output =
neigh_ops->output =
dev_queue_xmit
/* net/ipv4/arp.c*/
input pkt into dev->qdisc
dev->hard_start_xmit()
/* driver/net/3c509.c */
el3_start_xmit()
make ethernet frame
send frame using inb(), outb(), ...
init_module() in 3c509
/* driver/net/3c509.c*/
init port, irq, …
make dev structure
dev->open=el3_open
dev->hard_start_xmit =
el3_start_xmit
...
264
struct device {
name
rmem_end, rmem_start
mem_end, mem_start
base addr
irq
…
init, destructor
….
device_addr
qdisc
….
open, stop
hard_start_xmit,
hard_header
...
}
Send Data (cont`)

sending data through socket
struct sock
struct device
...
qdisc
...
...
send queue
...
sk_buff
headers
data
...
sk_buff
headers
data
...
sk_buff
headers
data
...
Protocol Layer
Device Layer
265
Send Data (Cont`)

Sending all together (TCP/IP & Ethernet)
cf) compared with the control flow of FS, this is far more complex (FS is a piece of cake)
VFS
BSD socket
inet socket
TCP
/* fs/read_write.c */
sys_write()
/* net/socket.c */
sock_write()
/* net/ipv4/af_inet.c */
inet_sendmsg()
/* net/ipv4/tcp_output.c */
tcp_send_skb()
/* net/ipv4/ip_output.c */
IP
Device
ip_queue_xmit()
/* driver/net/3c509.c */
el3_start_xmit()
Linux kernel
266
Receive Data

receiving data through socket
/* net/ipv4/ip_input.c */
ip_local_deliver()
ip_forward(), ip_defrag()
skb->dst->input()
/* dst.input =
ip_local_deliver in ip_route_input() */
/* net/ipv4/ip_input.c */
ip_rcv()
make sk_buff in device structure
ptype->func()
/* net/core/dev.c */
net_bh()
/* include/linux/netdevice.h */
struct packet_type {
type
dev
func
….
}
/* net/ipv4/ip_output.c */
struct packet_type
ip_packet_type = {
ETH_P_IP, NULL,
ip_rcv,
...
}
mark_bh(NET_BH)
/* driver/net/3c509.c */
el3_interrupt()
el3_open()
….
request_irq(dev->irq, el3_interrupt
267
Receive Data (cont`)

receiving data through socket
tcp_data_queue() /* sk_buff into sk */
wake up process
tcp_data()
check consistency, …
tcp_data()
/* net/ipv4/tcp_input.c */
tcp_rcv_state_process()
call tcp_rcv_established
or call tcp_rcv_state_process
/* net/ipv4/tcp_ipv4.c */
tcp_v4_rcv()
tcp_v4_do_rcv()
ipprot->handler()
/* net/ipv4/ip_input.c */
ip_local_deliver()
268
/* include/net/protocol.h */
struct inet_protocol {
handler
err_handler
...
name
}
/* net/ipv4/protocol.c */
struct inet_protocol
tcp_protocol {
tcp_v4_rcv
tcp_v4_err
….
TCP
}
Receive Data (cont`)

receiving data through socket
/* fs/read_write.c */
sys_read()
f->f_op->read
/* net/socket.c */
sock_read()
socki_lookup(d_inode)
make msg header
sock_recvmsg()
sock->ops->recvmsg
/* net/ipv4/af_inet.c */
inet_recvmsg()
sk->prot->recvmsg
/* net/ipv4/tcp.c */
tcp_recvmsg()
add_wait_queue(sk->sleep, {current, NULL})
269
tcp_data()
Receive Data (cont`)

Receiving all together (TCP/IP & Ethernet)
/* fs/read_write.c */
sys_read()
VFS
/* net/socket.c */
sock_read()
BSD socket
/* net/ipv4/af_inet.c */
inet_recvmsg()
inet socket
TCP
/* net/ipv4/tcp.c */
tcp_recvmsg()
wake up
/* net/ipv4/tcp_input.c */
tcp_rcv_state_process()
/* net/ipv4/ip_input.c */
sleep
IP
ip_rcv()
/* net/core/dev.c */
/* driver/net/3c509.c */
Device
Linux kernel
el3_interrupt()
270
net_bh()
Conclusion in Network

Add new features
/* fs/read_write.c */
sys_write()
/* net/socket.c */
sock_write()
/* net/ipv4/af_inet.c */
inet_sendmsg()
secure_tcp()
/* net/ipv4/tcp_output.c */
tcp_send_skb()
/* net/ipv4/ip_output.c */
ip_queue_xmit()
compress_net()
virtual_ip()
/* driver/net/3c509.c */
el3_start_xmit()
Linux kernel
271
Conclusion of Linux

an abstraction is just a set of data structures at the kernel level
 process
struct task_struct /* include/linux/sched.h */
struct user /* include/asm-i386/user.h */
 memory
struct vm_area_struct /* include/linux/sched.h, include/asm-i386/page.h */
 file
struct file, struct inode /* include/linux/fs.h, ext2_fs_i.h */
 file system
struct super_block /* include/linux/fs.h, */
 buffer
struct buffer_head /* include/linux/fs.h */
 device driver
struct device_struct /* fs/devices.c, driver/* */
 IPC /* include/linux/ipc.h, sem.h, msg.h, shm.h */
 TCP/IP /* include/linux/tcp.h, ip.h */
272
Related documents